A Novel Traffic Analysis for Identifying Search Fields in the Long Tail of Web Sites
George Forman, Evan Kirshenbaum, Shyamsundar Rajaram
HP Laboratories
HPL-2010-27

Keyword(s): web data mining, clickstream analysis, machine learning classification, active learning

Abstract: Using a clickstream sample of 2 billion URLs from many thousand volunteer Web users, we wish to analyze typical usage of keyword searches across the Web. In order to do this, we need to be able to determine whether a given URL represents a keyword search and, if so, which field contains the query. Although it is easy to recognize 'q' as the query field in 'http://www.google.com/search?hl=en&q=music', we must do this automatically for the long tail of diverse websites. This problem is the focus of this paper. Since the names, types and number of fields differ across sites, this does not conform to traditional text classification or to multi-class problem formulations. The problem also exhibits highly non-uniform importance across websites, since traffic follows a Zipf distribution. We developed a solution based on manually identifying the query fields on the most popular sites, followed by an adaptation of machine learning for the rest. It involves an interesting case-instances structure: labeling each website case usually involves selecting at most one of the field instances as positive, based on seeing sample field values. This problem structure and soft constraint, which we believe has broader applicability, can be used to greatly reduce the manual labeling effort. We employed active learning and judicious GUI presentation to efficiently train a classifier with accuracy estimated at 96%, beating several baseline alternatives.

External Posting Date: February 21, 2010
Approved for External Publication
Internal Posting Date: February 21, 2010
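To make the problem statement in the abstract concrete, the following minimal Python sketch splits a URL into its query-string fields and looks up the query field in a small hand-labeled table, in the spirit of the manual labeling described for the most popular sites. It is an illustration under stated assumptions, not the authors' implementation: the names `url_fields`, `extract_query` and `KNOWN_QUERY_FIELDS` are hypothetical, and the table contains only the single Google example given in the abstract.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical hand-labeled table: host -> name of the field holding the query.
# Only the Google entry comes from the abstract; a real table would cover
# the most popular sites.
KNOWN_QUERY_FIELDS = {"www.google.com": "q"}


def url_fields(url):
    """Return (host, [(field_name, value), ...]) for the URL's query string."""
    parts = urlsplit(url)
    return parts.hostname, parse_qsl(parts.query, keep_blank_values=True)


def extract_query(url):
    """Return the search query if the host has a labeled query field, else None.

    Hosts missing from the table are the long tail, which the paper handles
    with a trained classifier rather than a lookup.
    """
    host, fields = url_fields(url)
    wanted = KNOWN_QUERY_FIELDS.get(host)
    if wanted is None:
        return None
    for name, value in fields:
        if name == wanted:
            return value
    return None


if __name__ == "__main__":
    print(url_fields("http://www.google.com/search?hl=en&q=music"))
    print(extract_query("http://www.google.com/search?hl=en&q=music"))
```

The lookup answers only for sites that were labeled by hand; for any other host it returns `None`, which is the gap the paper's machine learning step is meant to fill.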
To be published and presented at the 19th International World Wide Web Conference (WWW 2010), Raleigh, NC, April 26-30, 2010. http://www2010.org
© Copyright the 19th International World Wide Web Conference (WWW 2010).

A Novel Traffic Analysis for Identifying Search Fields in the Long Tail of Web Sites
George Forman (ghforman@hpl.hp.com), Evan Kirshenbaum (evan.kirshenbaum@hp.com), Shyamsundar Rajaram (shyam.rajaram@hp.com)
HP Labs, 1501 Page Mill Rd., Palo Alto, CA, 94304 USA

ABSTRACT
1