WIRE: an Open Source Web Information Retrieval Environment

Carlos Castillo
Center for Web Research, Universidad de Chile
carlos.castillo@upf.edu

Ricardo Baeza-Yates
Center for Web Research, Universidad de Chile
ricardo.baeza@upf.edu

Abstract

In this paper, we describe the WIRE (Web Information Retrieval Environment) project and focus on some details of its crawler component. The WIRE crawler is a scalable, highly configurable, high performance, open-source Web crawler which we have used to study the characteristics of large Web collections.

1. Introduction

At the Center for Web Research (http://www.cwr.cl/) we are developing a software suite for research in Web Information Retrieval, which we have called WIRE (Web Information Retrieval Environment). Our aim is to study the problems of Web search by creating an efficient search engine. Search engines play a key role on the Web, as searching currently generates more than 13% of the traffic to Web sites [1]. Furthermore, 40% of the users arriving to a Web site for the first time clicked a link from a search engine's results [14].

The WIRE software suite generated several sub-projects, including some of the modules depicted in Figure 1. So far, we have developed an efficient general-purpose Web crawler [6], a format for storing the Web collection, a tool for extracting statistics from the collection and generating reports and a search engine based on SWISH-E using PageRank with non-uniform normalization [3].

Figure 1. Some of the possible sub-projects of WIRE, highlighting the completed parts.

In some sense, our system is aimed at a specific segment: our objective was to use it to download and analyze collections having in the order of 10^6 ...

... of documents (e.g.: process some data structures on disk instead of in main memory). Currently, the crawler is parallelizable, but unlike [8], it has a central point of control.

Configurable and open-source: Most of the parameters for crawling and indexing can be configured, including several scheduling policies. Also, all the programs and the code are freely available under the GPL license.

The details about commercial search engines are usually kept as business secrets, but there are a few examples of open-source Web crawlers, for instance Nutch (http://lucene.apache.org/nutch/). Our system is designed to focus more on evaluating page qual-