Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests.

Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.

1. INTRODUCTION

A recent Pew Foundation study [31] states that “Search engines have become an indispensable utility for Internet users” and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction f