abstract architectural design and evaluation of an efficient web-crawling system

abstract architectural design and evaluation of an efficient web-crawling system

ID:34043864

大小:195.49 KB

页数:9页

时间:2019-03-03

上传者:xinshengwencai
abstract architectural design and evaluation of an efficient web-crawling system_第1页
abstract architectural design and evaluation of an efficient web-crawling system_第2页
abstract architectural design and evaluation of an efficient web-crawling system_第3页
abstract architectural design and evaluation of an efficient web-crawling system_第4页
abstract architectural design and evaluation of an efficient web-crawling system_第5页
资源描述:

《abstract architectural design and evaluation of an efficient web-crawling system》由会员上传分享,免费在线阅读,更多相关内容在教育资源-天天文库

TheJournalofSystemsandSoftware60(2002)185–193www.elsevier.com/locate/jssArchitecturaldesignandevaluationofanefficientWeb-crawlingsystemHongfeiYan*,JianyongWang,XiaomingLi,LinGuoComputerNetworksandDistributedSystemsLaboratory,DepartmentofComputerScienceandTechnology,PekingUniversity,Beijing100871,PRChinaReceived1June2000;receivedinrevisedform1December2000;accepted1March2001AbstractThispaperpresentsanarchitecturaldesignandevaluationresultofanefficientWeb-crawlingsystem.Thedesigninvolvesafullydistributedarchitecture,aURLallocatingalgorithm,andamethodtoassuresystemscalabilityanddynamicreconfigurability.Simulationexperimentshowsthatloadbalance,scalabilityandefficiencycanbeachievedinthesystem.CurrentlythisdistributedWeb-crawlingsubsystemhasbeensuccessfullyintegratedwithWebGather,awell-knownChineseandEnglishWebsearchengine,aimedatcollectingalltheWebpagesinChinaandkeepingpacewiththerapidgrowthofChineseWebinformation.Inaddition,webelievethatthedesigncanalsobeusefulinothercontextsuchasdigitallibrary,etc.Ó2002ElsevierScienceInc.Allrightsreserved.Keywords:WorldWideWeb;Web-crawling;Scalability;Reconfigurability;Searchengine1.Introductionthan30,000querieseveryday,adoptedacentralizedarchitecturetocollectWebpages(i.e.,asinglemainDuringtheshorthistoryoftheWorldWideWebprocesscoordinatesmanycrawlerstoworkinparallel),(Web),internetresourcesgrowdaybydayandtheandonemillionpageindicesaremaintainedafterthenumberofhomepagesincreasesrapidly.Howtopagesarecrawledandparsed.WiththecapabilityofquicklyandaccuratelyfindwhatyouneedintheWeb?crawling100,000pagesaday,WebGather1.0takesSearchengineisausefultoolanditbecomesmoreandabouttendaystorefreshthewholeWebpagesithosts.moreimportant.ThenumberofindexedpagesisaWenotethat(Google,2000),bornofStanfordUni-primaryindicatorforthecapabilityofasearchengine.versity,couldindex560millionpagesinJuly2000Byindexingalargernumberofpagesvisited,asearch(Sullivan,2000).Thecentralizedversion,WebGatherengineisabletosatisfyusers’requestsinabetterlight.1.0,isincompetenttoupdatethedatabaseinareason-Besides,anotherimportantfactorforaWebsearchableperiodoftime.Forexample,withthecrawlingsystemisthefreshnessofthepagesitindexes.WehopeacapabilityofWebGather1.0,itwilltake100daystosearchenginereflectsthechangesintheWebinatimelycollect10millionpages.Becausepagesareoftenre-way.Morespecifically,asearchengineshouldbeabletofreshed,someofthecollectedpageswilllosetheirvalue.collectenoughpagesinalimitedtimeframe,say30Ofcourse,itisquitelikelytoacceleratetheperformancedays.Thus,theefficiencyincollectingpagesisessentialofthesystembyimprovingcrawlingalgorithm,adopt-foraqualitysearchengine.ingmorepowerfulmachinesandhighernetworkband-Itisnaturaltothinkof‘‘parallelprocessing’’orwidth.DuetotheexponentialincreaseofWebpages,‘‘paralleldistributedsystem’’whentalkingabouteffi-theaboveapproachesarenotgoodenoughtocopewithcientlyexecutingtaskswithlargedataset.Previously,theever-increasingpages.Soadoptingparallelprocess-WebGather1.0(Liuetal.,2000),whichanswersmoreingtechnologytocollectmorepagesinalimitedtimeframeisessentialindevelopingalarge-scalesearchengine.*ThispaperprimarilyconcernsdesignofaparallelandCorrespondingauthor.Tel.:+86-10-62751797e8001;fax:+86-10-62765813.distributedschemetoachievethedesigngoal.WewillE-mailaddress:yhf@net.cs.pku.edu.cn(H.Yan).presentanarchitectureandproposemethodsofcol-0164-1212/02/$-seefrontmatterÓ2002ElsevierScienceInc.Allrightsreserved.PII:S0164-1212(01)00091-7 186H.Yanetal./TheJournalofSystemsandSoftware60(2002)185–193lectingpagesontheWebforadistributedsearchenginehastwodistinctsystemfeatures.Foronething,itmakessystem.BasedonWebGather1.0anditslogdata,wefulluseoflinkstructuretoimproverankingquality.Onhavedesignedandimplementedanexperimentmodeltotheotherhand,itextractsinformationfromanchortextvalidatethearchitecture,designideasandmethods.ThethatcanhelpindextheWebpagesthathavenotactuallyresultisbeingusedintheconstructionofWebGatherbeencrawled.Thissecondfeaturegivesitcoverageof2.0.over1.2billionWebpages(Sullivan,2000).Neverthe-less,itadoptsacentralizedarchitectureforpagecrawling.Althoughmultiplecrawlerscanbedispatched,2.RelatedworkitsURLservertendstobecomeapotentialcentralizedperformancebottleneck,whichimpedesitsscalability.2.1.Harvest:atypicaldistributedarchitectureBasedonthepaper(BrinandPage,1998),welearnthatGooglehasonlyoneURLserverwhichsendslistsofHarvest(Bowmanetal.,1995)isatypicalsystemURLstobefetchedtothecrawlers.TheURLserverisathatmakesuseofdistributedmethodstocollectandsinglepointoffailure,soifitcrashes,theentiresystemindexWebpages.Harvestismadeofseveralsubsys-maygodown.Inaddition,inalargesystem,acentral-tems.TheGatherersubsystemcollectsindexinginfor-izedcomponentlikeURLservermaybecomeaperfor-mation(suchaskeywords,authornames,andtitles,mancebottleneck.etc.)fromtheresourcesavailableatProvidersites(suchasFTPandHTTPservers).TheBrokersubsystem2.3.OurworkretrievesindexinginformationfromoneormoreGatherers,eliminatesduplicateinformation,incremen-Integratedwithcharacteristicsofsearchenginesandtallyindexesthecollectedinformation,andprovidesabasedonWebGather1.0,whichusesacentralizedar-Webqueryinterfacetoit.TheReplicatorsubsystemchitecture,wedesignedanarchitectureandstrategytoefficientlyreplicatesBrokersaroundtheWeb.UserscollectWebpages.ThisdesignisbeingusedinthecanefficientlyretrievelocatedinformationthroughtheconstructionofWebGather2.0,whichemploysadis-Cachesubsystem.TheHarvestServerRegistryisatributedarchitecture.Basedonsimulationdata,wedistinguishedBrokerthatholdsinformationabouteachfoundthatourmethodcanavoidthehighadministra-HarvestGatherer,Broker,Cache,andReplicatorinthetioncostsofsettinguplargenumberofinstallationsofaWeb.HarvestprovidesadistributedarchitecturetodistributedsystemlikeHarvestandovercomethedis-gatherandsearchinformationontheWeb,anditisadvantagesofacentralizedsystem.Thenweimple-worthstudyingandlearning.However,Harvestisamentedourstrategiesontheactualsystem.OurfinalcomplicatedsystemwithcomplexalgorithmsandhugegoalistocollectalltheWebpagesonChinaandkeepexpenseswhichhinderitspopularity.Asfarascol-pacewiththerapidgrowthofChineseWebinformationlectingWebpagesisconcerned,Harvestdoesnotad-usingtheversion2.0ofWebGather.Thearchitecturedresssomeimportantissuesthatalarge-scalesearchdescribedhereisnotonlysuitablefordesigningandenginemustface.implementingasearchengine,butalsousefulinbuilding1.Asearchenginehastomeetcertainrequirementonadigitallibrary.therateofcollectingWebpages.Harvestdoesnotconsiderthisaspect.2.TheGathererofHarvestwillhavegoodeffectifit3.AmodelforadistributedWeb-crawlingsystemrunsonthemachinesofProviders.However,itisim-possibletomakeeveryProvidertodoso.3.1.Terminology3.AGathererwilldiscardURLsthatcouldnotbevis-itedbyitself.However,otherGatherersmaybeableForclarity,wefirstdefinethespecialtermsusedintovisittheseURLs.SoHarvestsystemdoesnotre-thefollowingsections.solvehowtousecrossURLs.Main-controlleristheprocessthatmanagesmultiple4.HarvesthaslesseffectivecontrolsonitsGatherersgathererstofetchpagesfromtheWeb.Itsmainfunctionwhenitisusedtocollectinformationwithinapartic-istoassignURLstogatherers,andsaveabstractsofularscope.Forexample,itshouldbedemandedtoWebagesreturnedbygatherers.Itisthecorepartofabidebythe‘‘WebRobot’’protocolandhavesomeWebGatherandeachworkstationrunsonemain-con-guidancetocrawling.trollerinourdistributedsystem.Coordinatoristhemodulethatcoordinatesallmain-2.2.Google:atypicalcentralizedarchitecturecontrollersinthedistributedsystem.Itrunsononeoftheworkstations.Google,whichhasindexed602millionWebpages,isCrossURListhereferenceURLthatpointstootheroneofthemostpowerfulsearchenginesintheworld.ItWebpageinavisitedWebpage. H.Yanetal./TheJournalofSystemsandSoftware60(2002)185–1931873.2.Designgoals2.Thesecondstrategyistohavethemain-controllersinonelocation,connectedviaaLAN.Withthisar-BasedonIPblocksallocatedinsideChina(CERNIC,rangement,IPaddressesaredividedintodifferent2000),WebGather’sgoalistocollectalltheWebpagesparts,byhandlingeveryURLwithhashfunction,inChina.andthenallocatethemtomain-controllers.SoeveryAccordingtoastatisticalreport(CNNIC,2000),main-controlleronlychargesincollectingURLsChinahasabout1500WebsitesinOctober1997,aboutwithinitsownscope.Whenamain-controllergetsa3700inJuly1998,about5300inJanuary1999,aboutcrossURLnotbelongingtoitsownscope,toavoid9906inJuly1999,about15,153inJanuary2000,losinginformation,themain-controllershouldtrans-about27,289inJuly2000.AnditisexpectedtobemittheURLtoamain-controllerwhichisresponsi-40,000WebsitesinDecember2000.bleforit.Eachmain-controllergetsURLsthroughInktomiandtheNECResearchInstitute,havethehashfunctionH(URL)¼(DNS(URL’shostcompletedanewstudythatverifiestheWebhasgrownpart))MODn,inwhichndenotesthenumberoftohavemorethanonebillionuniquepagesdistributedmain-controllers,andDNS(URL’shostpart)denoteson4,217,324Websites(Inktomi,2000).TheneverysitethesumofintegerpartsofIPwhichcomesfromthewillhostabout238uniquepagesinaverage.Thenum-resolutionofthehostpartofaURL,ortheintegerberisrelativelystable.Therefore,WebGatherneedstodirectlyfromaURLstringtransformingwithoutcollectabout9,520,000Webpages.Takingtendaysasaresolution.crawlingcycle,withthecurrentcollectingrateofWeb-Gather,itneedsatleasttenworkstationstocooperate.3.3.2.Communicationstrategiesamongmain-controllersBesidestheabovegoal,weexpectthedistributedsystemWeconsideredthefollowingtwokindsofcommu-tohavethefollowingcharacteristics.nicationstrategiesamongmain-controllers:1.Dynamicloadbalance,i.e.,eachworkstationtakes1.Ringconnection:thereisaninterconnectionbetweenaboutthesameworkloadintheruntime.twoadjacentmain-controllers,formingaring.Cross2.Lowamountofcommunicationbetweenmain-con-URLs’transmissioncanbeclockwiseoranti-clock-trollers,reducingtrafficamongtheworkstations.wise.3.Highscalability,i.e.,tocertainrange,theperfor-2.Completeconnection:thereisdirectconnectionbe-mance(numberofpagesthatcanbefetchedaday)tweenanytwomain-controllers,formingafullycon-scalesasweaddmoreworkstations(andmain-con-nectedgraph.CrossURLs’transmissioncanbetrollers)tothesystem.directlycompletedfromonetoanother.4.Dynamicreconfigurability,i.e.,tobeabletoaddorremovemain-controllersintheruntime.Thisis3.3.3.Analyzingdistributionandcommunicationstrate-usefulinthecaseoffailureofworkstations,sincegiesthegatheringprocessofarealsearchenginetakesTalkingaboutdistributionstrategies,theprerequi-daystocompleteandthefailuresometimesissiteforthefirstdistributionmethodistoallocateallunavoidable.theWebsitesinChinatodifferentmain-controllersoptimally.ItmeanstoallocateWebsitesintermsof3.3.ThearchitectureandprimarydesignissuesandthecommunicationtimebetweenaWebsiteandanyapproachestothemmain-controller.First,weshouldcollectWebsitesasmanyaspossible,andassignthemtoeachmain-con-Toachieveourgoal,weconsidertwokindsofdis-trollerbythecoordinator.Aftermeasuringcommuni-tributionstrategiesandtwokindsofcommunicationcationtime,were-allocatetheWebsitesagainstrategies.Afteranalyzingtheirmeritanddefect,wegetaccordingtotheresult.Atthattimethesystemisinanthefinaldecision.optimizedstate.Second,whenmain-controllersbeginworking,ifamain-controllerfindsanewWebsite,it3.3.1.DistributionstrategiesshouldtransmittheWebsitetothecoordinatorwhichSincetheperformancebottleneckofasearchenginechargesinallocatingthenewWebsitetoeverymain-liesinthenetworkbandwidthandthecapacityofacontrollerintermsofmeasuringcommunicationtime,singleworkstation,distributedarchitectureshouldbethenthemain-controllerwiththeshortesttimegetstheintroduced.ThefollowingtwokindsofdistributionprivilegeofcollectingtheWebsite.However,becausestrategiescanbeconsidered.allmain-controllersareplacedindifferentplaces,alot1.Main-controllersarephysicallydistributedingeo-oftroublewillcome,suchasdebuggingandmain-graphicallydifferentplaces.Supposeadistributedtainingthesubsystems,analyzingdataaftercollectionsystemincludingfourmain-controllers,forChina,andsoon.ThesecondmethodisalittleslowerthanandtheycanbeplacedinShengyang,Beijing,Wuhanthefirstoneincollectingspeed,butthesecondisandGuangzhou.simplewheninitializingthesystemanddoesnothave 188H.Yanetal./TheJournalofSystemsandSoftware60(2002)185–193anyshortcomingsofthefirst.Additionally,through3.5.Keytechniquesandtheirrolesinthesystemreasonabletuninghashfunctionthataffectstheallo-catingofURLs,thesecondstrategycanapproachthe1.UsehashtablestructuretostoremassURLdata.speedofthefirst.WebGatherusesInformixdatabasetostorethevis-Asfarascommunicationstrategiesareconcerned,theitedWebpages.Intheexperiment,toimproveeffi-secondstrategyisbetterthanthefirstone.Inthefirstciency,wetakehashtablestructuretoproduceastrategy,eventhoughitissimplewhilebeinginitialized,fileinsteadofadatabasetostoreURLs.InthistherearepossibilitiesoftransmittingcrossURLstimeway,themain-controllersonlyaccessdatabaseinaftertime.So,abigamountofcommunicationisafatalalimitedresource.However,itsdrawbackistheshortcoming.Thesecondstrategyofmain-controllerspotentialinconsistencyofdataafterthemachinecommunicationhasanevidentadvantage:transmittingfails.Itneedsvastworktomaintainthedatacon-URLsquicklyduetothewholedistributedsysteminasistency.Fromtheviewofdataconsistency,com-LAN,andachievingloadbalanceeasilybecauseanymercialdatabaseismorereliable.However,ifthetwomain-controllershaveconnectionsandanychangessystem’ssaferunningisguaranteed,aneconomicalofthewholedistributedsystemcanbeimmediatelyde-andefficientwayistousefiledirectlyinsteadoftected.database.2.Useuniformdomainnameresolutiontoensurecon-sistencyintheexperimentenvironment.Sinceitis3.4.AnanalysisofthearchitecturenormalforsomedomainnamesandWebpagestobechangedduringparsingdomainnames,itwillleadWeadopttheseconddistributionstrategyandthetodifferentresultseventhoughyourunthesamesecondmain-controllerscommunicationstrategyingroupexperiment.Inaddition,itwillbeveryslowWebGather2.0.Fig.1showsthesystemarchitecture.toparseadomainnameifitnolongerexists.SoweWebGatherserverregistry(WSR)isthecoordinatorparseallofthematonetimeandstoretheminafile.modulewhichstoresinformationincludingIPsandEachexperimenttakesthesamedataset,thusthere-portsofallregisteredmain-controllersinthesystem.sultsaremeaningfulforcomparing.Whenanymain-controller’sstatehaschanged,WSR3.TheWSRmoduleinthedistributedsystemmakeswilldelivertheupdatedinformationtoothermain-sureeachmain-controllerholdingthenewestandcontrollers.Everymain-controllerisinchargeofconsistentinformationinthesystem.Itisaprerequi-collectingWebpagesinitsownscope.Eachgatherersitetomakethesystemfeasibleandre-configurable.belongstosomecorrespondingmain-controller.Itre-ceivesURLsfromitsmain-controller,crawlstheWeb3.6.Dynamicreconfigurability:anenhancementtothepagespointedbytheURLsandtransmitsthecontenttomodelitsmain-controller.Whenanymain-controllerfindscrossURLsinthecontent,itshouldsendthemtotheTheexistenceofthemoduleofWSR(seethedetailincorrespondingmain-controllers.ToreducetheamountthedescriptionofFig.1)makesdynamicre-configura-ofcommunication,main-controllersonlysendURLstionofthesystempossible,whichguaranteeshighamongeachother.availabilityandscalability.Undertheconditionofmaintainingtheloadbalanceofthesystem,weconsiderthreefeasibleURL-allocationmethods.1.UsehashfunctiontodynamicallyallocateURLs.2.Onthebasisofthefirstmethod,eachmain-controllermaintainsatableofWebsites.Tablesareidenticalamongdifferentmain-controllers.EveryrecordinthetablecontainsaWebsite(IP)andthecorrespond-ingmain-controllercomputer’sinformation.3.Useatwo-stagelogicalmapping.First,wemapURLsintoalogicaltablebyahashfunction,andthenmapcertainpartsofthelogicaltableintodiffer-entmain-controllers.Bycomparingtheperformancesofthethreemethodswhenaddingorsubtractingonemain-controller,wecanknowwhichisthebest.LetthenumberofWebsitesbeM,theinitialnumberofmain-controllersbeN,NiandNjrepresentthetwoarbitrarymain-controllers,respec-Fig.1.ThedistributedWebGatherarchitecture.tively,NNþ1representstheconditionofaddingone H.Yanetal./TheJournalofSystemsandSoftware60(2002)185–193189main-controllerandNN1representstheconditionofsubtractingonemain-controller.Letusfirstlookatthefirstmethod.Afterthesystemisinitialized,eachmain-controllerisresponsibleforM=NWebsites.ThehashfunctionishðxÞ¼xMODN,(wherexisthesumofintegerpartsofIPorotherresultsobtainedthroughothermethods).Sotheloadbalanceofthesystemcanbeguaranteed.Wheneitheraddingorsubtractingseveralmain-controllers,Nwillchange.Asaresult,theURLspreviouslybelongingtoNimaybeal-locatedtoNj,whichwillleadtocollectsomeWebpagesrepeatedly.Toovercomethedrawbackofthefirstmethod,inthesecondmethodeachmain-controllermaintainstwoex-tratables–WebsitestableandvisitedURLtable.AlloftheWebsitestablesareidenticalbutthevisitedURLtablesaredifferentfordifferentmain-controllers.Be-causetherearelimitedWebsites,thejoinofamain-controllermayhavenotenoughWebsitestocrawl.Tomaintainloadbalanceofthesystem,wehavetoshiftsomeWebsitesplusthecorrespondinginformationinthevisitedURLtablefromtheexistingmain-controllerstothenewone.Undersuchcondition,thereisanextrastepneededaftercalculatingaURLbythehashfunc-Fig.2.Two-stagelogicalmappingofURLs:(a)theinitialstateofthetion.(1)Whenamain-controllerknowsthataURLlogicalarray;(b)thestateafteraddingonemain-controller;(c)theshouldbecrawledbyitself,itmustfirstjudgewhetherstateafterremovingonemain-controller.theIPtowhichtheURLbelongshasbeencrawledbyanothermain-controlleraccordingtotheWebsitestable,andifnot,itcandoitswork.(2)Whenamain-A[45001],A[45002];...;A[50000]tothe10thmain-controllerknowsthataURLshouldbecrawledbycontroller.anothermain-controllersay,controllerA,itmustalsoWhenaddingamain-controller,everyexistingmain-judgeifthecorrespondingIPhasbeencrawledbyacontrollershouldgivepartofitslogicalnodestothemain-controllerotherthancontrollerA,andifnot,itnewone.SopartsofAmustbechanged.Intheex-willcontinuetosendtheURLtocontrolA.Inthisample,thechangeisshownincentralpartofFig.2:n11methodwemustmaintaintheconsistencyoftheWebismadeupofwhatarechosenfromn1;n2;...;n10,re-sitestablewithindifferentmain-controllersandtransferspectively.ThecorrespondingelementsinarrayAwillsomeinformationaboutWebsitesandcorrespondingbesetto11.visitedURLswhenthenumberofmain-controllersAndwhenremovingamain-controllerfromthesys-changes.Asaresult,theamountofcommunicationtem,theremovedmain-controllershouldshareitsWebamongmain-controllerswillincrease.Wehavedecidedsitesaveragelytotheothermain-controllers.Intheex-tocalculateURLsusingalgorithmMD5,theneachample,thechangeisshowninrightpartofFig.2:n1isURLonlytakes16bytes.expandedtoincludewhatisshiftedfromn10andtheOurthirdmethodusestwo-stagelogicalmappingofcorrespondingelementsintheshiftedpartwillbesettoURLs.Hereweusearraysay,A,tostorethelogical1,n2;...;n9dothesamething,respectively.nodes.Eacharrayelement’ssubscriptrepresentsthisComparedwiththesecondone,thethirdmethodlogicnode’ssequencenumberanditscorrespondingaddsamappinganditalsoshouldmaintainthevisitedsequencenumberofmain-controllerisstoredinthisURLstableandshouldshiftsomeofthetableitemstoarrayelement.Forexample,assumeN¼10(n1asmain-othermain-controllerswhenthenumberofmain-con-controller1,n2asmain-controller2;...;n10asmain-trollerschanges.Butinthismethod,itisnotnecessarycontroller10)andM¼50;000,MisthenumberoftostoretheWebsitestable.OnlyalogicalarrayAislogicalnodes.TheinitialstateofAisshowninleftpartstoredbyeachmain-controller.Asaresult,theamountofFig.2,A[1],A[2];...;A[50000]arecalledlogicalofcommunicationamongmain-controllersdecreases.nodes.FirstwemaptheURLtothelogicalnodesbyAndtherewillnotbetheextrastepaftercalculatinghashfunction,theninthesecondstagemapA[1],URLsbythehashfunctionasinthesecondmethod.A[2];...;A[5000]tothe1stmain-controller,A[5001],Now,becausethefirstmethodissimple,weuseitinA[5002];...;A[10000]tothe2ndmain-controller,thesimulationsystem.Butthethirdmethodexcelsthe 190H.Yanetal./TheJournalofSystemsandSoftware60(2002)185–193othertwo.SowewillusethethirdmethodinWeb-0xkGather2.0.Itwillguaranteethegoodscalabilityofthexk¼P1;j¼1;2;...ð3Þj¼1xjWeb-crawlingsubsystem.Explaintheabove(1)and(2)formulas:Xisarandomvariable(ithasfinitenumber),distributionofXis4.SimulationresultsPfX¼xig¼pi,wherei¼1;2;...n.Xcanbeanyvalueofx1;x2;...;xn.ProbabilityofX’svalueisp1;p2;...;pn.InJun2000,whileWebGatherisrunning,weutilizeaFormula(1)isusedtocomputeX’smeanvalue,andprogramtogetsimulativedatawhichare507megabytesFormula(2)isusedtocomputeX’svariance.includingWebpageURLsandcrossURLs.AfterTocomparetheresults,wenormalizedfourgrouprunningtheprogram,wegetsimulativeWebdatawithexperimentaldatausingformula(3).761,129Webpages.Thedataisourexperimentobject.Afternormalizingresultsofthefourgroupsofex-AllofourmeasurementsweremadeonageneralIntelperimentaldata,weuseformulas(1)and(2)tocomputePCwithtwo550MHzIntelprocessors,512megabytesthevariances,asshowninTable1.ofmemoryand36-gigabyteharddisk.TheoperatingWecanseefromTable1that,whentherearetwo,systemisSolaris8.0.four,eight,orsixteenmain-controllers,thevariancesareBasedontheaboveenvironmentforexperiment,weallsmall.Thatistosayeachmain-controllerisre-separatelytestedfourcasesbyvaryingthenumberofsponsibleforaboutthesamesizeofWebpageset.Thus,main-controllers(e.g.,2,4,8and16).Fourexperimentstheexpectedgoalofloadbalanceofthedistributedaredoneindependently.Acentralizedmain-controllersystemisachieved.runsalongwiththedistributedsystemconsistingofnmain-controllers.Everygrouptakesatleastthreedays4.2.Amountofcommunicationbetweenmain-controllerstofinishtesting.Wegotalargesetofresultsfromtheexperiment.OnethingthatshouldbepointedoutisthatToensureconsistencyoftheexperimentenviron-inoursimulationtests,wedidnottrytoassesstheef-ment,weparseallthedomainnamesatonetime.Sothefectsofrandomfactors(e.g.,networkfailure)totheamountofcommunicationonlyincludestransferringperformanceofourproposedarchitecture.crossURLs.Everymain-controlleronlysendscrossURLsandeachURLisnolongerthan128bytes.Inthe4.1.Loadbalanceanalysisactualsystem,forimprovingutilizationofdomainnames,parseddomainnamesbyeverymain-controllerTomaintainloadbalanceofthesystem,weusehashshouldbetransmittedtoeachother.Everymain-con-functiontodynamicallyallocateURLstoeverymain-trollermaintainsatablethatincludescorrespondingcontroller.TheconsequencecanbeobtainedthroughrelationsbetweendomainnamesandIPs.EveryrecordanalyzingcollectedWebpagesbyeachmain-controllerisnolongerthan72bytes(using64bytestostorehosteveryhour.Afteranalyzingtheresultsofthefirsttenname,4bytestostoreIPand4bytestostorevisitinghours,wecandeducethefinalresultofthesystem.time).Sotheamountofcommunicationissmall.Ad-Bycomputingvariancesforthefourgroupsofre-ditionally,tomaintainthedynamicreconfigurabilityofsults,wecanknowtheirdivergencedegreethatcanthesystem,whenthenumberofmain-controllersbethoughtofasastandardtomeasurewhetherchanges,eachmain-controllerneedstomodifysomeofloadisbalancing.Computingneedsthefollowingtheirtables(e.g.,Websitetable).Tomaintainconsis-formulas.tencyoftables,thesystemneedsextraamountofcom-X1munication.However,thesituationisseldom.SoitwillEðXÞ¼xkpk;k¼1;2;...;ð1Þhavelittleaffectontheamountofcommunication(seek¼1Xthedetailin3.6).Consideringonecopyofmessage2DðXÞ¼½xkEðXÞpk;k¼1;2;...;ð2Þshouldbesenttomanymain-controllersinthepreviousTable1Loadvariancesnt1234567891020.0001100.0014540.0005010.0003098.18E)056.18E)052.14E)071.25E)052.74E)058.24E)0640.0003260.000590.0005640.0003750.0003150.0004650.0007020.0006720.0006620.00056880.0001247.04E)056.11E)054.98E)055.32E)054.18E)054.25E)057.44E)055.91E)055.79E)05161.06E)051.57E)051.43E)051.11E)051.34E)051.42E)051.48E)051.51E)051.58E)051.82E)05 H.Yanetal./TheJournalofSystemsandSoftware60(2002)185–193191Table2FourgroupsofexperimentalresultsNumberofmain-controllersLinetypesVisitedWebpagesnumberinthedistributedsystemintheFig.3CentralizedDistributedRatio2Square56199961301.7105294Diamond527121771313.3603548Plus510553048545.9710916Star2476329034411.72491twosituations,weusemulticasttechnologyintheactualsystem.4.3.ScalabilityTheresultsofthefirsttenhoursareshowninTable2andFig.3.InFig.3,theX-axisrepresentstime,whoseunitisonehour,Y-axisisthenumberofvisitedWebpages,andfourgroupsofresultsaredescribedbyvar-ioustypesoflines,foreachtypeoflines,thehigheronedepictstheresultofthedistributedsystemwithnmain-controllers,andtheloweronedepictstheresultofthecentralizedsystemrunningwiththedistributedsystem.Duetotherestrictionofresources,thedistributedsystemandthecentralizedsystemarerunninginoneFig.4.Thespeedupofdistributedsystem.computersimultaneously.ButfromFig.3wecanfindoutthatthecentralizedsystem’s(onlywithonemain-controllercomputer)performanceremainsinvariableInFig.4,X-axisisthenumberofmain-controllerswhenresourcesaresharedwithtwo,fouroreightdis-andY-axisistheratioofthenumberofWebpagestributedmain-controllers.Whilethereare16distributedcrawledbythedistributedsystemandthosebythemain-controllers,theperformancedecreasesgreatly.Oncentralizedsystem.WecanconcludefromFig.4thatYtheotherhand,whenthenumberofmain-controllersisisnearlylinearlyincreasingwiththeincreaseofXwhenbelow8,themorethemain-controllers,thehigherthenisnottoobig.Sothedistributedsystemhasagoodcrawlingefficiencyofthedistributedsystem.Finally,scalability.whenthenumberreaches16,becauseoftheoverloadonthesystemresources,theperformancewilldecreaseasthatofthecentralizedsystem.5.ExperimentalresultsThesimulationresultsdemonstratethatthesystemrealizesourdesigngoal.So,weappliedthearchitectureandmethodtoimplementWebGather2.0andgotthefollowingresultsoftheactualsystem.AlltheresultsoftheactualsystemaregotfromPCswhichhavethesameconfigurationsastheexperimentalPC,andeachmaincontrollerrunsonanindependentPC.5.1.LoadbalancingoftheactualsystemAfternormalizingresultsofthefourgroupsofex-perimentaldata,weuseformulas(1)and(2)(seetheformulain4.1)tocomputethevariances,asshowninTable3.Fig.3.CrawlingefficiencyofbothdistributedsystemandcentralizedTable3showsthat,whenthereareone,two,threeorsystem.fourmain-controllers,thevariancesareallquitesmall. 192H.Yanetal./TheJournalofSystemsandSoftware60(2002)185–193Table3Loadvariancesoftheactualsystemnt123456789101000000000020.0045746080.0047720.0054560.0050580.0049170.0045890.0044720.0047960.0046260.0045530.0042735790.0973340.0981490.0993770.0999150.1001530.1000340.1001770.1000380.10002440.0008900850.0011960.0013730.0014770.0015230.0015470.001510.0015150.0014670.001525Table4FourgroupsoftheactualresultsNumberofmain-controllersLinetypesVisitedWebpagesnumberinthedistributedsystemintheFig.6ActualsystemRatio1Square9467212Diamond1754921.8536843Plus2788332.9452534Star4166394.400868Fig.5.Crawlingefficiencyoftheactualsystem.Fig.6.Thespeedupoftheactualsystem.Thisimpliesthateachmain-controllerwillhaveaboutSotheactualdistributedsystemalsoshowsagoodthesameamountofWebpagestocrawl,andtheloadisscalability.wellbalanced.5.2.Crawlingefficiencyoftheactualsystem6.ConclusionTheresultsofthefirsttenhoursareshowninTable4TheparallelanddistributedarchitecturedescribedinandFig.5.InFig.5,theX-axisrepresentstime,withonethispaperprovidesamethodforefficientlycrawlinghourasmeasuringunit.Y-axisisthenumberofvisitedmassiveamountofWebpages.ThesimulationresultsWebpages.Fourgroupsofresultsaredepictedbydemonstratethatthesystemrealizesourdesigngoal.Thevarioustypesoflines.JustasinFig.4,X-axisinFig.6isfullydistributedcrawlingarchitectureexcelsGoogle’sthenumberofmain-controllersandY-axisistheratioofcentralizedarchitecture(BrinandPage,1998)andscalesthenumberofWebpagescrawledbythedistributedwellasmorecrawlingprocessorsareadded.Theap-systemandthosebyacentralizedsystemwithonlyoneproachforachievingdynamicreconfigurabilitydescribedmain-controller.WecanconcludefromFig.6thatYisinthispaperissimplebuteffective,whichintroduceslittlealsonearlylinearlyincreasingasXincreasesandsomeadministrationcostsandismoreapplicablethanHar-oftheexperimentalresultsareevenbetterthanthevest.Atpresent,weareapplyingthearchitectureandthecorrespondingsimulationresults(e.g.,whennequals4).methodinWebGather2.0.Intheactualsystem(visit H.Yanetal./TheJournalofSystemsandSoftware60(2002)185–193193http://e.pku.edu.cnforalook),afour-processorsystemisGoogleSearchEngine,2000.http://www.google.com.configured,anditdeliversexpectedoutcome,collectingInktomiandtheNECResearchInstitute,2000.WWWstatisticalreports.http://www.inktomi.com/Webmap.about600,000pagesaday,whichverifiesoursimulationLiu,J.,Lei,M.,Wang,J.,Chen,B.,2000.Diggingforgoldontheresulttosomeextent.Atthesametime,werealizethattheWeb:ExperiencewiththeWebGather.In:ProceedingsofthesuccessofthedistributedcrawlingsystembringsmanyFourthInternationalConferenceonHighPerformanceComputingmorenewissuesofresearch,suchasparallelanddis-intheAsia-PacificRegion,Beijing,PRChina,May14–17.IEEEtributedindexingandsearching,etc.Inaddition,webe-ComputerSocietyPress,pp.751–755.Sullivan,D.,2000.SearchEngineSizes.http://searchenginewatch.com/lievethatthearchitectureproposedinthepapercanbereports/sizes.html.usedtobuildtheinformationsysteminfrastructureindigitallibrarycontext.HonfeiYanreceivedhisB.SandM.S.fromHarbinEngineeringUni-versity,PRChina,in1996andin1999,respectively.CurrentlyheisaPh.D.candidateintheDepartmentofComputerScienceandTech-nologyofPekingUniversity.HisresearchinterestsincludedistributedAcknowledgementssystemsandalgorithms,Webinformationretrievalanddatabase.TheworkofthispaperwassupportedbytheNa-JianyongWangisanassistantprofessoroftheDepartmentofCom-tionalGrandFundamentalResearchProgram(973)ofputerScienceandTechnologyatPekingUniversity,PRChina.HeChina(GrantNo.G1999032706),andwearegratefultoreceivedhisB.S.degreefromLanZhouUniversityin1991andhisPh.D.degreefromInstituteofComputingTechnology,ChineseZhengmaoXie,JianghuaZhao,andSongweiShanforAcademyofSciencesin1999,bothincomputerscience.Hisresearchtheirhelpfulcomments.interestsincludedistributedsystemsandalgorithms,Webinformationretrieval,dataminingandbusinessintelligence.XiaomingLiisaProfessorofDepartmentofComputerScienceandReferencesTechnologyofPekingUniversity,PRChina.HereceivedhisB.SfromHarbinInstituteofTechnology,PRChina,in1982,M.S.andPh.D.Bowman,C.Mic,etal.,1995.TheHarvestInformationDiscoveryandfromStevensInstituteofTechnology,Hoboken,USA,in1983and1986,respectively.HisresearchinterestsincludeprogrammingmodelAccessSystem,TechnicalReport,UniversityofColorado,Boulder.forSMPclusters,highperformanceandcontent-basedWebinfor-Brin,S.,Page,L.,1998.Theanatomyofalarge-scalehypertextualmationretrieval,distributedWebservices,andintelligentbroadbandWebsearchengine.In:SeventhInternationalWorldWideWebmultimedianetworks.Conference,Brisbane,Australia.CNNIC,2000.ChinaInternetnetworkdevelopmentstatusstatisticalLinGuoisanundergraduatestudentintheDepartmentofComputerreports.http://www.cnnic.net.cn/develst/report.shtml.ScienceandTechnologyatPekingUniversity,PRChina.HerresearchCERNIC,2000.Informationservice.http://www.nic.edu.cn/INFO/interestsincludedistributedsystemsandalgorithms,Webinformationcindex.html.retrievalandnetworks.

当前文档最多预览五页,下载文档查看全文

此文档下载收益归作者所有

当前文档最多预览五页,下载文档查看全文
温馨提示:
1. 部分包含数学公式或PPT动画的文件,查看预览时可能会显示错乱或异常,文件下载后无此问题,请放心下载。
2. 本文档由用户上传,版权归属用户,天天文库负责整理代发布。如果您对本文档版权有争议请及时联系客服。
3. 下载前请仔细阅读文档内容,确认文档内容符合您的需求后进行下载,若出现内容与标题不符可向本站投诉处理。
4. 下载文档时可能由于网络波动等原因无法下载或下载错误,付费完成后未能成功下载的用户请联系客服处理。
关闭