The 7th International Symposium on Operations Research and Its Applications (ISORA'08)
Lijiang, China, October 31–November 3, 2008
Copyright © 2008 ORSC & APORC, pp. 317–324

Reputation-based Contents Crawling in Web Archiving System

Hiroyuki Kawano∗
Nanzan University, Aichi 4890863

Abstract

The size of the web archive is increasing exponentially, and many national libraries are making efforts to preserve born-digital scientific, artistic and cultural contents. However, crawling and storing a huge volume of digital information raises various problems that are very hard to resolve from the social, legal and technical viewpoints. In this paper, from the viewpoint of long-term preservation of digital contents with a good reputation in terms of trustiness, uniqueness and valuation, we discuss strategies to preserve the monotonically increasing digital contents on web servers. According to experimental results, our reputation model makes it possible to crawl socially valuable contents for archiving.

Keywords

Web Archive, Web Crawling, Reputation Management

1 Introduction

In recent years, the size of web systems has been increasing exponentially, so it is becoming hard to keep the quality and social structure of web contents and to preserve valuable web resources. For example, in 2001 there existed 1 billion pages on the surface web and 550 billion pages in the deep web¹. In 2003, the volume of web data was 167 TB on the surface web and 92 PB in the deep web². Furthermore, pages published on web servers are continually appearing and disappearing.

Many public organizations, such as "National Libraries" and the IIPC (International Internet Preservation Consortium, www.netpreserve.org), are making efforts to preserve these contents [2], that is, the huge volume of born-digital information on the internet, including scientific, artistic and cultural contents provided by various web systems. Many researchers discuss various technical problems in order to develop better web archives.

Therefore, in order to archive monotonically increasing digital contents, we also discuss many crawling and preserving problems from various technical aspects [10]. For instance, there are problems of optimizing hardware and network costs for operating the archiving service and executing web crawling across various web services and systems [4,

∗ From November in 2002, the a