Beijing Jiaotong University Master's Thesis
Research on Web Page Deduplication Algorithms Based on Text Clustering
Name: Yao Man (姚漫)
Degree applied for: Master
Major: Computer Application Technology
Advisor: Yu Jian (于剑)
Date: 2008-06-01

ABSTRACT

As the World Wide Web grows rapidly into the largest and most popular source of readily available information, information sources are becoming increasingly abundant and accessible. However, because network information is expanding at an exponential rate, people encounter numerous difficulties in IR (Information Retrieval). Meanwhile, because web pages are easy to copy, network services abound with duplicate information. Detecting and eliminating duplicate pages is therefore of great significance.

First, this thesis introduces traditional document clustering algorithms and duplicate page detection technology, and analyzes and summarizes the advantages and disadvantages of both. By combining two existing clustering algorithms, it proposes the Bisecting K-means++ clustering algorithm. Experiments on the UCI datasets report its SSE value and demonstrate its effectiveness in clustering accuracy and its efficiency in running time.

Further, web documents are converted by Tidy into well-formed HTML documents and parsed into a DOM tree structure. Based on the characteristics of noisy web pages, the thesis proposes the max text block algorithm, which is used to eliminate noise from web pages and to discover the important text blocks; a comprehensive evaluation verifies its feasibility. Experiments on web pages denoised by the max text block algorithm suggest that the Bisecting K-means++ algorithm performs well in precision and recall, and that the algorithm based on the MD5 value of the top N high-frequency words performs well in time expenditure.

Lastly, in an information extraction project for B2B websites, the existing algorithms based on keywords and VSM (Vector Space Model) were unable to complete the task of integrating company information. To address this practical problem, the entity identification technique is applied to the detection of duplicate company information. On the one hand, integrating company information eliminates duplicate data, saves storage space, and improves the user experience with the search engine; on the other hand, it makes it possible to mine detailed company information and provides a scoring basis for the quality rank algorithm.

KEYWORDS: Do
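
The abstract names an approach based on the MD5 value of the top N high-frequency words but does not describe its details. The following is only a minimal sketch of one plausible reading, not the thesis's implementation: each page is reduced to its N most frequent words, those words are hashed with MD5, and pages with equal digests are treated as duplicate candidates. The function names top_n_fingerprint and group_duplicates, the regex tokenizer, the tie-breaking rule, and the alphabetical ordering of the selected words are all assumptions introduced for illustration. Chinese text would need a word segmenter in place of the regex.

```python
from __future__ import annotations

import hashlib
import re
from collections import Counter


def top_n_fingerprint(text: str, n: int = 5) -> str:
    """Return the MD5 hex digest of the n most frequent words in text (hypothetical helper)."""
    # Assumption: a simple lowercase alphanumeric tokenizer stands in for real segmentation.
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(words)
    # Rank by descending frequency, breaking ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Assumption: sort the selected words so the digest ignores their frequency order.
    top_words = sorted(word for word, _ in ranked[:n])
    return hashlib.md5(" ".join(top_words).encode("utf-8")).hexdigest()


def group_duplicates(pages: dict[str, str], n: int = 5) -> list[list[str]]:
    """Group page ids whose top-n word fingerprints are identical (hypothetical helper)."""
    buckets: dict[str, list[str]] = {}
    for page_id, text in pages.items():
        buckets.setdefault(top_n_fingerprint(text, n), []).append(page_id)
    # Only buckets holding more than one page are duplicate candidates.
    return [ids for ids in buckets.values() if len(ids) > 1]
```

Because each page is reduced to a single digest, grouping takes one pass over the pages and a dictionary lookup per page, which is consistent with the abstract's remark that this approach does well on time expenditure. The trade-off is that any change in the top N words produces a different digest, so near-duplicates with slightly different wording can be missed.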