1000-9825/2004/15(09)1393 ©2004 Journal of Software (软件学报) Vol.15, No.9

A Web Site Representation and Mining Algorithm Using the Multiscale Tree Model∗

TIAN Yong-Hong 1,+, HUANG Tie-Jun 1,2, GAO Wen 1,2,3

1 (Institute of Computing Technology, The Chinese Academy of Sciences, Beijing 100080, China)
2 (Graduate School, The Chinese Academy of Sciences, Beijing 100039, China)
3 (Department of Computer Science, Harbin Institute of Technology, Harbin 150001, China)

+ Corresponding author: Phn: +86-10-82649529, Fax: +86-10-82649298, E-mail: yhtian@jdl.ac.cn, http://www.jdl.ac.cn

Received 2003-06-02; Accepted 2003-07-08

Tian YH, Huang TJ, Gao W. A Web site representation and mining algorithm using the multiscale tree model. Journal of Software, 2004,15(9):1393~1404. http://www.jos.org.cn/1000-9825/15/1393.htm

Abstract: With the exponential growth of both the amount and the diversity of web information, web site mining is highly desirable for automatically discovering and classifying topic-specific web resources from the World Wide Web. Nevertheless, existing web site mining methods have not yet adequately handled how to make use of all the correlative contextual semantic clues and how to denoise the content of web sites effectively so as to obtain better classification accuracy. This paper details three issues to be solved in designing an effective and efficient web site mining algorithm, i.e., the sampling size, the analysis granularity, and the representation structure of web sites. On this basis, this paper proposes a novel multiscale tree representation model of web sites, and presents a multiscale web site mining approach that contains an HMT-based two-phase classification algorithm, a context-based interscale fusion algorithm, a two-stage text-based denoising procedure, and an entropy-based pruning strategy. The proposed model and algorithms may be used as a starting point for further investigating some related issues of web sites, such as query optimization of multiple sites and web usage mining. Experiments also show that the approach achieves on average a 16% improvement in classification accuracy and a 34.5% reduction in processing time ove