Clustering Massive Text Data Streams by Semantic Smoothing Model

Yubao Liu¹, Jiarong Cai¹, Jian Yin¹, and Ada Wai-Chee Fu²

¹ Department of Computer Science of Sun Yat-Sen University, Guangzhou, 510275, China
liuyubao@mail.sysu.edu.cn, kelvin2004_cai@163.com, issjyin@mail.sysu.edu.cn
² Department of Computer Science and Engineering, the Chinese University of Hong Kong, Hong Kong
adafu@cse.cuhk.edu.hk

Abstract. Clustering text data streams is an important issue in the data mining community and has a number of applications, such as newsgroup filtering, text crawling, document organization, and topic detection and tracking. However, most methods are similarity-based approaches that use the TF*IDF scheme to represent the semantics of text data, which often leads to poor clustering quality. In this paper, we first give an improved semantic smoothing model for the text data stream environment. We then use the improved semantic model to improve the clustering quality and present an online clustering algorithm for clustering massive text data streams. In our algorithm, a new cluster statistics structure, the cluster profile, is presented, in which the semantics of text data streams are captured. We also present experimental results illustrating the effectiveness of our technique.

Keywords: Semantic Smoothing, Text Data Streams, Clustering.

1 Introduction

Clustering text data streams is an important issue in the data mining community and has a number of applications, such as newsgroup filtering, text crawling, document organization, and TDT (topic detection and tracking). In such applications, text data comes as a continuous stream, and this presents many challenges to traditional static text clustering [1]. The clustering problem has recently been studied in the context of numeric data streams [2, 3]. However, research on clustering text data streams is still at an early stage. In [4], an online algorithm framework based on the traditional numeric data stream clustering approach is presented for categorical and text data streams. In [4], the concept of a cluster droplet is used to store the real-time condensed cluster statistics. When a document arrives, it is assigned to the most suitable cluster, and the corresponding cluster droplet is then updated. This framework also distinguishes the historical documents with the