Similarity Measures for Text Document Clustering

Anna Huang
Department of Computer Science
The University of Waikato, Hamilton, New Zealand
lh92@waikato.ac.nz

ABSTRACT
Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. Partitional clustering algorithms have been recognized as more suitable than hierarchical clustering schemes for processing large datasets. A wide variety of distance functions and similarity measures have been used for clustering, such as squared Euclidean distance, cosine similarity, and relative entropy.

In this paper, we compare and analyze the effectiveness of these measures in partitional clustering for text document datasets. Our experiments utilize the standard K-means algorithm, and we report results on seven text document datasets and five distance/similarity measures that have been most commonly used in text clustering.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Clustering;

... that automatically organizes a collection with a substantial number of data objects into a much smaller number of coherent groups [8, 20]. In the particular scenario of text documents, clustering has proven to be an effective approach for quite some time, and an interesting research problem as well. It is becoming even more interesting and demanding with the development of the World Wide Web and the evolution of Web 2.0. For example, results returned by search engines are clustered to help users quickly identify and focus on the relevant set of results. Customer comments are clustered in many online stores, such as Amazon.com, to provide collaborative recommendations. In collaborative bookmarking or tagging, clusters of users that share certain traits are identified by their annotations.

Text document clustering groups similar documents together to form a coherent cluster, while documents that are different are separated into different clusters. However, the definition of a pair of documents being similar or different is not always clear and normally varies with the actual problem setting. For example, when clustering research papers, two documents are regarded as