Data & Knowledge Engineering 68 (2009) 1271–1288
Contents lists available at ScienceDirect
journal homepage: www.elsevier.com/locate/datak

Text document clustering based on neighbors

Congnan Luo a, Yanjun Li b, Soon M. Chung c,*

a Teradata Corporation, San Diego, CA 92127, USA
b Department of Computer and Information Science, Fordham University, Bronx, NY 10458, USA
c Department of Computer Science and Engineering, Wright State University, Dayton, OH 45435, USA

Article info

Article history:
Received 17 February 2008
Received in revised form 20 June 2009
Accepted 22 June 2009
Available online 1 July 2009

Keywords:
Document clustering
Text mining
k-means
Bisecting k-means
Performance analysis

Abstract

Clustering is a very powerful data mining technique for topic discovery from text documents. The partitional clustering algorithms, such as the family of k-means, are reported performing well on document clustering. They treat the clustering problem as an optimization process of grouping documents into k clusters so that a particular criterion function is minimized or maximized. Usually, the cosine function is used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. To solve this problem, we applied the concepts of neighbors and link, introduced in [S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25 (5) (2000) 345–366], to document clustering. If two documents are similar enough, they are considered as neighbors of each other. And the link between two documents represents the number of their common neighbors. Instead of just considering the pairwise similarity, the neighbors and link involve the global information into the measurement of the closeness of two documents. In this paper, we propose to use the neighbors and link for the family of k-means algorithms in three aspects: a new method to select initial cluster centroids based on the ranks of candidate documents; a new similarity measure which uses a combination of the cosine and link functions; and a new heuristic function for selecting a cluster to split based on the neighbors of the cluster centroids. Our experimental results on real-life data sets demonstrated that our propos