A Comparative Study on Representing Units in Chinese Text Clustering

Wang Hongjun¹,², Yu Shiwen¹, Lv Xueqiang², Shi Shuicai², and Xiao Shibin²

¹ Institute of Computing Linguistics, Peking University, Beijing 100080
² Chinese Information Processing Center, Beijing Information Technology Institute, Beijing 100101
wang.hongjun@trs.com.cn

Abstract. Words and n-grams are commonly used representing units for Chinese text and have proved to be good features for Chinese Text Categorization and Information Retrieval. However, the effectiveness of these representing units for Chinese Text Clustering has not yet been examined. This paper is a comparative study of representing units in Chinese Text Clustering. Using the K-means algorithm, we evaluated several representing units, including Chinese character N-gram features, word features and their combinations. We found Chinese word features, Chinese character unigram features and bi-gram features to be the most effective in our experiments. Combining features did not improve the results. Detailed experimental results on several public Chinese Text Categorization datasets are provided in the paper.

Keywords: Chinese text Clustering; N-gram feature; Bi-gram feature; Word feature.

1 Introduction

Text clustering has been investigated for use in a number of different areas of text mining and information retrieval. It plays an important role in efficient document organization, summarization, navigation and retrieval [1][2][3][4][5]. In text clustering, a text or document is always represented as a bag of words. There are no boundaries between Chinese words, so segmentation is the basis of Chinese Text Processing. Many effective segmentation methods have been proposed in previous studies. However, when a large number of new words such as personal names, location names and company names appear in the text, the result of segmentation is usually unsatisfactory [6]. Some researchers have tried Chinese character N-gram features in Chinese text categorization and information retrieval and reported their experimental results [7][8][9]. But how to choose appropriate representing units for Chinese text clustering is still an open problem. This paper uses Chinese words, N-grams and their combinations as representing units and compares their performance in document clustering. The experime