Journal of Computer Research and Development (计算机研究与发展), ISSN 1000-1239, CN 11-1777/TP, 48(4): 610-616, 2011

Graph-Based Automatic Acquisition of Semantic Classes (基于图的同义词集自动获取方法)

Wu Yunfang (吴云芳) 1,2, Shi Jing (石静) 1,2, and Jin Peng (金澎) 3
1 (Key Laboratory of Computational Linguistics (Peking University), Ministry of Education, Beijing 100871)
2 (Institute of Computational Linguistics, Peking University, Beijing 100871)
3 (Laboratory of Intelligent Information Processing and Application, Leshan Normal University, Leshan, Sichuan 614000)
(wuyf@pku.edu.cn)

Abstract  A semantic class is a collection of terms that share similar meaning. Knowing the semantic classes of words can be extremely valuable for many natural language processing tasks. This paper investigates the use of linguistic knowledge in the graph-based acquisition of Chinese semantic classes, and demonstrates that linguistic knowledge improves the graph-based method. The corpus used is the Xinhua News portion of the LDC Chinese Gigaword. A graph is built by extracting word pairs that occur in coordinate structures in the corpus, with the co-occurring words as nodes and the co-occurrence frequency as the weight of the edge between the two words. The Newman algorithm is then applied to cluster the words in the graph. The paper focuses on transforming the edge weights, motivated by the properties of coordinate structures and of the Chinese language. Six methods are presented: dividing the whole corpus into small parts, cutting low-frequency edges, increasing the weight of bidirectional edges, increasing the weight of edges within cliques, increasing the weight of edges whose two nodes share the same last character, and reducing the weight of edges whose two nodes have different numbers of characters. With all six methods, the experiment yields a promising precision of 53.12%, which outperforms the baseline Newman algorithm by 29.84%.

Keywords  similar words; semantic class; graph model; coordinate structure; Newman algorithm; edge weight

Abstract (translated from the Chinese 摘要)  Synonym sets are an important kind of basic linguistic knowledge, and acquiring them automatically from large-scale corpora is a fundamental research topic in natural language processing. Word pairs linked by coordinate structures are extracted automatically from a large-scale corpus and used to build a graph; the Newman algorithm then partitions the graph, automatically clustering similar words. Building on the Newman algorithm, this work focuses on fully exploiting the properties of coordinate structures and the word-formation characteristics of Chinese, adjusting the edge weights of the graph with six methods to improve performance: splitting the corpus, removing low-frequency edges, increasing