ISSN 1000-9825, CODEN RUXUEW
E-mail: jos@iscas.ac.cn
Journal of Software, Vol.21, No.6, June 2010, pp.1287−1295
http://www.jos.org.cn
doi: 10.3724/SP.J.1001.2010.03591
Tel/Fax: +86-10-62562563
© by Institute of Software, the Chinese Academy of Sciences. All rights reserved.

基于特征选择和最大熵模型的汉语词义消歧 ∗
何径舟 1,2, 王厚峰 1,2+

Chinese Word Sense Disambiguation Based on Maximum Entropy Model with Feature Selection
HE Jing-Zhou 1,2, WANG Hou-Feng 1,2+
1 (Institute of Computational Linguistics, School of Electronic Engineering and Computer Science, Peking University, Beijing 100871, China)
2 (Key Laboratory of Computational Linguistics (Ministry of Education), Peking University, Beijing 100871, China)
+ Corresponding author: E-mail: wanghf@pku.edu.cn

He JZ, Wang HF. Chinese word sense disambiguation based on maximum entropy model with feature selection. Journal of Software, 2010,21(6):1287−1295. http://www.jos.org.cn/1000-9825/3591.htm

Abstract: Word sense disambiguation (WSD) can be thought of as a classification problem, and feature selection is of great importance in such a task. In general, features are selected manually, which requires a deep understanding of both the task itself and the classification model employed. In this paper, the effect of feature templates on Chinese WSD is studied, and an automatic feature selection algorithm based on the maximum entropy model (MEM) is proposed. It includes uniform feature template selection for all ambiguous words and customized feature template selection for each word. Experimental results show that automatic feature selection can reduce the feature size and improve Chinese WSD performance. Compared with the best evaluation results of SemEval 2007: task #5, this method increases MicroAve (micro-average accuracy) by 3.10% and MacroAve (macro-average accuracy) by 2.96%.

Key words: maximum entropy model; classification feature; automatic feature selection; Chinese word sense disambiguation

Abstract (translated from the Chinese): Word sense disambiguation is a typical classification problem in natural language processing. In classification, the choice of features is crucial. Features are usually selected by hand, which requires the person selecting them to understand deeply both the problem being classified and the characteristics of the classification model. This paper analyzes how feature templates affect the results of Chinese word sense disambiguation and, on that basis, proposes a set of automatic feature selection methods based on the maximum entropy classification model, including uniform feature template selection for all ambiguous words and an independent feature template optimization algorithm for each individual ambiguous word. Experimental results show that using automatically selected features not only simplifies the feature templates but also improves the performance of Chinese word sense disambiguation. Compared with the best results of SemEval 2007: task #5, this method