A Trigram Statistical Language Model Algorithm for Chinese Word Segmentation

Jun Mao¹, Gang Cheng¹, Yanxiang He¹, and Zehuan Xing²

¹ Computer School, Wuhan University, Wuhan 430072, P.R. China
² Department of Linguistics, Central China Normal University, Wuhan 430079, P.R. China

Abstract. We address the problem of segmenting a Chinese text into words. In this paper, we propose a trigram model algorithm for segmenting a Chinese text. We also discuss why a statistical language model is appropriate for Chinese word segmentation and give an algorithm for segmenting a Chinese text into words. In particular, we solve the search problem introduced by the trigram model, which often leads to low performance. Finally, the issue of OOV word identification is discussed and merged into the trigram-model-based method in order to improve the accuracy of segmentation.

1 Introduction

In many applications of natural language processing, we intend to obtain and analyze basic linguistic units, usually words. For example, counting and indexing the frequency of every word is often used in information retrieval. For English and other Western languages, the segmentation of texts is not necessary at all. Sentences in those languages are always naturally segmented into independent words by spaces and punctuation marks, which are called word delimiters. But for Asian languages like Chinese and Japanese, things are quite different. Because Chinese is an ideographic language, its sentences are composed of characters without any spaces, and ancient Chinese texts contain no punctuation at all. In Chinese tradition, each character corresponds to a single syllable. Most words in all modern varieties of Chinese are polysyllabic, and thus they are usually made up of two or more characters. Thus, Chinese word segmentation, which is to find word boundaries, is a crucial task for applications in natural language processing such as machine translation, information retrieval, etc.

Research on Chinese word segmentation has been conducted for many years. Many methods aiming to resolve the problem have been proposed. Generally, these can be classified into heuristic dictionary-based methods, statistical machine learning methods, and hybrid methods. In some sense, these methods are practical for segmenting Chinese texts. However, dictionary-based, statistical-based ap