A Statistical Approach to Extract Chinese Chunk Candidates from Large Corpora

ZHANG Le, LU Xue-qiang, SHEN Yan-na, YAO Tian-shun
Institute of Computer Software & Theory, School of Information Science & Engineering, Northeastern University, Shenyang, 110004 China
Email: ejoy@xinhuanet.com, studystrong@sohu.com, neusyn@sohu.com, tsyao@mail.neu.edu.cn

Abstract

The extraction of chunk candidates from real corpora is one of the fundamental tasks in building an example-based machine translation model. This paper presents a statistical approach to extracting Chinese chunk candidates from large monolingual corpora. The first step is to extract large N-grams (up to 20-grams) from the raw corpus. Then two newly proposed Fast Statistical Substring Reduction (FSSR) algorithms can be applied to the initial N-gram set to remove unnecessary N-grams using their frequency information. The two algorithms are efficient (both have a time complexity of O(n)) and can reduce the size of the N-gram set by up to 50%. Finally, mutual information is used to obtain chunk candidates from the reduced N-gram set. Perhaps the biggest contribution of this paper is that it is the first to apply the Fast Statistical Substring Reduction algorithm to large corpora and to demonstrate the effectiveness and efficiency of this algorithm, which, we hope, will shed new light on large-scale corpus-oriented research. Experiments on three corpora of different sizes show that this method can efficiently extract chunk candidates from gigabyte-scale corpora with current computational power. We obtain an extraction accuracy of 86.3% on the People Daily 2000 news corpus.

Key Words: Chunk extraction, N-gram, Substring Reduction, Corpus

1 Introduction

With the rapid development of computational power and the availability of large online corpora (BNC (Clear, 1993), People Daily (YU et al., 2002)), there has been a dramatic shift in computational linguistics from manually constructed knowledge bases to partially or totally automatic knowledge acquisition by applying statistical learning methods to large corpora (see SU, 1996, for an overview).

The concept of the chunk was first proposed by Abney (1991) in the early nineties to make the task of language parsing easier. He suggested developing a chunk-based parser that decomposes sentences into chunks with each chunk being a syntacti