Pages: 62
Date: 2019-03-03
Research on Methods for Building a Large-Scale Parallel Corpus from the Web
Chinese Abstract

Research on Methods for Building a Large-Scale Parallel Corpus from the Web

... extract parallel resources, and use methods based on length, bilingual dictionaries, and translation models to improve the quality of the parallel corpus.

Keywords: Web information mining, bilingual parallel corpus, parallel resource alignment, bilingual mixed web page acquisition, machine learning

Author: Feng Yanhui (冯艳卉)
Supervisor: Yao Jianmin (姚建民)

Research on Large-Scale Bilingual Parallel Corpus Extraction from the Web

Abstract

Large-scale bilingual parallel corpora can benefit many Natural Language Processing (NLP) applications, such as machine translation and cross-language information retrieval. There are massive multilingual text resources on the Web, and most previous research has focused only on extracting bilingual parallel resources from parallel monolingual page pairs. Although considerable manpower, material, and financial resources have been spent on extracting such bilingual resources, the existing corpora are still far from sufficient for real text processing because of their small scale, poor timeliness, and imbalance of domains. Researchers have recently found that parallel bilingual resources exist not only in parallel monolingual page pairs but also within single bilingual pages, and that bilingual pages contain more parallel resources, with higher translation quality and coverage of more domains. In this paper, we focus only on such bilingual pages and propose to obtain a large-scale bilingual parallel corpus automatically from the Web. Our research results can be summarized as follows:

> Discovering bilingual pages from the Web

The Web contains a massive number of pages, so discovering bilingual pages accurately is a big challenge. Previous studies usually adopt methods based on predefined targets: first collect many source Web sites (such as English learning sites and translation sites), then download all of their internal pages as candidate bilingual pages. However, collecting the source sites requires human intervention and yields only a limited number of candidates. To overcome these disadvantages, other studies propose discovering source sites automatically by using search engines and heuristic information, but such methods return many noisy pages, and the parallel resources mined from them are of poor quality. This paper is the first to propose discovering and extracting bilingual pages by using search engines together with an already acquired small-scale parallel corpus, and experimental results show that this is indeed a novel method for acquiring high-quality bilingual page
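
The Chinese abstract mentions length-based and bilingual-dictionary-based methods for improving corpus quality, but gives no details. The following is only a minimal sketch of what such a filter could look like for Chinese-English candidate pairs, not the thesis's actual procedure. The function names, the thresholds (`min_ratio`, `max_ratio`, `min_coverage`), and the toy lexicon are all assumptions made for illustration.

```python
from typing import Dict, List


def length_ratio(zh: str, en: str) -> float:
    """Chinese characters (whitespace excluded) per English whitespace token."""
    zh_len = sum(1 for ch in zh if not ch.isspace())
    en_len = len(en.split())
    return zh_len / en_len if en_len else float("inf")


def dictionary_coverage(zh: str, en: str, lexicon: Dict[str, List[str]]) -> float:
    """Fraction of lexicon-known English words with a translation found in zh.

    English words absent from the lexicon are ignored rather than penalized.
    """
    words = [w.strip(".,;:!?\"'()").lower() for w in en.split()]
    known = [w for w in words if w in lexicon]
    if not known:
        return 0.0
    hits = sum(1 for w in known if any(t in zh for t in lexicon[w]))
    return hits / len(known)


def keep_pair(
    zh: str,
    en: str,
    lexicon: Dict[str, List[str]],
    min_ratio: float = 0.5,     # assumed threshold, not from the thesis
    max_ratio: float = 4.0,     # assumed threshold, not from the thesis
    min_coverage: float = 0.5,  # assumed threshold, not from the thesis
) -> bool:
    """Keep a candidate pair only if it passes both the length and lexicon checks."""
    ratio = length_ratio(zh, en)
    if not (min_ratio <= ratio <= max_ratio):
        return False
    return dictionary_coverage(zh, en, lexicon) >= min_coverage


# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {"i": ["我"], "like": ["喜欢"], "apples": ["苹果"]}
```

With the toy lexicon, `keep_pair("我喜欢苹果", "I like apples.", TOY_LEXICON)` accepts the pair, while a pair whose Chinese side shares no dictionary translations with the English side is rejected. In practice the thresholds would need tuning on held-out aligned data, and a translation-model score could be added as a third check.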