Proceedings of HLT-NAACL 2003, Main Papers, pp. 55-62
Edmonton, May-June 2003

A Generative Probabilistic OCR Model for NLP Applications

Okan Kolak
Computer Science and UMIACS
University of Maryland
College Park, MD 20742, USA
okan@umiacs.umd.edu

William Byrne
CLSP
The Johns Hopkins University
Baltimore, MD 21218, USA
byrne@jhu.edu

Philip Resnik
Linguistics and UMIACS
University of Maryland
College Park, MD 20742, USA
resnik@umiacs.umd.edu

Abstract

In this paper, we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system. The model is designed for use in error correction, with a focus on post-processing the output of black-box OCR systems in order to make it more useful for NLP tasks. We present an implementation of the model based on finite-state models, demonstrate the model's ability to significantly reduce character and word error rate, and provide evaluation results involving automatic extraction of translation lexicons from printed text.

1 Introduction

Although a great deal of text is now available in electronic form, vast quantities of information still exist pri-m

[...] by the fact that most OCR systems are black boxes that do not allow user tuning or re-training; Baird (1999, reported in (Frederking, 1999)) comments that the lack of ability to rapidly retarget OCR/NLP applications to new languages is "largely due to the monolithic structure of current OCR technology, where language-specific constraints are deeply enmeshed with all the other code."

In this paper, we describe a complete probabilistic, generative model for OCR, motivated specifically by (a) the need to deal with monolithic OCR systems, (b) the focus on OCR as a component in NLP applications, and (c) the ultimate goal of using OCR to help acquire resources for new languages from printed text. After presenting the model itself, we discuss the model's implementation, training, and its use for post-OCR error correction. We then present two evaluations: one for standalone OCR correction, and one in which OCR is used to acquire a translation lexicon from printed text. We conclude with a discussion of related research and directions for future work.

2 The Model