博士学位论文
基于深度学习模型的CCG超标注
CCG SUPERTAGGING BASED ON DEEP LEARNING MODELS
REKIA KADARI
哈尔滨工业大学
2018年06月

国内图书分类号:TP391.1  学校代码:10213
国际图书分类号:681.324  密级:公开

工学博士学位论文
基于深度学习模型的CCG超标注

博士研究生:REKIA KADARI
导师:刘挺 教授
申请学位:工学博士
学科:计算机科学与技术
所在单位:计算机科学与技术学院
答辩日期:2018年06月
授予学位单位:哈尔滨工业大学

Classified Index: TP391.1
U.D.C: 681.324

Dissertation for the Doctoral Degree in Engineering
CCG SUPERTAGGING BASED ON DEEP LEARNING MODELS

Candidate: REKIA KADARI
Supervisor: Prof. Liu Ting
Academic Degree Applied for: Doctor of Engineering
Specialty: Computer Science
Affiliation: School of Computer Science and Technology
Date of Defence: June, 2018
Degree-Conferring-Institution: Harbin Institute of Technology

摘要

如何让计算机理解并处理人类语言是人工智能领域长盛不衰的研究课题。使用自然语言与具有人工智能的计算机交互常被称为自然语言处理(NLP)。自然语言处理在我们日常生活中应用十分广泛。序列标注是自然语言处理领域中历史最悠久的研究课题之一,包括词性标注(part-of-speech tagging)和CCG超标注(Combinatory Categorial Grammar supertagging,组合范畴语法超标注)。CCG超标注是许多自然语言处理任务的前序步骤,例如组块(chunking)和句法解析(parsing)。CCG超标注可定义为:给定一个由词构成的序列,给序列中的每个词赋予一个CCG超标签。CCG超标注的最大挑战在于超标签的总数巨大,以及每个词可赋予的超标签数目众多,这使得许多应用非常复杂。前人提出过许多方法来应对这一问题,这些方法通常基于不同的统计机器学习方法,并且通常依赖大量人工设计的表示和输入特征来取得较好的实验效果。因此,如何自动地提取输入的表示特征也是研究的重点。深度学习可以看成是机器学习和表示学习的结合,可以自动学习有用的特征和输入表示。因此我们尝试使用深度学习技术处理CCG超标注任务。

在本文中,我们专注于CCG超标注这一任务,提出了一些可以减少赋予每个输入词的词法类别数目的技术。我们的目标是开发简单而准确的模型来解决CCG超标注的挑战,同时利用深度神经网络学习必要的中间表示,以避免复杂的人工特征选择。我们认为现有的CCG超标注方法有三个主要问题。第一个问题是长序列建模问题,即循环神经网络(RNNs)只能记忆较少的步骤,难以建模较长的序列。第二个问题是,深度学习模型能从输入的依存中受益,而统计机器学习算法能够从输出的依存中受益;对于CCG超标注这一结构预测任务,如何同时从输入和输出依存中学习是非常必要的。最后,第三个问题是未登录词(OOV)的问题,即未登录词和罕见词会降低模型的准确率。因此,本文的主要目标是使用深度学习技术解决上述CCG超标注任务中的问题,并有效降低所预测的超标签的个数,同时避免使用词法特征以及其他手工构建的特征。特别地,本文着重考虑以下问题:

1) 如何记忆序列信息是许多序列标注问题的关键任务,CCG超标注亦是如此。我们提出了一个基于门限循环单元(GRU)网络的新方法。为了同时保存从左到右和从右到左的信息,我们应用了双向门限循环单元。此外,我们采用了深度结构来学习输入间的复杂交互。实验结果表明,所提方法提升了CCG语法的超标注和多标注性能。

2) 我们为CCG超标注提出了一个新的方法,叫做"后向-双向长短时记忆网络(Backward-BLSTM)"。长短时记忆网络(LSTM)是一个比门限循环单元更有效的模型,它能更好地记忆信息以及预测最可能的超标签。我们提出的结构对于CCG语法的超标注和多标注都是有效的。实验结果表明我们所提出的方法能有效地建模长序列,同时能达到领先的性能。

3) 前人为CCG超标注这一任务提出了许多模型。然而这些模型要么是使用基于手工构建特征的机器学习方法,要么虽然是基于深度学习的模型但却忽略了临近输出标签之间的依存关系,而这一关系对于预测当前标签十分重要。因此,如何利用临近的输出标签来预测当前位置的标签是关键。在这项工作中,我们同时利用了条件随机场(CRF)和双向长短时记忆网络。该模型首先使用双向长短时记忆网络学习句子表示,同时获取过去和未来的输入并长距离地记忆这些信息;然后使用条件随机场处理句子级别的标签信息并输出预测。这个模型能够同时从输入和输出中受益,性能优于当前最好的方法。实验结果表明所提方法在CCG超标注和多标注上超越了现有的方法。

4) 尽管许多工作已经利用深度学习模型来解决CCG超标注的问题,仍然没有一项研究深入解决未登录词的问题。考虑到这一点,我们提出了一种简洁而有效的方法来探索不同的输入表示。为表示词间的形态信息,首先使用预训练的词向量来提取词之间的相似度;然后使用字符级别的输入表示,建立字符与向量间的检索表;再把字符级别和词级别的表示拼接到一起,送入双向长短时记忆网络来产生输出。实验结果表明我们的方法在领域内和领域外的数据集上都优于仅使用基于词的输入表示的模型。

对于CCG超标注这一问题,我们进行了深入研究,并指出了现有公开技术的局限。基于这一分析,我们有条理地提出并实现了解决问题的新方法,并在若干数据集上验证了方法的有效性。实验结果证明了所有提出技术的有效性。

关键词:自然语言处理,组合范畴语法,CCG超标注,深度学习,神经网络
Abstract

Making computers understand and manipulate human languages has been a subject of research in Artificial Intelligence (AI) for many years. Interacting with AI-enabled computers using natural languages is often referred to as Natural Language Processing (NLP). NLP has many applications that are widely used in our daily lives. Sequence labeling is one of the oldest fields in NLP and includes tasks such as part-of-speech tagging and Combinatory Categorial Grammar (CCG) supertagging.

CCG supertagging serves as an important first stage for many NLP applications, after which further processing such as chunking and parsing is done. CCG supertagging can be defined as follows: given a sequence of words, the goal is to assign a CCG supertag to each word in the sequence. The major challenge of CCG supertagging is the huge size of the category set and the large number of categories that can be assigned to each item, which makes many applications very complex. This has become a critical task in the NLP community. Considerable approaches have been proposed to deal with the CCG supertagging problem, and the solutions are often based on different statistical machine learning models. However, most current machine learning methods work well only because of carefully human-designed representations and input features. In recent research, automatically extracting features that capture information about input representations has become very important. Deep learning can be seen as putting representation learning back together with machine learning: it attempts to jointly learn good features and input representations.

In this thesis, we focus on the CCG supertagging task and propose and develop techniques that reduce the number of lexical categories assigned to each word of an input. Our goal is the development of simple and accurate models that can solve the challenging problem of CCG supertagging and, based on deep learning models, learn the necessary intermediate representations of input entries without the need for extensive feature engineering.

We believe that there are three main problems with current CCG supertagging models. The first problem is modeling long sequences, where Recurrent Neural Networks (RNNs) fail and tend to memorize information for only a few time steps.
The second problem is related to output dependencies: because deep learning models benefit from input dependencies and statistical machine learning algorithms benefit from output dependencies, a model that can benefit from both is necessary for CCG supertagging as a structured prediction task. The third problem is related to Out-Of-Vocabulary (OOV) words, where the accuracy of existing models decreases in the presence of unseen and rare words. For these reasons, the general objective of this thesis is to propose novel techniques for the CCG supertagging problem based on deep learning methods, in order to improve the capability to reduce the number of predicted supertags and solve the above-mentioned problems. Furthermore, no lexical or hand-crafted features are required. In particular, the following specific issues are considered in this work:

1) How to memorize information from sequential data is still a critical problem for many sequence tagging tasks, and for CCG supertagging in particular. We present a new method for CCG supertagging based on Gated Recurrent Unit (GRU) networks. In order to capture input data from both the left and the right direction, a Bidirectional GRU (BGRU) model is used. Moreover, a deep architecture is adopted in order to learn complex interactions between input entries. The reported results of the proposed model improve the supertagging and multi-tagging performance for the CCG grammar.

2) We present a new method named "Backward-BLSTM" for CCG supertagging. Long Short-Term Memory (LSTM) networks are adopted as a more powerful method than GRU networks to memorize information and to select the most likely predicted supertag. The proposed architecture proves its efficiency for both supertagging and multi-tagging for the CCG grammar. The experimental results show that the proposed model is able to model long sequences efficiently and achieves better performance than state-of-the-art models.

3) Many approaches have been proposed for the CCG supertagging task. However, these models either use many hand-crafted features (in the case of machine learning strategies) or process a sequence at the sentence level without modeling the correlation between neighboring labels, which has a great influence on predicting the current label (in the case of deep learning models). Labeling a given sequence with a set of CCG syntactic categories while taking the tag level into account is a very critical point. In this work, we use a combination of Conditional Random Fields (CRF) and BLSTM models.
The model first learns a sentence representation, where we can gain from both past and future input features and store the data for long periods thanks to the BLSTM architecture. Afterward, the model uses sentence-level tag information thanks to a CRF layer, which is regarded as the output predictor. The model benefits from both input and output entries and is more competent than state-of-the-art methods. The achieved results demonstrate that the proposed model outperforms the existing approaches for both CCG supertagging and multi-tagging.

4) Even though some work has taken advantage of deep learning models for CCG supertagging, there is still no comprehensive research on how to deal with OOV entries. With this in mind, we present a new method which explores the strengths of different embeddings in a simple and effective way. To represent morphological information between words, pre-trained word embeddings are used to extract informative similarity between words. Then we use character embeddings, which are mapped in character lookup tables. BLSTM networks are used for both the character and the word embeddings, which are then concatenated together to generate the final outputs. The experimental results show that our method produces better performance than word-embedding-based models on both in-domain and out-of-domain datasets.

For the CCG supertagging problem, a deep study of the literature is carried out, and the limitations of the currently published techniques are highlighted. Starting from this analysis, novel approaches are theoretically proposed, implemented and tested on several datasets to verify their effectiveness. The achieved experimental results confirm the effectiveness of all the proposed techniques.

Keywords: Natural Language Processing, Combinatory Categorial Grammar, CCG Supertagging, Deep Learning, Neural Networks

Contents

Abstract (In Chinese)
Abstract (In English)
Index of figures
Index of tables
Chapter 1 Introduction
1.1 Motivation
1.2 The CCG Supertagging Task
1.3 Applications of CCG Supertagging
1.4 Categorial Grammar
1.5 Combinatory Categorial Grammar
1.5.1 Application Combinators
1.5.2 Composition Combinators
1.5.3 Type-raising Combinators
1.6 Literature Review
1.6.1 Supertagging
1.6.2 CCG supertagging
1.7 Evaluation Metric
1.8 Dataset
1.9 Thesis Contributions
1.10 Organization of the Thesis
Chapter 2 Gated Recurrent Units for the CCG Supertagging task
2.1 Introduction
2.2 Neural Networks
2.2.1 Deep Learning
2.2.2 Recurrent Neural Networks
2.2.3 Bidirectional RNN
2.2.4 Gated Recurrent Units
2.3 BGRU proposed model for the CCG Supertagging task
2.3.1 Input Layer
2.3.2 GRU Neural Network
2.3.3 Output Layer
2.4 Experiment Settings
2.4.1 Dataset
2.4.2 Data Preprocessing
2.4.3 Hyper-Parameters and Training
2.4.4 Word embeddings Settings
2.4.5 Learning Algorithm
2.4.6 Dropout
2.5 Results and Analysis
2.5.1 Supertagging Results
2.5.2 Multi-tagging Results
2.6 Summary
Chapter 3 Backward-BLSTM model for the CCG Supertagging task
3.1 Introduction
3.1.1 Long Short Term Memory Networks
3.2 Backward-BLSTM proposed model for the CCG Supertagging task
3.2.1 Input Layer
3.2.2 Neural Network
3.2.3 Output layer
3.3 Experiments Settings
3.3.1 Experimental Data
3.3.2 Data Preprocessing
3.3.3 Implementation
3.3.4 Hyper-Parameters
3.3.5 Learning Algorithm
3.3.6 Dropout
3.4 Experiment Results
3.4.1 Supertagging Results
3.4.2 Multi-tagging Results
3.5 Summary
Chapter 4 BLSTM-CRF model for the CCG Supertagging task
4.1 Introduction
4.2 Model Description
4.2.1 BLSTM Network
4.2.2 Conditional Random Fields
4.2.3 BLSTM-CRF proposed model for the CCG Supertagging task
4.3 Experiment Settings
4.3.1 Datasets
4.3.2 Word embeddings
4.3.3 Optimization Algorithm
4.3.4 Dropout Training
4.3.5 Hyper-Parameters Tuning
4.4 Results and Analysis
4.4.1 Supertagging Results
4.4.2 Multi-tagging Results
4.5 Summary
Chapter 5 Character-Word embeddings for the CCG Supertagging task
5.1 Introduction
5.2 Character-Word embeddings proposed model for the CCG Supertagging task
5.2.1 Word-Level Neural Network
5.2.2 Character-Level Neural Network
5.2.3 Concatenation
5.3 Experiments settings
5.3.1 Datasets
5.3.2 Hyper-Parameters
5.4 Results and Analysis
5.4.1 Supertagging results
5.4.2 Multi-tagging Results
5.5 Summary
Conclusions
References
Papers published in the period of Ph.D. education
Statement of copyright and Letter of authorization
Acknowledgements
Resume
插图索引

图1-1 Example of POS tagged sentence
图1-2 Example of CCG Supertagged sentence
图1-3 Example from section 00 of the CCGBank corpus
图1-4 Dissertation outlines
图2-1 An example of an Artificial Neural Network
图2-2 An example of a Deep Neural Network
图2-3 General structure of simple RNNs
图2-4 General structure of a simple RNN unfolded for three time steps
图2-5 General structure of BRNN unfolded for three time steps
图2-6 Illustration of the vanishing gradient problem
图2-7 Gated Recurrent Units architecture [64]
图2-8 BGRU proposed model for the CCG supertagging
图3-1 From RNN to LSTM [87]
图3-2 Long Short-Term Memory network architecture
图3-3 Backward-BLSTM model for the CCG supertagging
图3-4 1-best accuracy of our Backward-BLSTM proposed model on the development set with and without dropout
图4-1 Deep BLSTM architecture with 2-BLSTM Layers
图4-2 CRF Graph
图4-3 The neural net mechanism
图4-4 BLSTM-CRF network model for the CCG supertagging
图5-1 Word level neural network
图5-2 Character level neural network
图5-3 Word-Character based embeddings model for the CCG supertagging
表格索引

表2-1 The final chosen hyper-parameters
表2-2 Performance comparison with state-of-the-art methods on the development set
表2-3 Performance comparison with state-of-the-art methods on the test set
表2-4 Performance comparison of different models for multi-tagging accuracy on Section 00 for different levels
表3-1 Comparison of the accuracy results on the development set using different word embeddings
表3-2 1-best accuracy results with and without dropout on development and test data
表3-3 The final chosen hyper-parameters
表3-4 1-best accuracy on the development set (Section 00)
表3-5 1-best accuracy on the test set
表3-6 1-best accuracy comparison
表3-7 Performance comparison of different models for multi-tagging accuracy on Section 00 for different levels
表4-1 The final hyper-parameters settings for our model
表4-2 Performance comparison with state-of-the-art methods on the development set
表4-3 Performance comparison with state-of-the-art methods on the test set
表4-4 Performance comparison of different models for multi-tagging accuracy on Section 00 for different levels
表5-1 Accuracy results on the development set
表5-2 Accuracy results on the test set
表5-3 Performance comparison of different models for multi-tagging accuracy on Section 00 for different levels
第1章 Introduction

1.1 Motivation

Nowadays, computers play an integral role in daily human life as one of the most brilliant gifts of science. Computational Linguistics (CL) is a specialized discipline concerned with the application of computers to the processing of natural human languages. The main goal of CL is to enable computers to understand and manipulate human languages, drawing on knowledge from linguistics, computer science, logic, cognitive science as well as other sciences. CL can be divided into many subfields that branch into several research areas such as Machine Translation (MT), zero pronoun resolution, Question Answering (QA), Natural Language Understanding (NLU), speech recognition and parsing. These tasks are considered NP-complete NLP problems. To build those high-level tasks, many preliminary tasks should be taken into account, such as tokenization, Information Extraction (IE), anaphora resolution, and sequence labeling tasks, among others.

Sequence labeling, or structured prediction, is required in many areas such as NLP and bioinformatics (e.g., protein secondary structure prediction). Structured learning corresponds to the task of assigning a label to each element of an input sequence. In NLP, sequence prediction corresponds to a vast range of problems. The earliest and most famous sequence labeling problem is probably Part-Of-Speech (POS) tagging, where each word in a sentence is labeled with a POS class such as Noun (N), Verb (VB), Adjective (JJ), Pronoun (PRP), Adverb (RB), etc. [1]. Another example is IE, which addresses the problem of identifying instances of classes and includes Named Entity Recognition (NER), which consists of identifying entity information such as person, location, time, organization, etc. [2]. There is also coreference resolution, which aims at identifying multiple references to the same entity in a text, whether a name, a pronominal, etc. [3]. Yet another example of a sequence labeling problem is supertagging, which refers to assigning a single appropriate supertag to each word of an input sentence.

Many of the early pioneers of CL research were interested in the area of sequence labeling, since it is useful for so many tasks. In the last few decades, supertagging has attracted the attention of several researchers and has become more and more important for many NLP tasks as a primary step before applications such as parsing [4], language modeling [5] and text simplification [6].
Supertagging resembles POS tagging in that each word in a sentence is tagged with a supertag category. It was initially proposed for Lexicalized Tree Adjoining Grammar (LTAG) [7] and then applied to other grammar formalisms such as Probabilistic Context-Free Grammar (PCFG) [8] and the CCG grammar [9]. CCG has been argued to be the grammar formalism used by humans [10], providing a natural linkage between syntactic structure and semantic representation. Furthermore, compared to other grammars, it offers high flexibility because it allows deriving the structure of any part of a sentence without the need to derive the structure of the whole sentence. The application of supertagging to the CCG grammar is often referred to as "CCG supertagging" and consists of assigning a CCG syntactic category to each word in an input sequence. However, the main challenge of the CCG supertagging task compared to POS tagging (both are considered sequence labeling problems) is the huge size of the CCG supertag category set compared to the POS tag set, as supertags contain much richer information. Moreover, many words take multiple CCG supertags, so the number of predicted supertags may be very large.

In the literature, dominant approaches based on machine learning methods have been proposed for the CCG supertagging task, such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF). However, the use of machine learning in NLP has been mostly limited to the numerical optimization of weights for humanly designed representations and features extracted from the text data. The need to automatically learn features or representations from raw text is crucial for a wide range of NLP tasks. During the past several years there has been a long history of using Neural Networks (NNs), which have made major advances in solving problems that resisted the best attempts of statistical machine learning methods for many years and have impacted a wide range of information processing. NN-based methods have been shown to perform well for the CCG supertagging task. The most attractive aspect of NN methods is their ability to perform these tasks without external hand-designed resources or time-intensive feature engineering. To this end, Artificial Neural Networks (ANNs) develop and make use of an important concept called "embeddings" [11], which has proved very effective and has been widely adopted in the NLP community; it consists of turning inputs (i.e., words) into a representation (i.e., a vector of floats) that NNs can manipulate.
In recent years, deep neural networks, more commonly called deep learning, have emerged as a new area of machine learning research that allows the proposal of strong models to overcome the shortcomings of both statistical machine learning and shallow NNs, with agreeable models such as RNNs and Long Short-Term Memory (LSTM) networks. Today, deep learning has become the standard approach for developing high-performance models and has been shown to significantly improve the efficiency of numerous systems.

In this thesis, our main objective is to use deep learning techniques to solve a sequence labeling problem. We focus on the task of supertagging for the CCG grammar. The most important problems in the CCG supertagging process are learning long sequences, the dependence between inputs and outputs, and the large number of CCG lexical categories. Recurrent networks such as Gated Recurrent Units (GRUs) and LSTMs were chosen for this work because, among the family of deep learning techniques, LSTMs and GRUs are rated as the best for modeling sequential data thanks to their capability to store information for a long time, which is very useful for our task.

1.2 The CCG Supertagging Task

In NLP research, learning tasks are complicated to perform: we are usually required to solve a set of necessary problems together, with respect to some elementary structure, in order to solve other problems. This is usually called structured learning. Structured learning, or sequence labeling, tasks are among the most well studied problems in the NLP literature, as the generic task of assigning labels to the elements of a sequence. Sequence labeling corresponds to a wide range of real-world problems. The most popular sequence labeling problem is POS tagging, where each word is labeled with a POS tag. However, it is known that natural language grammar is ambiguous. In other words, given a natural language grammar, one sentence might have several valid structures and each word may take multiple tags. Figure 1-1 shows an example of a sentence with the corresponding POS tags, where each word is associated with multiple POS tags (tags in double boxes are the correct tags) [12].

图1-1 Example of POS tagged sentence.

The term "supertagging", now widely used in NLP, was coined by Joshi and Bangalore [13], and the beginning period of CCG supertagging research was in 2000-02. In defining the task, similarly to POS tagging, CCG supertagging can be viewed as the process of assigning each word in a text to a particular CCG lexical category. CCG supertagging is a supervised sequence labeling task: a user provides a training set of sentences with their corresponding labels and wants to learn and train a model able to label new, unseen sequences. Training examples consist of pairs (x, y), where x ∈ X is an input sequence of elements (x_1, x_2, ..., x_t) and y ∈ Y is the corresponding sequence of labels (y_1, y_2, ..., y_t); each label y_t corresponds to the element x_t, and the labels y_t belong to the label dictionary denoted by L. CCG supertagging can be formulated as follows: given a sequence of input words (x_1, x_2, ..., x_n), we aim to produce the corresponding CCG outputs (y_1, y_2, ..., y_n) from the set of labels {L} which guarantee:

S = \arg\max P(y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n)  (1-1)

Compared to standard POS tagging, CCG supertagging is much more complicated, as the POS tag set is much smaller than the CCG lexical category set used for CCG supertagging: CCG supertags include long-distance dependencies and contain much richer information than POS tags. In other words, there are many more CCG supertags per word than POS tags. Figure 1-2 gives an example of a CCG supertagged sentence. Since we use a bigger supertag set compared to the size of the POS tag set, the number of CCG supertags possibly associated with each word increases.

图1-2 Example of CCG Supertagged sentence.
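To make the formulation in Eq. (1-1) concrete, the following minimal Python sketch treats supertagging as a per-word argmax over a toy score table. The lexicon, categories and scores are invented for illustration only; they are not part of the thesis models, which learn these scores from the CCGBank training sections.

```python
# Minimal sketch of the CCG supertagging task as sequence labeling (Eq. 1-1).
# The tiny lexicon and scores below are invented for illustration only; a real
# supertagger learns P(y_1..y_n | x_1..x_n) from training data.

from typing import Dict, List

# Toy per-word scores over a handful of CCG lexical categories.
TOY_SCORES: Dict[str, Dict[str, float]] = {
    "Anna":  {"NP": 0.9, "N": 0.1},
    "loves": {"(S\\NP)/NP": 0.8, "S\\NP": 0.2},
    "David": {"NP": 0.85, "N": 0.15},
}

def supertag(sentence: List[str]) -> List[str]:
    """Assign to each word the highest-scoring CCG category (greedy argmax)."""
    tags = []
    for word in sentence:
        scores = TOY_SCORES.get(word, {"N": 1.0})  # back off for unknown words
        tags.append(max(scores, key=scores.get))
    return tags

if __name__ == "__main__":
    words = ["Anna", "loves", "David"]
    print(list(zip(words, supertag(words))))
    # [('Anna', 'NP'), ('loves', '(S\\NP)/NP'), ('David', 'NP')]
```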
1.3 Applications of CCG Supertagging

Most NLP applications are constituted by a set of different components, and each module is crucial for a specific analysis of natural language text. CCG supertagging is one of the fundamental NLP tasks and is very important because it influences various applications. In the following, we briefly discuss some of the applications that benefit from CCG supertagging.

• Parsing: parsing is the task of retrieving a valid structure for a string or list of tokens given a natural language grammar. In NLP, parsing is central to many tasks such as QA, MT and Information Retrieval (IR). CCG supertagging is a preliminary step that should be taken into account before full parsing, as the information encoded in CCG supertags makes parsing much simpler. CCG supertagging serves as the input to many parsers, such as the C&C parser [14]; it provides excellent performance and reduces parsing complexity.

• Machine Translation: syntax-based methods relying on powerful grammar formalisms promise to model translation in a more natural way [15]. CCG supertagging is also a crucial part of MT systems: by mapping words to their corresponding CCG supertags, it helps to model explicit target syntax in Neural Machine Translation (NMT) systems [16] and to benefit from the structurally rich CCG syntactic categories, thanks to the CCG grammar's ability to give a syntactic treatment to non-constituents, which frequently occur in both source and target languages in MT systems [15].

• Question Answering: one of the most interesting NLP tasks tackled by the CL community is that of knowledge Question Answering (QA) systems. In QA systems, CCG supertagging has proven useful in the parsing of questions; it increases parsing accuracy on questions, producing parsers suitable for the questions of QA systems [17], and it helps to extract pieces of information from the question that make it easy to retrieve the right answers. The main advantage of using CCG supertagging for QA systems is that we can directly obtain semantic representations of questions.

1.4 Categorial Grammar

Categorial Grammar (CG) [18] covers a family of the oldest lexicalized grammars proposed for the syntax and semantics of natural languages as well as logical and mathematical languages [19]. In CG, the main and entire responsibility for defining the syntactic form is carried by the lexicon, as in other grammars such as Head-Driven Phrase Structure Grammar (HPSG), Tree Adjoining Grammar (TAG), Lexical Functional Grammar (LFG), etc. A CG grammar consists of two parts: a lexicon, which assigns a category to each basic symbol, and a set of inference rules; it regroups a number of syntactic and semantic theories in which all expressions are classified by a syntactic type identifying them as functions or arguments, built from atomic and elementary arguments. One of the earliest extensions of CG was "Combinatory CG", which extends the core of CG with functional operations on adjacent categories, such as functional composition [20].

1.5 Combinatory Categorial Grammar

There are various grammar frameworks proposed for natural languages. CCG constitutes an important class of CG lexicalized grammar formalisms that has been argued to be the formalism used by humans, because it provides a natural linkage between syntactic structure and semantic representation [10]. Moreover, CCG offers higher flexibility compared to other grammars: it can derive the structure of any part of a sentence without deriving the structure of the whole sentence [21]. The CCG grammar associates rich syntactic types with words. In the last few decades, CCG has been used in several aspects of natural language understanding, e.g., parsing [22][23][24], semantics [25][26], and a vast range of NLP applications such as MT [27][16].

The CCG grammar is based on the CG grammar formalism and was developed by Steedman [28]. The primitive elements of CCG are categories. The syntactic types of the CCG grammar come in two kinds, atomic or complex:
1. Atomic categories: the basic vocabulary of simple categories, such as Sentence (S), Noun (N), Noun Phrase (NP) and Prepositional Phrase (PP).

2. Complex categories: complex types are of the form A/B and A\B, representing functions that combine an argument of type B to yield A as a result. They are built by combining atomic categories, or complex categories themselves, with slashes indicating whether the B argument precedes (\) or follows (/) the functor. In other words, A/B means that the argument should appear to the right, while A\B indicates that the argument should appear on the left.

In the CCG grammar, a lexical category is assigned to each symbol of a sequence (i.e., to each word). The following are examples of English entries associated with their possible CCG lexical categories:

{he, girl, lunch, ...} → N
{good, the, eating, ...} → N/N
{sleeps, ate, eating, ...} → S\N
{sees, ate, ...} → (S\N)/N
{quickly, today, ...} → S\S
{good, the, ...} → (S\N)/(S\N)

Unlike context-free grammars, which encode the information about structure with rules like S → NP VP and VP → V NP, the CCG grammar encodes structure in the categories, so such rules are not needed. Instead, the lexical categories associated with the words of a sequence determine how these words can be combined with other categories so as to appear in an acceptable order. Thereby the concept of combinators was introduced, whereby elementary categories are combined by combinators. The CCG grammar defines a number of combinators that allow combining one or two categories into a new category. In the following, three different types of combinators will be introduced. The most common CCG combinators that combine elementary categories are the Application, Composition, and lastly Type-raising combinators.
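The atomic/complex category notation defined above can be made concrete with a small data structure. The following Python sketch is illustrative only (it is not the thesis implementation) and encodes a few of the example lexicon entries listed above.

```python
# Minimal sketch of CCG categories as a recursive data structure: atomic
# categories (S, N, NP, PP) and complex categories A/B or A\B.

from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Atomic:
    name: str                      # e.g. "S", "N", "NP", "PP"
    def __str__(self): return self.name

@dataclass(frozen=True)
class Complex:
    result: "Category"             # the A in A/B or A\B
    slash: str                     # "/" means the argument follows, "\" means it precedes
    argument: "Category"           # the B in A/B or A\B
    def __str__(self): return f"({self.result}{self.slash}{self.argument})"

Category = Union[Atomic, Complex]

# Example lexicon entries from Section 1.5: 'girl' -> N, 'good' -> N/N, 'sleeps' -> S\N
N, S = Atomic("N"), Atomic("S")
lexicon = {"girl": N, "good": Complex(N, "/", N), "sleeps": Complex(S, "\\", N)}
print({w: str(c) for w, c in lexicon.items()})
```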
1.5.1 Application Combinators

CCG operates by first assigning a syntactic CCG category to each symbol (i.e., word) in a given sequence, and the backward and forward slashes determine how lexical categories may combine, as described in the preceding section. Given the category assignments, a derivation that combines the words' categories proceeds by combining the categories using combinators. Among the CCG combinators, the simplest are the application combinators: forward and backward application, often denoted by > and <, respectively.

For the forward application, a syntactic category of the form A/B indicates that the argument B should appear on the right. In other words, a syntactic category of type A/B takes B as an argument on the right, and the combination "A/B B" results in the category A. Mathematically:

A/B : f    B : a  →  A : f(a)  (1-2)

As an example, the CCG lexical category associated with the word 'powerful' corresponds to a function that maps from the domain of nouns N into the range of nouns N, resulting in "N/N". The association of this item with a function is represented by writing:

powerful → N/N

and the word 'girl' can be associated with the atomic category N:

girl → N

The argument N of the function N/N of the word 'powerful' appears to the right of the forward slash, and the result N is on the left. The fact that the slash in the functional type N/N slants rightward indicates that a noun must appear to the right of the category with which it will be combined. Application of the function N/N associated with the word 'powerful' to the category N of the word 'girl' results in the substring 'powerful girl' being combined into an atomic category with syntactic type N. The forward application for this example may be represented as follows:

In contrast, for the backward application, a syntactic category of the form A\B indicates that the argument B should appear on the left: a syntactic category of type A\B takes B as an argument on the left, and the combination "B A\B" results in the category A. Mathematically, the backward application is defined as follows:

B : a    A\B : f  →  A : f(a)  (1-3)

For example, the word 'day' can be associated with the syntactic category type S\NP; the backslash in S\NP indicates that an NP argument must be to the left. If the item 'nice' is associated with the atomic type NP, then the string 'nice day' can be combined as a sentence, with atomic syntactic type S. In this instance, the backward application operation can be represented as follows:

1.5.2 Composition Combinators

Composition combinators are combinatory operations that are needed for the arrangement of input sentences. The input to the composition combinators is two complex categories, and the output is also a complex syntactic type category. Similarly to the application combinators, both forward and backward composition combinators are defined, schematically as (>B) and (<B):

A/B    B/C  →  A/C  (>B)  (1-4)
B\C    A\B  →  A\C  (<B)  (1-5)

For the forward composition, noted as >B, the domain of the first lexical category should correspond to the range of the second category, resulting in a new function with the range of the first lexical category and the domain of the second. For example, the item "the", associated with the category NP/N, indicates that an N argument must appear in the range of a second category, such as the word "beautiful" with the syntactic type N/N. Then the string 'the beautiful' can be combined with the forward composition combinator as follows:

For the backward composition, noted as <B, the combination is defined analogously by equation (1-5).

1.5.3 Type-raising Combinators

Type-raising combinators turn an argument category into a function over the functions that take it as an argument. They are noted as >T for forward type-raising and <T for backward type-raising:

Forward type-raising:  A  →  T/(T\A)  (>T)  (1-6)
Backward type-raising: A  →  T\(T/A)  (<T)  (1-7)

where T is a variable type; in general, the variable T represents the S (Sentence) category type. For example, the category of syntactic type NP assigned to the word 'grammar' becomes a functional category with the forward type-raising combinator as follows:

and the application of the backward type-raising results in the following function:

To sum up, the following example illustrates the use of the three combinators to combine the CCG lexical categories associated with each word of the sentence "Anna loves David": a forward type-raising function is applied to the NP syntactic type mapped to the word "Anna", resulting in the complex category of type S/(S\NP), so that it can be combined with the category (S\NP)/NP by forward composition, giving a category of type S/NP; finally, the resulting category S/NP can be combined with the category NP assigned to the item "David" by forward application, resulting in an S category as the final result, as follows:
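The three combinator families can likewise be sketched in a few lines of Python. The sketch below is illustrative only (not the thesis implementation): it encodes complex categories as (result, slash, argument) tuples and reproduces the "Anna loves David" derivation with forward type-raising, forward composition and forward application.

```python
# Minimal sketch of the CCG combinator families from Section 1.5, using
# (result, slash, argument) tuples for complex categories and plain strings
# for atomic ones.

def fapply(left, right):
    """Forward application (>): A/B  B  ->  A   (Eq. 1-2)."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]

def bapply(left, right):
    """Backward application (<): B  A\\B  ->  A   (Eq. 1-3)."""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]

def fcompose(left, right):
    """Forward composition (>B): A/B  B/C  ->  A/C   (Eq. 1-4)."""
    if (isinstance(left, tuple) and isinstance(right, tuple)
            and left[1] == right[1] == "/" and left[2] == right[0]):
        return (left[0], "/", right[2])

def ftype_raise(cat, t="S"):
    """Forward type-raising (>T): A  ->  T/(T\\A)   (Eq. 1-6)."""
    return (t, "/", (t, "\\", cat))

# "Anna loves David": NP, (S\NP)/NP, NP -- derive S with >T, >B, > as in 1.5.3.
anna, loves, david = "NP", (("S", "\\", "NP"), "/", "NP"), "NP"
raised = ftype_raise(anna)                 # S/(S\NP)
partial = fcompose(raised, loves)          # S/NP
print(fapply(partial, david))              # S
```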
1.6 Literature Review

The area of supertagging has been enriched over the last few decades by contributions from several researchers. Since its inception at the end of the nineties [7], many new concepts have been introduced to improve the efficiency of supertagging and to construct supertaggers for several grammars [29] and languages [30][31]. More recently, several models have been used for the supertagging task to provide adaptive taggers. Sophisticated machine learning algorithms have been developed that acquire more robust information. In general, all machine learning models rely on hand-crafted features to provide good results. Hence, some of the recent works focus on deep learning models to cope with the problem of feature extraction. Finally, combinations of several machine learning and deep learning models have been used in the current research direction.

This section provides a brief review of prior work on supertagging. To be concise, we do not aim to give a comprehensive review of the related work. Instead, we provide a brief review of the different techniques used in supertagging and then focus on a detailed review of existing CCG supertagging methods. Firstly, we provide a brief discussion of the work performed on supertagging in general. Then, we discuss the application of machine learning algorithms to the CCG supertagging problem. Lastly, we discuss the most recent efforts in this area.

1.6.1 Supertagging

Supertagging was first proposed by Joshi and Bangalore [13] for Lexicalized Tree-Adjoining Grammar (LTAG) as the analogue of POS tagging for phrasal grammars, with the difference that the sets of POS tags are smaller than the sets of supertags used in lexicalized grammars. Compared to POS tags, supertags contain much more detailed syntactic information. To furnish this supplementary information, the sets of supertags must be much larger. Usually, a supertag set contains hundreds of tags. For instance, the set of LTAG supertags had 3964 tags [32], whereas most POS tag sets contain fewer than fifty possible tags [1]. When supertagging, even if the set of tags available for each word is restricted to those observed in the training data, the set of supertags that could be assigned to each word is still large.

Statistical machine learning methods were used for standard POS tagging disambiguation; in the same way, the earliest works on supertagging use local statistical information in the form of n-gram models of the distribution of supertags. The first and simplest model for supertag disambiguation uses the unigram model and selects a single tag for each word based on its local context [13]. The main objective of this model was to determine, for each word, the supertag with which it is most often associated. Unfortunately, the main problem with the unigram model is that it does not account for context, which is the source of many of the errors this model makes. Later appeared trigram models, called the trigram approximation because the resulting probability uses the two preceding tags t_{i-1} and t_{i-2} as context when predicting the probability of the current tag t_i. By doing so, the current tag t_i is conditioned on two previous tags of context.
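Written out explicitly, such a trigram (second-order Markov) model scores a tag sequence roughly as follows. This is a standard HMM-style formulation consistent with the description above; the exact formulas in the cited papers may differ in their conditioning and smoothing details.

```latex
% Trigram approximation for supertag disambiguation: the current supertag is
% conditioned on the two preceding supertags, and the best sequence is the
% argmax over all candidate sequences.
\begin{equation*}
\hat{t}_1^{\,n} \;=\; \arg\max_{t_1^{\,n}} \; \prod_{i=1}^{n}
  P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i)
\end{equation*}
```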
After that, the Two-Pass Head Trigram model [33] was proposed: by making a different contextual approximation than the trigram model, the two-pass head trigram model [33] attempts to overcome some of the mistakes that the trigram model makes. Unlike the trigram model, which always conditions the probability of the current supertag on the supertags of the two immediately preceding words, the two-pass head trigram model conditions on the supertags of the two immediately preceding head words.

All those works show that local supertag features are effective in supertag disambiguation. Since supertags encode dependency information, information about the distribution of distances between a given supertag and its dependent supertags can also be used. Chen [34] shows that long-distance "head supertag" features are also effective. Chen [34] redefines the notion of headedness in terms of supertags themselves, which enables the development of the one-pass head model. He also shows that not only structural (supertag) features but also lexical features can be important. Chen [34] shows that it is not only important to identify informative features, but also to design an appropriate framework in order to use those features effectively. Similarly to Ratnaparkhi [35] for POS tagging, Chen [34] developed an MEMM model for supertagging. Moreover, Chen [34] implemented several supertaggers based on distinct feature sets.

1.6.2 CCG supertagging

In the literature, the most popular approaches to solving sequence labeling problems use statistical machine learning techniques. These approaches primarily consist of building statistical models that assign to a word sequence the most probable tag sequence given the sequence of words, in a maximum likelihood approach. Furthermore, feature-based classification algorithms (e.g., Maximum Entropy (ME) models, CRF, Support Vector Machines (SVM), etc.) have been widely used and have achieved good results.
enceisdefinedas-14- 第1章Introductiontheproductoftheindividualprobabilitiesforeachcategory,asfollows:P¹cjhº=iP¹cijhiº(1-10)Duringtraining,thesupertaggerconsultsatag-dictionary,whichcontains,foreachword,thesetofcategoriesthewordwasseenwithinthedata.IfawordappearsatleastKtimes,thesupertaggeronlyconsidersthecategoriesintheword’scategoryset.IfawordappearslessthanKtimes,allcategoriesareconsidered.AfterabeamsearchalgorithmisusedtoretainonlytheN=10sequences.Clark[9]showshowthemodelcanbeusedtodefineamulti-taggerwhichcanassignmorethanonecategorytoeachword.ClarkandCurran[14]followClark[9]andassumealog-linearMEmodelwhereanaturalcombinationofseveralfeatureshasbeenincorporated.ClarkandCurran’s[14]modelusewordsandPOStagsplusthetwopreviouslyassignedlexicalcategoriestotheleftasfeaturesinthefive-wordwindowtodefineadistributionoverthelexicalcategorysetforeachlocalcontextcontainingthetargetword.Theyalsousedatagdictionarywhereeachentryisalistofallthecategoriesseenwiththewordinthetrainingdata.Thesupertaggerassignscategorieswhichhavebeenseenwiththewordinthedataforwordsseenatleastk=20timesandassigncategorieswhichhavebeenseenwiththePOStaginthedatatothewordsseenlessthanktimes.ThesetofthelexicalcategoriesusedbyClarkandCurran[14]isthesetofcategoriesthatappearatleastten(10)timesinSections02–21oftheCCGBankcorpus[38][39]resultingin425categoriesbecauseithasveryhighcoverageonunseendata[40].ThismodelreliesheavilyonPOStagstocome-upwithunknownandunseenwordsandisverysensitivetothequalityofthosetags;thisiswhythatitsperformancedecreasesaggressivelyoutsideofitstargetdomainwiththepresenceofunseenandrarewords.Following[41][11]forPOStagging,LewisandSteedman[23]werethefirsttoexplorefeed-forwardNeuralNets(NN)withunsupervisedwordembeddingsasfeaturesinsu-pervisedmodelsfortheCCGsupertaggingtask.Theuseofunsupervisedvector-spaceembeddingsofwordsallowsthemodeltobetterassignlexicalcategorieswithoutde-pendingonPOS-tagsasfeatures.Thenetworkusesfeaturesof3-wordcontextwindowsurroundingaword.ThekeyfeatureiswordembeddingsratherthanPOStags,initial--15- 
Following [41][11] for POS tagging, Lewis and Steedman [23] were the first to explore feed-forward Neural Nets (NN) with unsupervised word embeddings as features in supervised models for the CCG supertagging task. The use of unsupervised vector-space embeddings of words allows the model to better assign lexical categories without depending on POS tags as features. The network uses features from a 3-word context window surrounding a word. The key feature is word embeddings rather than POS tags, initialized with the 50-dimensional embeddings trained in [41] and fine-tuned during supervised training; words which do not have an entry in the word embeddings are replaced by an "unknown" embedding. The model also uses 2-character suffixes and capitalization features with some simple preprocessing techniques (i.e., words are lower-cased and all digits are replaced with 0; if an unknown word is hyphenated, the model backs off to the substring after the hyphen). Lewis and Steedman [23] predict CCG lexical categories with a neural network similar to that used by Collobert et al. [11] for POS tagging, using lookup tables. Word embeddings and non-embedding features are implemented with lookup tables which map each feature onto a vector in a fixed-dimensional space. The neural net consists of three layers: the lookup layer, which maps words and discrete features into vector embeddings of a fixed dimension; the hidden layer, with a hard-tanh activation function that makes the classifier non-linear; and the Softmax transfer function, which takes those inputs and outputs a probability distribution over lexical categories for the word in the center of the context window. Lewis and Steedman [23] follow Turian et al. [41] in using a linear-chain CRF, so that the probability of each supertag is conditioned on the surrounding supertags [42]. Thus the probability of predicting a category depends on word embeddings, capitalization and suffixes as features, as well as on the previously predicted category.

When traditional NNs are used, all inputs and outputs are independent of each other, and only a fixed number of predecessor words is used to predict the probability of the current word being assigned a specific supertag, although it would be desirable to take all the previous computations into account. For instance, in the CCG supertagging task, to predict the CCG lexical category of a given word in a sentence, it clearly helps to know the previous information, as each output depends on the previous computations. For this reason, the current direction of research includes the use of more sophisticated models to process sequential information, mainly based on deep learning methods.

RNNs were proposed in the 80's [43][44] for modeling time series and sequential data. The structure of RNNs is similar to that of a standard multilayer perceptron, with the distinction that it allows connections among hidden units associated with a time delay. Through these connections, the model can save and keep information from the past and perform the same process for every element of a sequence, with the output depending on the previous computations, enabling it to discover temporal correlations between inputs that are far away from each other in the data.
Recently, a lot of work has taken place on the construction of powerful CCG supertaggers. Xu et al. [45] exploited RNNs for the CCG supertagging task. In theory, when an RNN is used in the CCG supertagging task, the full sequence of predecessor computations is considered when predicting the current category. The model of Xu et al. [45] was based on three main features, similarly to Lewis and Steedman [23]: capitalization, suffixes and the use of word embeddings [41], which free the model from depending on any lexical or hand-crafted features; they also perform some data preprocessing (e.g., all words are lower-cased, all digits are replaced by a single digit, etc.). Their work revealed the effectiveness of recurrent networks for the CCG supertagging task.

For CCG supertagging, as well as for many sequence labeling tasks, it is beneficial to have access to future as well as past context. Bidirectional Recurrent Neural Networks (BRNNs) [46][47][48] offer a more elegant solution. The basic idea of BRNNs is to present each training sequence forwards and backwards to two separate recurrent hidden layers, both of which are connected to the same output layer. This provides the network with complete past and future context for every point in the input sequence. In another research work, Xu [49] proves that BRNNs consistently outperform unidirectional RNNs on the CCG prediction problem.

While in principle recurrent networks are simple and powerful and can learn from long sequences, retaining information in their hidden state for a long time, in practice they are very difficult to train properly and to get to use this ability to memorize information over long distances efficiently. Among the main reasons why RNN models are so unwieldy are the vanishing gradient and exploding gradient problems described in Bengio et al. [50]. To avoid the vanishing/exploding gradient problems associated with RNNs, many authors made numerous attempts to address this issue, such as skip connections [51][52], hierarchical architectures [53], leaky integrators [54], second-order methods [55], and regularization [56]. Among all of these, LSTM networks, invented by Hochreiter and Schmidhuber [57], were the best proposed recurrent networks to cope with the difficulty of training vanilla RNNs and solve the vanishing gradient problem.
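The recurrences discussed above can be summarized in a short NumPy sketch: a vanilla (Elman) RNN read left-to-right, and a bidirectional variant that concatenates the left-to-right and right-to-left hidden states for each position. The dimensions, the random initialization and the sharing of weights across directions are arbitrary illustrative choices, not the configurations used later in this thesis.

```python
# Minimal numpy sketch of a vanilla RNN and a bidirectional variant.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 8, 16, 5                      # input size, hidden size, sequence length
Wx, Wh, b = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)

def rnn(xs, Wx, Wh, b):
    """h_t = tanh(Wx x_t + Wh h_{t-1} + b); returns all hidden states."""
    h, states = np.zeros(d_h), []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return states

xs = [rng.normal(size=d_in) for _ in range(T)]
fwd = rnn(xs, Wx, Wh, b)                      # past context for each position
bwd = rnn(xs[::-1], Wx, Wh, b)[::-1]          # future context for each position
bidir = [np.concatenate([f, g]) for f, g in zip(fwd, bwd)]
print(len(bidir), bidir[0].shape)             # 5 (32,)
```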
Lastly, the work of Lewis et al. [58] and Vaswani et al. [59] was among the first to use LSTM recurrent networks for the CCG supertagging task, to overcome the drawbacks of the RNN-based models. Lewis et al. [58] and Vaswani et al. [59] used BLSTM recurrent networks because they are best suited to the structured CCG supertagging learning task, processing and predicting time series with time lags from both the left and the right direction. Although Lewis et al. [58] and Vaswani et al. [59] used different architectures, their findings prove that LSTM networks can learn from much longer historical input information than traditional RNNs.

1.7 Evaluation Metric

The goal of machine learning models is to learn to generalize well to unseen examples instead of just memorizing the data used during training. Once a model has been built, it is essential to decide whether it performs well; the most important question that arises is how good the model is. Evaluating the model is therefore one of the most important tasks in a data science project, as it delineates how good the predictions are. Many metrics are used in machine learning to measure the predictive accuracy of a model, and the choice of metric depends on the machine learning task. In multi-label problems such as POS tagging and CCG supertagging, "accuracy" is precisely the effectiveness measure and the most common evaluation metric used in the area. Accuracy for CCG supertagging can be defined as the proportion of correctly predicted labels to the total number of labels for that instance. To compute it, we use the CCG supertagger to assign lexical categories to each symbol in the test dataset and then compare the predicted categories to the ground-truth supertags. Overall accuracy is the percentage of supertags correctly labeled with respect to the gold labeled set, as follows:

\text{Accuracy} = \frac{\text{number of correctly supertagged words reported by the system}}{\text{total number of instances}}  (1-11)

where instances refers to the number of supertagged words.

1.8 Dataset

The Penn TreeBank (PTB) is the common dataset used for many NLP tasks. The PTB has been translated to support many linguistic formalisms, such as TAG [60][61], LFG [62], HPSG [63] and CCG [38][39]. To be comparable with the results reported by previous work on the CCG supertagging task [14][23][45][49][58][59], we experimented with the same dataset, the "CCGBank" corpus [38][39]. The CCGBank is a treebank of CCG normal-form derivations, created from the PTB (Marcus et al., 1993) with a semi-automatic conversion process. Hockenmaier [38] gives a detailed description of the procedure used to create the CCGBank dataset. The CCGBank corpus provides the lexical category set used by the supertagger. Figure 1-3 shows an example of a supertagged sentence from the CCGBank corpus.

图1-3 Example from section 00 of the CCGBank corpus.

We follow the standard split and divide the CCGBank dataset into Sections 02-21 as the training set to train our models, Section 00 as the development set, and Section 23 as the test set used for evaluating the performance of our models.
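A minimal sketch of the evaluation in Eq. (1-11) and of the standard CCGBank split is given below; the gold and predicted tag sequences are toy examples.

```python
# Minimal sketch of the supertagging accuracy of Eq. (1-11) and the standard
# CCGBank split described above; the gold/predicted tag lists are toy examples.

def supertag_accuracy(gold, predicted):
    """Fraction of words whose predicted CCG category matches the gold one."""
    assert len(gold) == len(predicted)
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Standard CCGBank split used throughout the thesis.
SPLIT = {"train": "sections 02-21", "dev": "section 00", "test": "section 23"}

gold = ["NP", "(S\\NP)/NP", "NP/N", "N"]
pred = ["NP", "(S\\NP)/NP", "NP/N", "S\\NP"]
print(SPLIT["dev"], supertag_accuracy(gold, pred))   # section 00 0.75
```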
1.9 Thesis Contributions

The main contributions of this thesis are efficient models for the CCG supertagging problem. To this end, it is necessary to develop and propose new techniques based on deep learning methods, with the aim of reducing the number of assigned CCG syntactic categories, which is beneficial to many real-world applications. In particular, the following specific issues are considered in this work:

a) Gated Recurrent Units for CCG Supertagging
b) Backward-BLSTM for CCG Supertagging
c) BLSTM-CRF model for CCG Supertagging
d) Character-Word embeddings for CCG Supertagging

To address the above-mentioned issues, we develop novel approaches and methods for CCG supertagging. The main goals of these approaches are briefly introduced in the following:

a) Gated Recurrent Units for CCG Supertagging

In contrast to previous studies based on machine learning algorithms for CCG supertagging, which require extensive feature engineering, the application of deep learning techniques to the CCG structured prediction problem is the basic objective of our work. Unlike the last proposed CCG supertagger based on simple RNNs, we propose a novel approach for CCG supertagging in which we apply GRU networks. This method uses word embeddings for each input entry, and then a deep GRU architecture is introduced. Unlike the existing RNN method, which uses a single direction for the input representation, we propose a two-directional method that reads inputs from both left and right positions using BGRU networks. Moreover, we use a deep architecture that is more suitable for capturing interactions between words. The experimental results show that the proposed architecture is an efficient model and achieves better performance than the state-of-the-art methods on both supertagging and multi-tagging.

b) Backward-BLSTM for CCG Supertagging

In this approach, a more efficient recurrent network is used, based on LSTM networks, which are proven to be more effective at memorizing input data over long periods. We introduce a combined architecture based on backward and BLSTM networks. The input entry representations are first fed into a Backward-LSTM layer, and then a BLSTM layer is used to better save historical entries from both directions. After that, a Softmax activation function is used to decode each output probability into its corresponding CCG category.
Our method was tested on three different datasets. The experiments demonstrate that our method achieves better results.

c) BLSTM-CRF model for CCG Supertagging

In this model, a new approach to CCG supertagging as a sequence labeling problem is presented. The proposed method combines the benefits of both machine learning and deep learning techniques: deep learning methods are used to automatically extract input feature representations, whereas traditional statistical models based on machine learning algorithms benefit from knowledge about neighboring predictions. An efficient method developed for CCG supertagging is introduced based on LSTM and CRF models. We combine the two strategies with the aim of benefiting from both input representations and prior output predictions. The experimental results on different datasets show that the proposed technique is efficient for the CCG supertagging task. The proposed model achieves better performance than the current state-of-the-art methods for both supertagging and multi-tagging.

d) Character-Word embeddings for CCG Supertagging

Different LSTM-based architectures have been proposed for the CCG supertagging task and achieve good results. However, existing models still suffer from the OOV problem, where unknown and rare words do not appear in the pre-trained word embeddings. In this work, we propose to exploit the strengths of different embeddings in a simple but effective way to deal with the OOV problem. In the proposed model, we combine both word embeddings and character embeddings in separate BLSTM networks to obtain efficient input representations, which proved accurate on out-of-domain datasets.
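As a preview of contribution d), the following sketch shows the general idea of concatenating a word-level embedding with a character-level summary vector before the sequence model. The lookup tables and dimensions are invented, and mean pooling over characters is used only to keep the sketch short; the model proposed in Chapter 5 runs a BLSTM over the characters instead.

```python
# Minimal sketch of a character-word input representation: a word embedding is
# concatenated with a pooled character representation. Sizes, tables and mean
# pooling are illustrative choices, not the thesis configuration.

import numpy as np

rng = np.random.default_rng(1)
WORD_DIM, CHAR_DIM = 50, 20
word_table = {"the": rng.normal(size=WORD_DIM)}           # pre-trained word vectors
char_table = {c: rng.normal(size=CHAR_DIM) for c in "abcdefghijklmnopqrstuvwxyz"}
UNK_WORD = np.zeros(WORD_DIM)

def represent(word):
    """Concatenate the word embedding with a pooled character representation."""
    word_vec = word_table.get(word.lower(), UNK_WORD)
    char_vecs = [char_table[c] for c in word.lower() if c in char_table]
    char_vec = np.mean(char_vecs, axis=0) if char_vecs else np.zeros(CHAR_DIM)
    return np.concatenate([word_vec, char_vec])

print(represent("the").shape, represent("supertagging").shape)   # (70,) (70,)
```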
1.10 Organization of the Thesis

This thesis is organized in five chapters. The present chapter introduces a brief overview of the CCG supertagging problem, along with the background, the motivation and the main contributions of this thesis, which are important for the rest of the document. It also presents the literature review of the supertagging techniques developed in general, as well as the approaches proposed specifically for CCG supertagging, such as ME, NN and other methods. The remaining chapters address the contributions discussed in Section 1.9, where we present the techniques developed during our Ph.D. study.

Chapter 2 provides a brief review of neural networks. We do not aim to give a comprehensive review of NNs and RNNs; instead, we briefly review RNNs for sequence tagging problems. We also describe our approach of applying GRUs to overcome the drawbacks of traditional RNNs. We outline the general model architecture and our implementation. The evaluation and the experimental results are presented at the end of the chapter.

Chapter 3 introduces the proposed LSTM-based architecture for the CCG supertagging task. First, we describe our neural network, where a Backward-LSTM layer is used for the input representations; the outputs of the Backward-LSTM layer are fed as input to a BLSTM network and then to a Softmax activation function as the final output generator, which decodes each probability into its corresponding CCG category. Second, we give details about our experimental settings. We also evaluate our model's efficiency on different datasets. Experimental results are presented at the end of the chapter.

Chapter 4 describes the third contribution of this dissertation. In this chapter, we first describe the architecture combining machine learning and deep learning strategies. We propose to exploit the strengths of both approaches in a simple but effective way so as to benefit from both input and output information. In this work, we use BLSTMs to model input entries and Conditional Random Fields (CRF) to model the tag outputs, which brings further improvement to the supertagging accuracy of the model. Secondly, we discuss our experimental settings and parameter tuning. Finally, we evaluate our model on both in-domain and out-of-domain datasets; the achieved results are discussed at the end of the chapter.

In Chapter 5, the problem of unseen words is addressed. The OOV problem can greatly influence the performance of the supertagger when tested on out-of-domain datasets, as well as on in-domain datasets with rare and infrequent words. Thus, we propose an effective and simple model based on character and word embeddings, in order to gain more information about input entries that do not appear in the pre-trained word embeddings. The first section of that chapter is devoted to our model description. Next, we report our experimental settings. In addition, we discuss our experimental results, and finally we conclude the chapter.

In the end, we conclude the thesis. Furthermore, future works of the research activity are discussed. Figure 1-4 shows the outline of the dissertation.

图1-4 Dissertation outlines.
Chapter 2 Gated Recurrent Units for the CCG Supertagging Task

2.1 Introduction
In the previous chapter, we presented the background of the CCG supertagging task and reviewed the previously proposed methods to solve it. Statistical machine learning methods work well for CCG supertagging only because of extensively designed input representations such as lexical features. Deep learning has emerged as a way to jointly learn good features. RNNs have been proposed for the CCG supertagging task as simple recurrent models that use a memory of the input representation instead of relying only on lexical features. However, vanilla RNN-based supertaggers are difficult to train to memorize long input sequences. Since RNNs cannot effectively memorize information, in this chapter we explore a more sophisticated recurrent model for the CCG supertagging problem. We present a deep learning approach based on Gated Recurrent Unit (GRU) networks to improve the performance of the supertagger. In this work, we make use of an efficient model that can memorize information not only over a long history but also from both past and future input sequences.
The organization of the chapter is as follows: Section 2.2 describes some basic definitions and notation for deep neural networks. Section 2.3 is devoted to our particular approach to CCG supertagging using GRU networks. Next, Section 2.4 describes the different experiments conducted for the task. In addition, Section 2.5 presents the experimental results on supertagging and multi-tagging. Finally, Section 2.6 provides the conclusion.

2.2 Neural Networks
In this section we briefly review some of the deep neural networks used for processing sequential data, including RNNs [43][44], BRNNs [46][47][48], and GRUs [64].

2.2.1 Deep Learning
Statistical machine learning strategies have been widely used for solving many NLP tasks such as sequence tagging problems, and more specifically for the CCG supertagging problem [14]. Most machine learning based models have exploited shallow structured architectures, which typically contain at most one or two layers. Gaussian Mixture Models (GMMs), linear or nonlinear dynamical systems, CRFs, ME models, Support Vector Machines (SVMs), and Multi-Layer Perceptrons (MLPs) are some examples of shallow architectures. Despite the effectiveness of shallow architectures in solving many simple or constrained problems, their main disadvantage is their limited modeling and representational power, which causes difficulties when dealing with more complicated real-world applications. Moreover, those methods require the design and selection of an appropriate feature space by experts, which is costly and difficult in terms of computational time and expert knowledge. As an alternative, automatically learning the features can be considered a relevant choice.
Artificial Neural Network (ANN) models have been introduced over decades if not centuries. Earlier studies with ANNs started in the late 1950s with the introduction of the perceptron, a two-layer network used for simple operations, and grew in the late 1960s with the development of an efficient gradient descent method called the back-propagation algorithm [43], applied to NNs for efficient training of multilayer networks. ANNs represent a class of machine learning models. In ANNs, the artificial neuron forms the computational unit of the model and the network describes how these units are connected to one another. The simplest version of ANNs is the feed-forward NN. Basically, a feed-forward NN receives a set of inputs and maps them to outputs. Each NN is constructed from several interconnected neurons, organized in layers with associated weights. An example of a feed-forward NN is shown in Figure 2-1; it consists of three layers: the input layer, which reads the inputs and transfers them to the hidden layer, which performs computations whose results are transferred to the output layer. Feed-forward NNs excel at solving the CCG supertagging task [23] and at predicting the most likely
tags for the words in a given sentence without requiring any lexical features. However, the main disadvantage of NNs is the huge number of free parameters (the weights) to be learned. Moreover, for the CCG supertagging task, feed-forward NNs can only map from input to output vectors with no cyclic connections, and the output may only depend directly on the current input at that time step, without any information about the surrounding inputs.

Figure 2-1 An example of an Artificial Neural Network.

There has been a resurrection of interest starting from the mid-2000s, with the inception of the fast-learning algorithm by G. Hinton [65] and the introduction of GPUs, roughly in 2011, for massive numeric computation, which opened the route for modern deep learning as the new generation of NNs characterized by deep architectures.

Figure 2-2 An example of a Deep Neural Network.

Feed-forward NNs or Multi-Layer Perceptrons (MLPs) with many hidden layers, often referred to as deep neural networks (DNNs), are examples of models with a deep architecture, i.e., with more than two layers: an input layer, one or more so-called "hidden" layers, and an output layer, as depicted in Figure 2-2. A few years ago, researchers called networks with 3-5 layers "deep"; now the depth has gone up to 100-200 layers. Deep learning has appeared as a new area of machine learning research [65][66] with the objective of moving machine learning towards its original goal: AI. Modern deep learning networks have been applied with success to many NP-complete problems. By adding more levels (layers), researchers have reported positive experimental results for several tasks [67][68][69][70]. From the mid-2000s to nowadays, the techniques developed in deep learning research have been impacting a wide range of signal and information processing work, within both the traditional and the new key aspects of machine learning and AI [66][71][72][73][74].

2.2.2 Recurrent Neural Networks
In CCG supertagging, the observation sequence may depend on multiple inputs through long historical dependencies. Models and NNs that can map from the entire history of inputs to predict each output, and that allow recurrent connections, are therefore necessary. One way to satisfy these criteria is to use RNNs to estimate the output probabilities based on the current and past inputs, allowing cyclic connections with a sufficient number of hidden units [75]. Recurrent networks are the most important models for CCG supertagging.
RNNs are very flexible methods: a family of ANN architectures that have the ability to make use of sequential information, performing the same action for each element in a sequence, where the output at a given time step is related to that of previous time steps over long-distance dependencies. The primary advantage of RNNs is the memory of their recurrent connections, which captures information and stores previous inputs in the internal network state in order to influence the network output. An RNN can be viewed as an NN specialized for processing a sequence of symbols (x_1, x_2, ..., x_t). Most recurrent networks can also process sequences of variable length, and much longer sequences than feed-forward NNs. Different variants of RNNs have been proposed, such as Elman networks [44], Jordan networks [76], time delay neural networks [77] and echo state networks [78]. The structure of the RNN models widely used for sequence tagging problems consists of an input layer, a hidden layer, and an output layer, as depicted in Figure 2-3.
Figure 2-3 General structure of simple RNNs.

A useful way to visualize RNNs is by 'unfolding' the cyclic connections of the network over the input sequence. Figure 2-4 is an example of an RNN unfolded for three (3) time steps. In Figure 2-4, Section A represents the folded state of the RNN, with its corresponding unfolded version in Section B, obtained by unrolling the network structure over the complete input sequence at different, discrete times; in this example it contains a three-layer neural network and can be referred to as a deep neural network because it has more than one hidden layer. Note that the unfolded graph, unlike the folded graph, contains no cycles. In Figure 2-4, the U weights are the weights of the neurons between the inputs x and the hidden state h, the W weights are the weights of the neurons between hidden states h, and the V weights are the weights of the neurons between the hidden states h and the output O. Each node represents a layer of network units at a single time step.
The formulas that govern the computation happening in an RNN are as follows:
• x_t is the input at time step t.
• h_t is the hidden state at time step t. It is the "memory" of the network. h_t is calculated based on the previous hidden state and the input at the current step:

    h_t = f(U x_t + W h_{t−1})    (2-1)

The function f is usually a nonlinearity such as tanh or ReLU. h_{t−1} is required to calculate the first hidden state and is typically initialized to zeros.
• O_t is the output at time step t.

Figure 2-4 General structure of a simple RNN unfolded for three time steps.

In CCG supertagging, the hidden layer h_t is updated based on the input x_t, which represents the input features, and on the previous hidden state h_{t−1}, and the output layer y_t represents the predicted lexical categories. Formally, the RNN computes the hidden layer h_t and the output layer y_t as follows:

    h_t = f(U x_t + W h_{t−1})    (2-2)
    y_t = g(V h_t)                (2-3)

where U, W, and V are the connection weights, and g is the activation function.

2.2.3 Bidirectional RNN
The RNNs we have presented have "causal" structures, i.e., the current input is influenced by the past, but not the future. In sequential data, the state of the output depends on the previous inputs as well as on the future state. There is a special category of RNNs in which the state of the system at time step t depends not only on the inputs learned from the past but also on the inputs from the future. This sort of RNN, which can capture information from the whole sequence, is known as the Bidirectional RNN (BRNN). As its name suggests, a BRNN contains two RNNs to process the sequence from the two directions, so that we have information from the whole sequence. In BRNNs, at each time step, we have two hidden states: one hidden state captures information from left to right, while the second captures information in the opposite direction, from right to left.
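To make equations (2-2) and (2-3) and the bidirectional idea more concrete, the following NumPy sketch shows one recurrent step and a bidirectional pass that concatenates the two hidden states at every position. It is only an illustration under simplifying assumptions (tanh as the nonlinearity f, and shared weights for both directions, whereas a real BRNN learns separate parameters per direction); it is not the implementation used in this thesis.

    import numpy as np

    def rnn_step(x_t, h_prev, U, W, V):
        """One step of the simple RNN in equations (2-2) and (2-3)."""
        h_t = np.tanh(U @ x_t + W @ h_prev)   # hidden state, f = tanh
        y_t = V @ h_t                         # output scores (before any Softmax)
        return h_t, y_t

    def brnn_states(xs, U, W, hidden_dim):
        """Bidirectional pass: run the recurrence left-to-right and right-to-left
        and concatenate the two hidden states at every time step."""
        forward, backward = [], []
        h = np.zeros(hidden_dim)
        for x in xs:                          # left-to-right pass
            h = np.tanh(U @ x + W @ h)
            forward.append(h)
        h = np.zeros(hidden_dim)
        for x in reversed(xs):                # right-to-left pass
            h = np.tanh(U @ x + W @ h)
            backward.append(h)
        backward.reverse()
        return [np.concatenate([f, b]) for f, b in zip(forward, backward)]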
An unfolded graphical representation of a BRNN is depicted in Figure 2-5.

Figure 2-5 General structure of a BRNN unfolded for three time steps.

It has been shown that both simple and bidirectional RNN based models do better than feed-forward NN models for the CCG supertagging task [45][49] and have achieved the state of the art. While traditional RNNs are able to use contextual information when mapping between input and output sequences, the length of the context that can be memorized in practice is quite limited. The main complication with vanilla RNNs is that the model cannot concentrate on longer-term predictions: the influence of inputs on the hidden layers, and therefore on the network output, either decays or blows up exponentially as it cycles around the network's recurrent connections. In other words, RNNs have a reasonable memory but no capacity to remember things over long-distance dependencies. This makes it harder for the model to learn long-term dependencies in the input sequence, which is often referred to as the vanishing gradient problem [79][80][50], as illustrated schematically in Figure 2-6.

Figure 2-6 Illustration of the vanishing gradient problem.

Several alternative recurrent cells have been proposed to satisfy the above criteria, i.e., to train easily while avoiding the vanishing gradient problem. One variation is the Gated Recurrent Unit network recently proposed by Cho et al. [64], which avoids the vanishing gradient problem and is easier to train than traditional RNNs thanks to its gating mechanism.

2.2.4 Gated Recurrent Units
GRU networks are a useful family of recurrent deep neural networks for processing sequential data. GRUs were proposed by Cho et al. [64] to overcome the shortcomings of RNNs. The main component of GRUs is the "memory cell", which decides how much information to keep in memory from the previous states. GRUs are known to be good at preserving long-distance dependencies, using additional parameters that control when and how their memory is updated. Conceptually, GRU networks have reset and update gates that help to protect the memory, so that the network is able to make longer-term predictions; they control the information as follows:
• The reset gate r determines how to combine the new input with the previous memory and decides whether the past sequence is relevant for the future or not.
• The update gate z defines how much of the previous memory information to keep around.
Mathematically, the GRU hidden state h_t given an input x is calculated as described by the equations below:

    z_t = σ(W_z [h_{t−1}, x_t] + b_z)              (2-4)
    r_t = σ(W_r [h_{t−1}, x_t] + b_r)              (2-5)
    h̃_t = tanh(W_h [r_t ⊙ h_{t−1}, x_t] + b_h)     (2-6)
    h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t          (2-7)

where σ is the logistic sigmoid function, r and z are respectively the reset and update gates, ⊙ stands for element-wise multiplication, the W are the weight matrices, and the b terms denote bias vectors. Figure 2-7 illustrates the GRU components, where r and z are the reset and update gates, respectively, and h and h̃ are the activation and the candidate activation.

Figure 2-7 Gated Recurrent Units architecture [64].
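A direct transcription of equations (2-4)-(2-7) into NumPy may help make the gating mechanism concrete. This is only an illustrative sketch, with the weight matrices acting on the concatenation [h_{t−1}, x_t] as above; it is not the Keras implementation used in our experiments.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
        """One GRU step following equations (2-4)-(2-7)."""
        hx = np.concatenate([h_prev, x_t])
        z = sigmoid(Wz @ hx + bz)                                        # update gate, eq. (2-4)
        r = sigmoid(Wr @ hx + br)                                        # reset gate,  eq. (2-5)
        h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)   # candidate,   eq. (2-6)
        return (1.0 - z) * h_prev + z * h_tilde                          # new state,   eq. (2-7)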
2.3 The proposed BGRU model for the CCG supertagging task
Recurrent networks are considered a class of deep networks for supervised as well as sequence learning tasks, where the depth can be as large as the length of the input data sequence. In CCG supertagging, we want to output a prediction y which may depend on the whole input sequence. The information from both the past and the future directions of an input entry is very important for the prediction of the current output. For this reason, it is reasonable to use models that are able to capture both previous and future input information. An elegant solution for modeling the CCG sequential data, which has achieved high accuracies in many sequence labeling tasks, is bidirectional models. In our proposed model, we use a bidirectional architecture based on GRU networks. The idea behind Bidirectional GRUs (BGRUs) is to present each sequence to two separate layers so as to capture information from the two sides of an input entry: one layer processes the data from right to left and retains the previous information, whereas the second layer uses the opposite direction and processes the data from left to right, saving the future context, which is well suited to our task. Finally, the two outputs from the layers are concatenated to form the final output.
Our proposed method consists of three main components to predict the final CCG output supertags: the Input Layer, the GRU Neural Network, and the Output Layer.

2.3.1 Input Layer
With the emergence of NNs, the notion of "embedding" has been introduced and widely used; it refers to the representation of symbolic information in natural language text, mapped from sparse vectors of very high dimension (i.e., the vocabulary size V) into low-dimensional, real-valued vectors via a neural network, which are then processed by NN layers. The early work highlighting the importance of word embeddings came from Collobert and Weston [68], Turian et al. [41], and Collobert et al. [11], although the original form came from Bengio et al. [81] as a side product of language modeling. Given a sentence of N words (W_1, W_2, ..., W_N), the embedding feature of a word w_t, W: words → R^n, is a parameterized function mapping words in some language to vectors; it is obtained by projecting the word into an n-dimensional vector space through a lookup table. Each dimension describes syntactic or semantic properties of the word. Word embeddings have been exceptionally successful and play a vital role in improving the performance of many NLP tasks such as sequence tagging problems [11][23]. The key advantage of using a continuous space to represent words (or phrases) is its distributed nature, which enables sharing or grouping the representations of words with a similar meaning.
In the input layer of our model, we make use of word embeddings, which have been proven to be useful for the CCG supertagging task [23]. We make use of two kinds of word embeddings, where each word is transformed into an id (i.e., identification) defined in a lookup dictionary; the dictionary consists of the words in the training set, which are then embedded into a low-dimensional representation.
1. Word index (task-specific) embeddings: we use a task-specific word embeddings model, because several misspelled words, abbreviations, and compositions of words occur in the training data. These words are identified as 'UNKNOWN' words by a pre-trained word embeddings model. We build our task-specific word embedding model using the 'EMBEDDING' layer of the Keras [82] library. The embedding layer takes as input a 2-dimensional matrix of integers representing each word in the corpus (the index of the word in the corpus) and outputs a 3-dimensional matrix, which represents the word embedding model that maps the integer
inputs to the vectors found at the corresponding index in the embedding matrix [82].
2. Pre-trained word embeddings: our best model uses the pre-trained Google Word2Vec 300-dimensional embeddings trained on 100 billion words from Google News [83]. Following Collobert et al. [11], all words are lowercased before passing through the lookup tables, which convert them into their corresponding embeddings, and all numbers are replaced by the single digit '0'. For a word that does not have an entry in the pre-trained word embeddings, the 'UNKNOWN' entry from the pre-trained embeddings is used.
Moreover, following Lewis and Steedman [23], two sets of features are used, namely suffixes and capitalization:
1. Capitalization feature: the capitalization feature has only two values, indicating whether a given word is capitalized or not.
2. Suffix feature: we follow most of the state-of-the-art existing CCG supertaggers in using suffixes of size two.
The lookup tables are first concatenated in the input layer and then fed into the network.

2.3.2 GRU Neural Network
CCG supertagging is performed using a BGRU architecture. As the name suggests, our model has a bidirectional architecture which combines two GRU layers: the first GRU layer moves forward through time, from the start of the sequence to its end, and the second GRU layer moves in the opposite direction and processes the sequence from its end to its beginning. This allows the output units O(t) to compute a representation that depends on both the past and the future.
In our proposed model, the inputs encoded by the preceding process in the input layer are fed to a BGRU neural network. A forward GRU layer processes the input sequence from left to right and computes the hidden state (→h_t) to save information from the past, and the backward GRU layer saves the future information of a given input by processing the sequence starting from its end and calculating the hidden state (←h_t). Deep architectures have proved to be fruitful for many tasks. Therefore, in our model, we investigate a 2-BGRU network architecture, which is more convenient for capturing complex interactions in the context between words. The outputs from each GRU (backward and forward) are then fed into another pair of backward and forward GRU layers. Finally, the outputs from each layer at each time step are concatenated [→h_t, ←h_t] and fed through the output layer. The architecture of our model is shown in Figure 2-8.

Figure 2-8 The proposed BGRU model for CCG supertagging.

2.3.3 Output Layer
Equation (2-8) gives the Softmax function used for our CCG supertagging task as a multiclass prediction problem: it is the ratio of the exponential of an input value to the sum of the exponentials of all input values, and it outputs the probability of each CCG class over all possible classes. In the CCG supertagging task, the Softmax activation function returns the probability of each class, which is helpful for determining the final, most likely CCG supertag to predict, i.e., the one with the highest probability.

    Softmax(x_i) = exp(x_i) / Σ_j exp(x_j)    (2-8)

The main advantage of using Softmax is that it ensures that the output probabilities range between 0 and 1 and that the sum of all the probabilities is equal to one. As a result, the output at each time step from the BGRU architecture is fed through a Softmax layer, which decodes it into probabilities for each supertag, forming the final output of the network.
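A minimal Keras-style sketch of the architecture described in Sections 2.3.1-2.3.3 is given below. Only the word-embedding input is kept (the suffix and capitalization lookup tables are omitted for brevity), the vocabulary size and padded sentence length are placeholder assumptions, and a recent Keras API is assumed, so argument names may differ slightly from the Keras 1.2.2 release used in our experiments.

    from keras.models import Sequential
    from keras.layers import Embedding, Dropout, GRU, Bidirectional, TimeDistributed, Dense

    VOCAB_SIZE = 50000     # assumption: size of the word lookup dictionary
    MAX_LEN = 100          # assumption: padded sentence length
    N_SUPERTAGS = 1286     # CCG lexical categories observed in the training data

    model = Sequential()
    # word lookup table; initialised with the pre-trained Word2Vec vectors
    model.add(Embedding(VOCAB_SIZE, 300, input_length=MAX_LEN))
    model.add(Dropout(0.2))
    # two stacked bidirectional GRU layers (forward and backward outputs concatenated)
    model.add(Bidirectional(GRU(300, return_sequences=True)))
    model.add(Bidirectional(GRU(300, return_sequences=True)))
    # per-token Softmax over all CCG supertags (equation 2-8)
    model.add(TimeDistributed(Dense(N_SUPERTAGS, activation='softmax')))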
2.4 Experiment Settings
In this section, we report the datasets and training parameters of our experiments; the achieved results are then discussed. We conduct experiments to evaluate our model by applying it to supertagging and multi-tagging for the CCG grammar.

2.4.1 Dataset
As described in Chapter 1, Section 1.8, we use the CCGBank corpus [39] for our experiments. Following the standard split, we trained our models on Sections 2-21 of the CCGBank, using Section 00 (1,913 sentences) for development. Our experiments test the utility of our proposed models on Section 23 of the CCGBank (2,407 sentences) as the test set.

2.4.2 Data Preprocessing
In our experiments, data preprocessing was first applied before passing the dataset through the lookup tables. We preprocessed all the datasets as follows:
• All words were lowercased,
• all sequences of digits were collapsed into a single digit '0',
• for words and numbers containing ′n′, we backed off to the substring after the delimiter.

2.4.3 Hyper-Parameters and Training
We implemented the neural network using version 1.2.2 of Keras [82], a Theano-based neural network library. Training and testing were done at the sentence level.

2.4.4 Word Embeddings Settings
We follow the recent work reported in [11], based on neural network architectures. Collobert et al. [11] applied neural network architectures and related deep learning algorithms to solve NLP problems from "scratch", where no traditional NLP methods are used; features are extracted automatically, avoiding hand-crafted feature engineering. Collobert et al. [11] automatically learn internal representations, or word embeddings, from vast amounts of mostly unlabeled training data while performing a wide range of NLP tasks such as chunking, POS tagging and Semantic Role Labeling (SRL).
For pre-trained word embeddings, we initialized our model with the publicly available pre-trained vectors created using word2vec, i.e., 300-dimensional vectors trained on Google News, named 'Word2vec' [83]. For CCG supertagging, we apply a BGRU neural network architecture with two backward and two forward layers. We tested the accuracy of our model on the development set with hidden dimension values in the set {100, 200, 256, 300, 400, 512, 600} and found that a hidden dimension of size 300 gives the best accuracy. For suffixes and capitalization, we follow Lewis and Steedman [23] and use embeddings with a fixed size of 5.

2.4.5 Learning Algorithm
We use the SGD optimizer, a gradient descent method, to train our models with a fixed learning rate of 0.01. We explored other, more sophisticated optimization algorithms such as Adam and AdaDelta [84] without any remarkable improvement over SGD. Finally, the outputs received from the GRU neural network are fed to the output layer with the Softmax activation function to output a CCG supertag category for each word in an input sentence.

2.4.6 Dropout
Over-fitting is very common in deep neural network training. In recent years, deep learning approaches have seen important success with the introduction of the new regularization method based on "dropout", originally proposed by Hinton et al. [85]. We applied dropout to the input layer with a fixed probability of 0.2, which was quite effective in regularizing our model and reducing over-fitting, giving significant improvements in accuracy.

2.5 Results and Analysis
In this section, we present the results of the evaluation of our proposed BGRU architecture for CCG supertagging on the CCGBank datasets. We also perform multi-tagging experiments; the results are discussed below.

2.5.1 Supertagging Results
We trained our models for 90 epochs and used the model parameters that give the highest accuracy on the development set. We tuned the hyper-parameters and then trained the models.
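The supertagging accuracies reported below are per-word scores: a word counts as correct when its single most likely predicted supertag equals the gold-standard lexical category. A small helper of the following form (a sketch for illustration, not the evaluation script actually used) makes the metric explicit.

    def one_best_accuracy(gold_sentences, predicted_sentences):
        """Percentage of tokens whose 1-best predicted supertag
        equals the gold-standard lexical category."""
        total = 0
        correct = 0
        for gold, pred in zip(gold_sentences, predicted_sentences):
            for g, p in zip(gold, pred):
                total += 1
                if g == p:
                    correct += 1
        return 100.0 * correct / total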
The final chosen parameters are reported in Table 2-1.

Table 2-1 The final chosen hyper-parameters.
Hyper-parameter    Value
Word embeddings    Google's Word2Vec
Hidden dimension   300
Dropout            0.2
Optimizer          SGD
Learning rate      0.01

CCG supertagging, like many sequence tagging problems, has long been dominated by machine learning methods. We compare our model's performance with the experimental results reported for Clark and Curran's model with gold and auto POS tags, which are obtained using ME models with a set of lexical features. Moreover, since NNs with word embeddings have been a popular approach, we also compare with the model proposed by Lewis and Steedman [23], and with the best results reported for CCG supertagging by Xu et al. [45]. Table 2-2 compares our results with those models on Section 00 of the CCGBank corpus (development set).

Table 2-2 Performance comparison with state-of-the-art methods on the development set.
Model            Accuracy
C&C (gold POS)   92.60
C&C (auto POS)   91.50
NN               91.10
RNN              93.07
Ours             93.47

The results in Table 2-2 indicate that the deep learning models (RNN and BGRU) produce better results than the machine learning based approaches. It can be seen that our BGRU achieves higher accuracy than the C&C model with gold POS tags, with an improvement of +0.9%. Moreover, our model gains +0.40% over the RNN model, which shows that the use of GRUs can bring better performance than simple recurrent networks and that the use of BGRUs is very useful for modeling and memorizing more information from both directions of an input entry.
The overall results of our experiments on the test set are shown in Table 2-3. The BGRU model improves the performance of CCG supertagging to a significant extent, bringing accuracy up from 91.57% to 93.87% compared to the feed-forward NN of Lewis and Steedman [23], and it also outperforms the RNN model proposed by Xu et al. [45] by a significant margin. This may be due to the higher quality of a network that can learn from past and future entries, which helps the model make more accurate predictions.

Table 2-3 Performance comparison with state-of-the-art methods on the test set.
Model            Section 23
C&C (gold POS)   93.32
C&C (auto POS)   92.02
NN               91.57
RNN              93.00
Ours             93.87

2.5.2 Multi-tagging Results
Supertaggers have been used effectively in a range of NLP tasks, such as Information Retrieval (IR) and parsing. There are a variety of ways to carry out the supertagging task. Initially, supertaggers were used to choose a single supertag, specifically the most likely supertag for a given word in a given context in the training data. By reducing the set of possible lexical categories assigned to each word in an input sequence, supertaggers contribute to dramatically improving the efficiency of many NLP tasks by providing a crucial source of information. However, in some cases, supertaggers will provide imperfect and incorrect supertags. For example, when parsing, if the supertagger assigns only a single supertag to each word, it may not lead to a valid parse structure, and its accuracy is too low to be effectively incorporated into a parser, as the parser has no other alternatives to consider, resulting in a degradation in accuracy.
A satisfying solution to this problem, which is also beneficial for accuracy improvement, is multi-tagging. Multi-tagging refers to the task of assigning more than one supertag to each word in the sentence. However, the immediate question that multi-tagging raises is: in what order should the tags be considered? To answer this question, the supertagging literature has proposed several different ways of performing the multi-tagging task. Chen et al. [86] address this question using a trigram-based supertagger to choose multiple tags, and then the Viterbi algorithm to determine the most likely sequence. After that, instead of associating each word with the most likely predicted supertag from the most likely path, each word was associated with the "n" supertags that had the highest prefix probabilities. By increasing the number of supplementary assigned supertags, the number of parsed sentences increases simultaneously, and the more correct the provided set of supertags, the higher the coverage will be, leading to an increase in accuracy. However, when the number of provided supertags is over four, parsing becomes unattainable due to the time constraints of parsing speed. Accuracy is decreased in two ways: by not providing enough categories at any level, leading to no spanning analysis; or by providing too many categories, causing an explosion in the chart. By multi-tagging we can make the supertagger more accurate, but at the cost of speed, as the parser must consider larger sets of possible categories.
Clark and Curran [14] also approached this question, but differently: the authors developed a multi-tagger based on the ME-based supertagger and defined levels as cutoffs for multi-tagging based on the probabilities from the model. The levels set by Clark and Curran [14] define cutoffs for multi-tagging based on the probabilities from the maximum entropy model. If the parser is unable to form a spanning analysis, the level is decreased and the supertagger is rerun. The exact values of these levels greatly influence parsing accuracy and speed, where each level refers to the ambiguity of the number of supertags assigned to each word. Rather than defining a fixed number of tags to be produced per word, the supertagger includes all tags whose probabilities are within a factor β of the highest probability category. Mathematically, if we have n observations, the supertags selected according to the β factor should satisfy the following equation:
    Y_i = { y | P(Y_i = y | S) > β }    (2-9)

where Y_i is the set of supertags assigned to the word x_i at time step i in a given sentence S. According to this equation, for each word in the sentence, the multi-tagger assigns all those categories whose probabilities are within the β factor of the highest probability category for that word.
We also evaluate our proposed model for multi-tagging, where our supertagger can assign more than one category to each word, namely those whose probabilities are within the β factors. The performance of our proposed model on multi-tagging is measured in terms of WORD accuracy, where we consider a word to be tagged correctly if the correct category is included in the set of assigned lexical categories, and SENT (sentence) accuracy, which is the percentage of sentences whose words are all supertagged correctly, using the default levels of the C&C parser [14] on the development set. The results of these experiments are presented in Table 2-4. It can be seen that our model performs much better than the previous models in terms of both WORD and SENT accuracy at all levels.

Table 2-4 Performance comparison of different models for multi-tagging accuracy on Section 00 at different levels.
Level    GRU            RNN            NN             C&C (auto POS)   C&C (gold POS)
         Word   SENT    Word   SENT    Word   SENT    Word   SENT      Word   SENT
0.075    97.22  67.22   97.33  66.07   96.83  61.27   96.34  60.27     97.34  67.43
0.030    98.08  74.90   98.12  74.39   97.81  70.83   97.05  65.50     97.92  72.87
0.010    98.71  81.87   98.71  81.70   98.54  79.25   97.63  70.52     98.37  77.73
0.005    99.01  85.04   99.01  84.79   98.84  83.38   97.86  72.24     98.52  79.25
0.001    99.42  90.92   99.41  90.54   99.29  89.07   98.25  80.24     99.17  87.19

2.6 Summary
A backward GRU is very powerful in capturing past information over a long time, memorizing previous context information; on the other hand, a forward GRU is also very efficient at memorizing future information over long periods. However, it is well known that a
第3章Backward-BLSTMmodelfortheCCGSupertaggingtask第3章Backward-BLSTMmodelfortheCCGSupertaggingtask3.1IntroductionAsdiscussedinthepreviouschapter,animportantbenefitofrecurrentnetworksistheirabilitytousecontextualinformationwhenmappingbetweeninputandoutputsequences.Unfortunately,traditionalRNNsfunctionbetterintheorythaninpracticebecauseRNNscanbuildtheircontextualinformationuponnomorethanthelastten(10)timesteps.Thatisbecausetheysufferfromproblemswithvanishingorexplodinggradientswherethecontextualinputinformationcanonlybeheldinanetwork’s"memory"foralimitedamountoftime.Sincethe1990s,toaddressthislimitation,researchershavedevelopedmanyalgo-rithmsandproposedmanyarchitectures,forinstance,GRUnetworksusedinthepreviouschapterbyChoetal.,[64].However,themostsuccessfulsolutioninapplicationsthathaveproventogivethebestresultsuptillnowandfavoredinthischapterisnamedLongShort-TermMemory(LSTM)networks.LSTMsarearedesignoftheRNNarchitecturearoundspecial"memorycell"units.LSTMbasedmodelshavebeenprovedtoperfectlyhandlewithgradientvanishingproblemofRNNs.Inthecontextofsequencemodeling,manyresearchershavesuccessfullyappliedsuchmechanismtolearnsequencesforlongtimespans.Inthiswork,thespecifictypeofneuralnetworkusedwasaBidirectionalLongShort-TermMemory(BLSTM)basedrecurrentnetwork.Wedesignasimpleandeffectivearchitecture.Moreover,wedemonstratethatbysimplycombiningabackwardLSTMandBLSTM,wecancapturelong-terminformationandwecanobtaincompetitiveperformancecomparedtothestate-of-the-artrecurrentnetworksfortheCCGsupertaggingtask.Thechapterisorganizedasfollows:Section3.1describessomebasicsofLSTMs.Next,Section3.2isdevotedtoourparticularapproachtoCCGsupertaggingusingBLSTMmodel.Inaddition,Section3.3describesthedifferentexperimentsconductedforthistask.Moreover,Section3.4presentstheexperimentalresultsforbothsupertaggingand-43- 哈尔滨工业大学工学博士学位论文multi-taggingexperiments,andfinallySection3.5providestheconclusion.3.1.1LongShortTermMemoryNetworksInthissection,wegiveadetaileddescriptionofLSTMnetworks.WealsodescribeBLSTMnetworkswhichhavegreatinfluenceontheCCGsupertaggingasmanysequencelabelingtasks.ThecyclicmechanismenablesRNNstorememberinputsatdifferenttimesteps.Theyare,therefore,averygoodchoiceforsequencelearning.However,becauseoftheirdifficulttrainingwheregradientdescentbasedalgorithmsgenerallyfailtoconvergeortaketoomuchtimeorbecauseoftheexploding/vanishinggradientproblem,whichimpliesthatthegradients,duringthetraining,eitherbecomeverylargeorverysmalltheirapplicationsinpracticewerequitelimitedtillthelate1990s.TherehavebeenmanyproposedapproachestodiminishthedrawbackswhentrainingRNNsincludingGRUnetworksintroducedinthepreviousChapter.Amongall,LSTMdiscoveredbyHochreiterandSchmidhuber[57]andlaterrefinedbyGers[87],appearstobeoneofthemostextensivelyadoptedsolutionstothevanishinggradientproblemandlearndependenciesrangingoverarbitrarilylongtimeintervalsthathavebeensuccessfullyadoptedandusedformanysequencemodelingtasks.图3-1FromRNNtoLSTM[87].LSTMnetworkshaveapowerfulandexpressivearchitecturethathasbecomethemostpopularvariantofRNNtohandlesequentialdataandhavebeensuccessfullyappliedto-44- 
a range of sequence tagging problems such as POS tagging [88][89], NER [90][91], sentiment analysis [92][93][94], and speech recognition [95].
Hochreiter and Schmidhuber [57] proposed to change the basic unit of the RNN, which is a simple neuron, into a computer-memory-like cell, called the "LSTM cell". LSTM networks are built in a specific way: they are the same as RNNs except that the hidden layers are replaced by memory blocks [87], which make a difference in their capability to learn long-term dependencies. Figure 3-1 provides a comparison between the RNN and LSTM architectures: while RNNs contain cyclic connections in their hidden states, LSTMs keep the recursive connection of the RNN but with memory cells. The memory blocks store the state over time and have been shown to be better at finding and exploiting long-range dependencies in the data. Hochreiter and Schmidhuber [57] introduced a mechanism similar in spirit to the one later used by Cho et al. [64], using gates to counter the limited memory of RNNs. A memory block contains one or more memory cells: the LSTM has the ability to add or to remove information from the memory cell, which is controlled and protected by gates. A memory block is composed mainly of three gates: the input gate, the forget gate and the output gate.

Figure 3-2 Long Short-Term Memory network architecture.

The architecture of an LSTM unit is shown in Figure 3-2 and is the architecture used in this thesis. The main components of the LSTM unit are:
• Input: the LSTM unit takes the current input vector at time step t, denoted by x_t, and the hidden state of the previous time step, denoted by h_{t−1}. The weighted sum of the input and hidden state is passed through an activation function:

    x_t = σ(W_x [h_{t−1}, x_t] + b_x)    (3-1)

• Input gate: provides the input flowing into the memory cell. The input gate decides which values will be updated and what information to store in the cell. The input gate reads x_t and h_{t−1}, computes the weighted sum and applies a sigmoid activation:

    i_t = σ(W_i [h_{t−1}, x_t] + b_i)    (3-2)

• Forget gate: the forget gate is the mechanism through which an LSTM learns to reset the memory contents when they become old and are no longer relevant. This may happen, for example, when the network starts processing a new sequence. To remember or throw away information from the cell state, the forget gate reads x_t and h_{t−1} as inputs and applies a sigmoid activation function to the summed weighted inputs:

    f_t = σ(W_f [h_{t−1}, x_t] + b_f)    (3-3)

• Memory cell: the current cell state C_t is computed by forgetting irrelevant information from the previous time step and accepting relevant information from the current input. The result f_t is multiplied by the cell state at the previous time step, i.e., C_{t−1}, which allows forgetting the memory contents that are no longer needed, and is summed with the multiplication of the input gate and the current candidate state:

    C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t        (3-4)
    C̃_t = tanh(W_C [h_{t−1}, x_t] + b_C)   (3-5)

• Output gate: the output gate decides what parts of the cell state to output from the memory cell. The output gate takes the weighted sum of x_t and h_{t−1} and applies a sigmoid activation to control what information flows out of the LSTM unit:

    O_t = σ(W_o [h_{t−1}, x_t] + b_o)    (3-6)

• Output: the output of the LSTM unit, h_t, is computed by passing the cell state C_t through a tanh and multiplying it by the output gate O_t:

    h_t = O_t ⊙ tanh(C_t)    (3-7)

The parameters of the LSTM model are the weight matrices W and the bias vectors b in equations (3-1)-(3-7).
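As with the GRU in Chapter 2, the gate equations above can be transcribed directly into NumPy. The sketch below is purely illustrative: it follows the concatenated-input convention [h_{t−1}, x_t] of equations (3-2)-(3-7) and omits the input activation of equation (3-1); it is not the Keras implementation used in our experiments.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x_t, h_prev, c_prev, Wi, Wf, Wc, Wo, bi, bf, bc, bo):
        """One LSTM step following equations (3-2)-(3-7)."""
        hx = np.concatenate([h_prev, x_t])
        i = sigmoid(Wi @ hx + bi)          # input gate,           eq. (3-2)
        f = sigmoid(Wf @ hx + bf)          # forget gate,          eq. (3-3)
        c_tilde = np.tanh(Wc @ hx + bc)    # candidate cell state, eq. (3-5)
        c = f * c_prev + i * c_tilde       # new cell state,       eq. (3-4)
        o = sigmoid(Wo @ hx + bo)          # output gate,          eq. (3-6)
        h = o * np.tanh(c)                 # new hidden state,     eq. (3-7)
        return h, c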
3.2 The proposed Backward-BLSTM model for the CCG supertagging task
One shortcoming of simple LSTMs is that they are only able to make use of the previous information and memorize the context only from the past, without any information about the future. However, with a sophisticated approach where the whole context is accessible, there is no reason not to exploit the future context as well as the previous one. In CCG supertagging, we have access to both the preceding and the following context of an input at a given time step. Instead of an ordinary LSTM, a powerful solution whose effectiveness has achieved high accuracies in many sequence labeling tasks, such as POS tagging [96], NER [97] and speech recognition [95], is recurrent networks with BLSTM cells [98]. The bidirectional variant of the LSTM (BLSTM) relies on a simple idea: it uses the LSTM recurrent model in the backward and forward directions in time, one going left and one going right, to capture information from anywhere in the input sentence.
In our model, we use a BLSTM network based model. The fundamental objective is to gain access to two different LSTM layers, forward and backward respectively, to take advantage of the two orientations of an input at a given time step t. The outputs from each LSTM layer are then concatenated to form the final output. Contrary to some approaches, where a third network is used in place of the output layer, we use the simpler model. Our proposed model consists of three main modules to predict the final CCG output supertags: the Input Layer, the LSTM Neural Network, and the Output Layer.

3.2.1 Input Layer
In the input layer, our NN is inspired by the work of Collobert et al. [11], where feature vectors are computed by lookup tables, concatenated together and then fed to the network. The input layer consists of three main components: word embedding, suffix, and capitalization based features:
1. Word embeddings: our best model uses the pre-trained Google Word2Vec 300-dimensional embeddings trained on 100 billion words from Google News [83]. We also run our experiments on other published embeddings from Ling et al. [99], of 100 dimensions, trained on Reuters news data. In addition, as we hypothesized that the word embeddings used in the state of the art may perform better, we also used the publicly available Turian embeddings with 50 and 100 dimensions from Turian et al. [41]. Following Collobert et al. [11], all words were lowercased before passing through the lookup table, which converts them to their corresponding embeddings, and all numbers were replaced by the single digit '0'. Additionally, in the same way as Lewis and Steedman [23], we add two features for each word, namely capitalization and suffixes.
2. Capitalization feature: following Lewis and Steedman [23] and Xu et al. [45], the capitalization feature has only two values, indicating whether a given word is capitalized or not. This feature is calculated before the preprocessing of the data.
3. Suffix feature: we follow most of the state-of-the-art existing CCG supertaggers in using suffixes of size two.
We separately concatenate the representations of these features and then use them as the input to the network.

3.2.2 Neural Network
CCG supertagging is performed using BLSTM based models. In this architecture, the inputs encoded by the preceding process in the input layer are fed to a backward LSTM layer and then to a BLSTM layer, as follows:
1. Backward LSTM Layer: the extracted features of each word in a sentence are first concatenated in the input layer and then fed through a backward LSTM layer, which has a strong ability to memorize information over long distances. To compute the hidden state (h_t^B), the backward LSTM reads the input sequence from the end to the beginning at each time step, and the output of this layer is used as the input representation for the BLSTM.
2. BLSTM Layer: the input representations output by the first backward LSTM layer are then fed to a second backward LSTM, which computes the backward hidden state
(←h_t), and to a forward LSTM, which computes the forward hidden sequence (→h_t). This allows our model to process the data sequence and compute a representation for each input that depends jointly on information learned from the two orientations of the input (left and right) at a time step t. Finally, the outputs from each LSTM (backward and forward) at each time step are concatenated together [→h_t, ←h_t] and then fed through the output layer.

3.2.3 Output Layer
The output of the neural network at each time step t is fed through a Softmax layer to decode it into probabilities for each supertag and to make certain that the network outputs are all between zero and one and sum to one at each time step. Figure 3-3 illustrates the network architecture in detail.

Figure 3-3 Backward-BLSTM model for CCG supertagging.

3.3 Experiment Settings
The datasets and parameter values of our experiments are described in the following sections.

3.3.1 Experimental Data
We used different datasets to test the validity of our approach; mainly, we use in-domain and out-of-domain datasets. For the in-domain datasets, we used the CCGBank corpus [39] described in the first chapter, following the same split: Sections 2-21 as training, Section 00 as development set and Section 23 as in-domain test set. For the out-of-domain datasets, we use two datasets, namely Wikipedia (200 sentences) from Honnibal et al. [100] (available at https://sites.google.com/site/stephenclark609/resources), and the Bioinfer corpus (1,000 sentences) from Pyysalo et al. [101].

3.3.2 Data Preprocessing
The following preprocessing steps were applied to all our datasets:
• All words were converted to their lowercase form.
• All sequences of digits were converted into a single digit ′0′.
• For words and numbers containing ′n′, we backed off to the substring after the delimiter.

3.3.3 Implementation
The code for our experiments was written in Python 2.7.5. We implemented our Backward-BLSTM model using version 0.2.0 of Keras [82], a Theano-based NN library. Both training and testing were done at the sentence level.

3.3.4 Hyper-Parameters
As mentioned in the previous section, we performed experiments with different sets of publicly published word embeddings. Table 3-1 gives the performance of the different word embeddings in terms of 1-best accuracy. According to the results in Table 3-1, the models using Google's Word2Vec 300-dimensional embeddings obtain a significant improvement, showing that the choice of embeddings is crucial for improving performance on this task.

Table 3-1 Comparison of accuracy results on the development set using different word embeddings.
Word embeddings   Accuracy
Google-300        93.53
Turian-50         93.35
Turian-100        93.29
Ling-100          92.81

Abbreviations such as Google-300 refer to the Google Word2vec embeddings with a 300-dimensional embedding space.
We measured the accuracy on the development set for capitalization and suffix embedding
dimensions with different values of 5, 10, 16, 32, 64 and 128. The experimental results showed that a dimension size of five (5) achieved the highest accuracy. For the hidden dimension, we experimented with values ranging from 100 to 900, and a hidden dimension of size 400 showed the highest accuracy.

3.3.5 Learning Algorithm
Since a good optimization method yields better results, optimization is a central concern when dealing with machine learning problems. Training was done with the Adam optimizer with a fixed learning rate of 0.001. During training, we explored different types of optimization strategies such as SGD and AdaDelta [84] without any improvement over Adam. For the output layer, we used the Softmax activation function.

3.3.6 Dropout
We obtain significant improvements in our model's performance after using dropout; Table 3-2 compares the results with and without dropout for both the development and the test set, with all other parameters kept the same.

Table 3-2 1-best accuracy results with and without dropout on development and test data.
             Development Set   Test set
Dropout      94.09             94.25
No dropout   93.53             93.85

We observe an essential improvement in accuracy, which demonstrates that dropout brings a significant improvement in performance and is effective in reducing over-fitting [102]. We used a fixed dropout rate of 0.5.
Table 3-3 reports the chosen hyper-parameters for our best models. We tuned the hyper-parameters and then trained the models. We evaluate our models in terms of 1-best accuracy (the most likely predicted supertag). We trained the models for thirty (30) epochs; our best model was obtained at the 27th epoch. We used the model parameters with the highest accuracy on the development set. Figure 3-4 shows the 1-best accuracy of our proposed BLSTM model on Section 00 (development set) of the CCGBank.

Table 3-3 The final chosen hyper-parameters.
Hyper-parameter            Value
Word embeddings            Google's Word2Vec
Capitalization dimension   5
Suffix dimension           5
Dropout                    0.5
Number of epochs           30
Hidden dimension           400
Optimizer                  Adam
Learning rate              0.001

Figure 3-4 1-best accuracy of our proposed Backward-BLSTM model on the development set with and without dropout.
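Putting the architecture of Section 3.2 together with the settings chosen above, a compact Keras-style sketch of the Backward-BLSTM supertagger might look as follows. The suffix and capitalization embeddings of size 5 are omitted for brevity, the vocabulary size and sentence length are placeholder assumptions, a standard categorical cross-entropy loss is assumed, and a recent Keras API is used, which differs slightly from the Keras 0.2.0 release of our implementation.

    from keras.models import Sequential
    from keras.layers import Embedding, Dropout, LSTM, Bidirectional, TimeDistributed, Dense

    VOCAB_SIZE = 50000     # assumption: size of the word lookup table
    MAX_LEN = 100          # assumption: padded sentence length
    N_SUPERTAGS = 1286     # full label set observed in the training data

    model = Sequential()
    model.add(Embedding(VOCAB_SIZE, 300, input_length=MAX_LEN))  # Word2Vec-initialised lookup
    model.add(Dropout(0.5))
    # first layer: a backward LSTM that reads the sentence right-to-left
    # (note: with go_backwards=True Keras returns the outputs in reversed order,
    # which a faithful implementation would re-reverse before the next layer)
    model.add(LSTM(400, return_sequences=True, go_backwards=True))
    # second layer: a bidirectional LSTM over the backward layer's outputs
    model.add(Bidirectional(LSTM(400, return_sequences=True)))
    # per-token Softmax over the supertag inventory
    model.add(TimeDistributed(Dense(N_SUPERTAGS, activation='softmax')))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])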
哈尔滨工业大学工学博士学位论文intermsof1-bestaccuracyonSection23fromtheCCGBankcorpus.Ourproposedsupertaggersignificantlyoutperformsthestate-of-the-artsystems.Itisclearthatoursupertaggerisverycompetitive,despiteusingverysimplearchitectureandalargenumberofCCGsupertags.Moreover,wealsotestedourmodelontwoout-of-domaintestset.ItcanbeseenthatourBackward-BLSTMmodelyieldsbetterresultsandismuchbetterperformancethanthepreviousmodelswithalltestdata.TheonlyoneexceptionistheC&Csupertaggerwithgold-standardPOStags,inwhichweunderperformtheirresultsinBio-GENIA(weusedBio-GENIAgold-standardCCGlexicalcategorydatafromRimellandClark[103]asnogoldcategoriesareavailableintheBioinferdata).表3-51-bestaccuracyonthetestset.ModelSection23WikiGeniaC&C(goldPOS)93.3288.8091.85C&C(autoPOS)92.0288.8089.08NN91.5789.0088.16RNN93.0090.0088.27Ours94.2590.6288.55TomakeadirectcomparisonwiththeclosestworktoourspresentedbyLewisetal.,[58]andVaswanietal.,[59]usingLSTMarchitectures,weconductexperimentsusingthesamesetof425labels.TheresultsarereportedinTable3-6.表3-61-bestaccuracycomparison.ModelSection00Section23LabelSizeLewisetal.,201694.194.3425Ours94.2894.47425Vaswanietal.,201694.08–1286Vaswanietal.,2016+LM+Beam94.2494.51286Ours94.0994.251286-54- 第3章Backward-BLSTMmodelfortheCCGSupertaggingtaskFromTable3-6,wecanseethatourmodelisnomorethan0.01%lowerinaccuracytothemodelproposedbyLewisetal.,[58]onsection00usingthewholelabelset(1,286labels)andis(+0.12%)usingthesetof425labels.ThismodelusedadeeparchitectureofBLSTMwithasubsetof425lexicaltags.WhileourmodelachievesthesamelevelofaccuracyasVaswanietal.,[59]onsection00andisslightlyloweronsection23(-0.03%).Comparedwiththelatter,weshouldconsiderthatourmodelismuchsimplerintermofarchitectureandourhiddenstateisthirtypercentsmaller(400versus512).Vaswanietal.,[59]usedadifferentarchitecturewithdifferenttrainingprocedurebasedonBLSTM+LanguageModel+beamencoding.Themajordifferencesareresumedasfollow;ourmodelintroducedbackwardLSTMtocapturelong-rangecontextualinformationandeliminatetheneedforcontextwindowsandusedBLSTMtocaptureinformationfrombothpastandfuturedirections.Thesemakeoursupertaggermoreaccurateinrecoveringlong-distancedependenciesandmuchsimplerandcomparabletotherecentmodelsproposedbyLewisetal.,[58]andVaswanietal.,[59].3.4.2Multi-taggingResultsFollowingClarkandCurran[40]andCharniaketal.,[104],thesupertaggercanpoten-tiallyassignmorethanonesupertagtoeachwordwhoseprobabilitiesarewithinsomefactors.Forcategorieswhoseprobabilitiesarenotwithinfactor,theprobabilityofthe1-bestcategoryispruned.Tovalidateourapproach,wealsoconductmulti-taggingexperiments.Weevalu-atedourmulti-taggerusingthesamelevelsintroducedin[14].Forthemulti-taggingexperiments,wecalculatedperwordaccuracy,whereweconsiderthewordtobetaggedcorrectlyifthecorrectcategoryisincludedinthesetoftheassignedcategories.Wealsocalculatesentenceaccuracywhichisthepercentageofsentenceswhosewordsarealltaggedcorrectly.Wecompareourproposedmodelwithstate-of-the-artmethodsformulti-tagging.TheresultsarereportedinTable3-7.Inthiscase,itisobservedthatourmodelincreasestheperformanceofbothWORDandSENTaccuraciesonalllevels(0.075,0.030,0.010,0.005and0.001).-55- 
哈尔滨工业大学工学博士学位论文表3-7Performancecomparisonofdifferentmodelsformulti-taggingaccuracyonSection00fordifferentlevels.OursRNNNNC&C(autopos)C&C(goldpos)WordSENTWordSENTWordSENTWordSENTWordSENT0.07597.3668.2697.3366.0796.8361.2796.3460.2797.3467.430.03098.1575.9598.1274.3997.8170.8397.0565.5097.9272.870.01098.7182.0198.7181.7098.5479.2597.6370.5298.3777.730.00599.0585.4199.0184.7998.8483.3897.8672.2498.5279.250.00199.5391.2999.4190.5499.2989.0798.2580.2499.1787.193.5SummaryIntermoflearninglongdependencies,LSTMgivesgoodperformanceoverstandardRNNbasedmodels.Insomecases,theinformationontherightsideisveryimportant.BackwardLSTMisverypowerfulincapturingtheinformationforalongtime.However,theLSTMhiddenstatetakesinformationonlyfromthepast,knowingnothingaboutthefuture.Ontheotherhand,forwardLSTMisveryefficientonmemorizinginformationontheleftcontext,butinsomecases,itismoreimportanttoobservethepreviouscontextratherthanthefutureone.InCCGlabelingtask,wehaveaccesstobothleftandrightinformationcontext(previousandfuture).Inthischapter,wedemonstratedtheadvantagesofBLSTMthansimpleRNNsandGRUsfortheCCGSupertagging.Ourproposedmodeloutperformedpreviousresultsonsupertaggingandmulti-taggingcomparingtostate-of-the-artmodelsonvariousbenchmarkdatasets.AnanalyzeofourexperimentsresultsindicatetheneedofBLSTMtocaptureinformationinbothdirections.ThemainfindingsfromthedirectcomparisonofourBackward-BLSTMmodelagainstthestate-of-the-artexistingmodelsareasfollows:(1)ourBackward-BLSTMmodelreachesahigheraccuracyscore.(2)ItissignificantlybetterabletotrainthesupertaggeronthefullsetofCCGlexicalcategoriesobservedduringtraining.(3)Itoutperformsevenforsupertaggingandmulti-tagging.OurmainfindingssupportthehypothesisthattheLSTM-basedmodelsaremorepowerfulinmodelingsequentialdata.TheimprovementscanonlybeduetoBLSTMarchitectureadvantages.Overall,theresultswepresentinthischapterindicatesthatallofourresultsarecomparablewithstate-of-the-artresults.Ourresultsarepromisingand-56- 第3章Backward-BLSTMmodelfortheCCGSupertaggingtaskshowthatourmodelcancompetewith,andinmostcasesoutperform.AlthoughBLSTMperformsreasonablywellfortheCCGsupertaggingtask,itusessentencelevelrepresentationprocessingasequencewithoutanycorrelationsbetweenlabelsinneighborhoodswhichhavegreatinfluencesonpredictingthecurrentlabel.Usingagoodnetworkthatcanlearnsentencerepresentationwherewecangainfrombothpastandfutureinputfeaturesandcanusesentenceleveltaginformationmightbebeneficialforourtask.Inthenextchapter,weplantouseacombinationofmachinelearninganddeeplearningmodelswhichcanmakeuseofbothtagandsentencelevelsrepresentationsfortheCCGsupertaggingtask.-57- 
哈尔滨工业大学工学博士学位论文第4章BLSTM-CRFmodelfortheCCGSupertaggingtask4.1IntroductionMachinelearningmethodsweresuccessfullyappliedtotheCCGsupertaggingtaskincludingMEmodels[14]andNNwithCRFs[23].MachinelearningmodelstreattheCCGsupertaggingtaskasastructuredpredictionproblemandtrytojointlypredicttheentiresequenceoutputbutrequireextensivefeatureengineeringsuchaslexicalfeatures(POStags)toprovidegoodresults.Ontheotherhand,deeplearningmodelssuchasRNNsandBLSTMsusedifferentmethodstoautomaticallyextractfeaturesthatcontaininformationaboutthecurrentwordanditsneighboringcontextwhileonlyrequiringasequenceoftokensasinput.Insimplerecurrentnetworks,theycontaintheentiresentenceperformingprocessingofsentencewiththeoutputdependingonthepreviouscomputations.Bidirectionalrecurrentnetworkscontaintheentiresentenceandperformcomputationsfromtheprecedingandfollowingdirections.InthepreviousChapter,wehavedescribedLSTMbasedmodelsfortheCCGsupertaggingtask.LSTMsareconsideredasthebestmodelsinassigningCCGlexicalcategoriestoagivensentencebasedontheirabilitytoretaininformationforlonghistoricaltimedependencies,aswellastheirabilitytoworkwithbothpastandfutureinformationwhenBLSTMsareexploredforthetask.However,itiswellknownthatevenBLSTMshaveshowntobeextremelygoodatmemorizinginformationforalongdistancetheystillpredicteachwordoutput(label)inisolationwithoutanyregardstothepreviouslypredictedsupertagsandnotaspartofasequence.Machinelearningmodelsanddeepnetworkshavetheirowncapabilitiesandshort-comings.Insimplerterms,whiledeeplearningmodelsattempttobenefitfromrecognizingsamplesinthesurroundinginputfeatureswithoutrelyingonanyfeaturesengineeringandlearntopredicttheoutputsforthesequencebyrequiringonlyplaintextasinputwithoutanyinformationaboutthepreviouslypredictedsupertags,themachinelearningmodelslikeCRFevennecessitatemanyhand-craftedfeaturesbutstillbenefitfromtheknowledge-58- 
about adjacent label predictions (surrounding outputs).
Capturing dependencies between predictions, whether by modeling the dependencies between the input representations using deep learning models or by modeling the structural dependencies between output predictions using machine learning algorithms, is very important and beneficial for the CCG supertagging task. For this reason, in this chapter, we benefit from the two approaches. We introduce a structured neural network architecture for the CCG supertagging task. The method is based on the combination of machine learning and deep learning methods. Specifically, the approach assigns CCG lexical categories to each word in an input sentence in two steps. In the first step, we use BLSTM networks to operate on the input context; the model is able to memorize information from the preceding and following words over long spans and long sequences. Afterwards, in the second step, the model benefits from knowledge about neighboring label predictions, where a CRF layer is exploited to jointly predict the final supertags.
The organization of the chapter is as follows: Section 4.2 provides some basic definitions and notation for the LSTM and CRF models, together with the description of our particular approach to CCG supertagging using the LSTM-CRF combination. Section 4.3 describes our experimental setup for the task. Section 4.4 presents the experimental results, and Section 4.5 provides the conclusion.

4.2 Model Description
4.2.1 BLSTM Network
LSTMs are the best technique for the CCG supertagging task among the family of RNN techniques and the existing conventional machine learning algorithms, because they have a proven capability to store long-range contextual information and have been successfully applied to the CCG supertagging task [58][59]. LSTMs can resolve the vanishing gradient problems faced in training simple RNNs and are better than GRUs at learning over long time steps. LSTMs have the ability to use their memory blocks, which consist of three gates (input gate, forget gate, and output gate) together with a recurrent cell, as discussed in the previous chapter (Chapter 3), to make decisions on what information is allowed to be stored in the memory, read from it and saved to it.
One shortcoming of LSTMs is that they are only able to make use of the previous
第4章BLSTM-CRFmodelfortheCCGSupertaggingtaskpredictedsupertagsandnotaspartofasequence,comparedtomachinelearningmodelssuchasHMMsandCRFmodelsthatarepowerfulforstructuredpredictionproblemsastheycangainknowledgefromthesurroundinglabels.AninterestingapproachtosolvetheCCGsupertaggingproblemistobenefitfrombothdeeplearningandmachinelearningmethods.Thisisveryimportantbecausepre-dictionwillnotonlydependoninputrepresentationsbutalsodependsonthedependencebetweenoutputpredictions.Inthiswork,wealsobenefitfromtheadvantagesofamachinelearningalgorithmthatwewillcombinewithBLSTMarchitecturedescribedinFigure4-1.Firstly,belowwewilldescribetheCRFmodels.4.2.2ConditionalRandomFieldsTheCCGsupertaggingtaskis,givenasentenceofn-words,assignCCGlexicalcategories(supertags)toeachwordinthesentence.OneapproachtoCCGsupertaggingistoclassifyeachwordindependentlywhichisthecaseofdeeplearningmodels.Theproblemwiththisapproachisthatitassumesthatgiventheinput,alloftheCCGlabelsareindependentandoftenproducesunsatisfactoryresults.Infact,toachievebetterresults,wemusttakeintoaccountthatwearepredictingstructuredoutputsandmodelingtheproblemtoincludeourpreviousknowledge.IntheCCGsupertaggingtask,labelsofneighboringwordsaredependentanditisnecessarytohaveinformationaboutthesurroundingpreviouslypredictedsupertags.Predictingthecurrentsupertagsbytakingintoaccounttheadjacenttagscanbemadein2ways:first,bypredictingadistributionofsupertagsateachtimestep,thenusebeamsearchtofindtheoptimalsequence[59].Second,byrelaxingtheindependenceassumptionthatcanbedonewiththefocusonsentence-levelinsteadofindividualpositionwheretheadjacentoutputvaluesinfluenceeachotherandtakeadvantageofthesurroundinglabels,thusleadingtoConditionalrandomfields(CRF)asoneofthebestperformingstatisticalmodelsformanysequencetaggingtasksbyarrangingtheoutputvariablesinalinearchain.TheadvantageofCRFsoverHMMmodelsistheirconditionalnature,resultingintherelaxationoftheindependenceassumptionsrequiredbyHMMstoensuretractableinference.Additionally,CRFsavoidthelabelbiasproblem[105],aweaknessexhibitedbyMaximumEntropyMarkovModels[106](MEMMs)andotherconditionalmarkovmodels-61- 哈尔滨工业大学工学博士学位论文basedondirectedgraphicalmodels.CRFsoutperformbothMEMMsandHMMsonsomeofreal-worldsequencelabelingtasks[105][107][108].图4-2CRFGraph.CRF[105][109]isafamilyofstatisticalmodelsasprovensupervisedlearningmethodthathasbeenusedextensivelyformanyNLPapplicationsaswellasmanylabelingse-quentialdatatasks.CRFareprobabilisticgraphicalmodelsoftheconditionaldistributionp(y|x)trainedtomaximizeaconditionalprobabilityofstructuredoutputvariablesygivenobservationsx.Whenusedforsequencetaggingproblems,acommongraphstructureusedisalinearchainwithastatetransitionmatrixwherewecanefficientlyusepreviousandfutureoutputstopredictthecurrentoutput.WhenwemodeltheCCGSupertaggingproblem,themostcommongraphstructureisillustratedinfigure4-2.FortheCCGsupertagging,thelinearchainCRFisgivenaninputsequence:x=¹x1;x2;:::;xTº;(4-1)andanoutputstatesequence:y=¹y1;y2;:::;yTº;(4-2)alinear-chainCRFwithparametersWdefinesaconditionalprobabilityfortheoutputsequence(Eq.4-2)asfollows:1∏NP¹y˜jxº=expf¹y˜tº+¹y˜t;y˜t+1ºg;(4-3)Zt=1where¹y˜tºistheunarypotentialforthelabelatpositiont,¹y˜t;y˜t+1ºisthepair-wisepotentialbetweenthepositionstandt+1,andZisanormalizationfactor.-62- 
第4章BLSTM-CRFmodelfortheCCGSupertaggingtask4.2.3BLSTM-CRFproposedmodelfortheCCGSupertaggingtaskRecentworksonNERbyHuangetal.,[110]andothershavecombinedthebenefitsoflinearstatisticalmodelswithneuralnetworkstosolvemanysequencetaggingtasks.Inthisapproach,weintroduceastructuredneuralnetworkarchitecturefortheCCGsupertaggingtask.Specifically,theapproachassignsCCGlexicalcategoriestoeachwordinaninputsentenceintwosteps.Inthefirststep,itusesBLSTMnetworktooperateoninputcontextandtoconsidertheinputfeatures;themodelisabletomemorizeinformationforlong-rangedependenciesandfromleftandrightpositions.Afterward,themodelbenefitsfromtheknowledgeaboutneighboringlabelpredictionswhereaCRFlayerisexploitedtoobtainsentenceleveltaginformationandjointlypredictthefinalsupertags.Therefore,theoutputisanoptimaltagsequenceinsteadofmutuallyindependenttagswhichcomprisestwoaspectsforcouplinginputandoutputlevels.OurproposedmodelconsistsofthreemainoperationstopredictthefinalCCGoutputsupertags:InputLayer,BLSTMNeuralNetworkandtheCRFOutputLayer.1.InputLayer:followingCollobertetal.,[11],inputfeaturevectorsarecomputedbylook-uptables,concatenatedtogetherandthenfedtothenetwork.Theinputlayerconsistsof3lookuptablesoffeaturevectorsasinputfeaturesthatarefirstconcatenatedandthenfedintothenetwork,asdescribedbelow:Pretrainedwordembeddings:tocapturethesemanticandsyntacticsimilaritybetweenwordsandreducetherequirementforhandcraftedfeatures,wemakeuseofpre-trainedwordembeddingsasdistributedwordrepresentationswhichmapeachwordtoahighdimensionalvectorspace.Toobtainthefixedwordembeddingofeachwordweuseapre-trainedwordembeddingsmodel.Ourmodelusethepre-trainedGoogle’sWord2Vec300-dimensionalembeddingstrainedon100billionwordsfromGoogleNews[83].FollowingCollobertetal.,[11],allwordsarelower-casedbeforepassingthroughthelook-uptablestoconvertthemintotheircorrespondingembeddingsandalsoallnumbersarereplacedbyasingledigit’0’.Forwordsthatdonothaveanentryinthepre-trainedwordembeddings,the’UNKNOWN’entryfromthepre-trainedembeddingsisused.Twofeaturesthatcontaincharacter-levelinformation,namelycapitalizationandsuffixwasusedinourexperiments.Capitalizationfeature:thecapitalizationfeaturehasonlytwovaluesindicatingwhether-63- 哈尔滨工业大学工学博士学位论文thegivenwordiscapitalizedornot.Suffixfeature:followingthealmoststate-of-the-artexistingCCGsupertaggingmodels,weusesuffixesofsizetwo.2.BSLTMNeuralNetwork:inthesupertaggingfortheCCGgrammar,itisbeneficialtoemployasophisticatednetworksuchasBLSTM[98],whichcanberegardedasapileoftwoLSTMlayers.ThepreviousinputrepresentationsareextractedbyaforwardLSTMlayer,andthefutureinputrepresentationsarecapturedbyabackwardLSTMlayer.Inthisway,wecaneffectivelyutilizethepreviousandfuturefeatures;asdescribedinChapter3.OurneuralnetworkforCCGsupertaggingisconstructedofadeepBLSTMnetwork.The !BLSTMreadstheinputwhereaforwardLSTMcomputesthehiddensequence(ht)and 
readsinputfromthebeginningtotheendandabackwardLSTM(ht)usestheoppositedirection.Inordertocapturecomplexinteractionsbetweeninputwords,weusedtwolayersofBLSTM,whichisthesameasLewisetal.,[58]wheretheoutputofthefirstBLSTMlayerisusedastheinputrepresentationtothesecondBLSTMlayer.ThentheoutputsfromthesecondBLSTM(backwardandforward)areprovidedasinputtotheoutputlayer.图4-3Theneuralnetmechanism.3.OutputLayer:inourpreviousworks,theoutputsfromtheneuralnetworkateachtimesteparefedintoadenselayerwiththeSoftmaxfunctionaslinearactivationfunction,whoseoutputsizeequalsthenumberofsupertags.ThedifferenceinthisworkisthatwedonotusetheSoftmaxoutputbutratherutilizetheoutputofthedenselayerforanadditionalCRFlayerwhichcomputesthefinaloutputsbyjointlydecodingtheminto-64- 第4章BLSTM-CRFmodelfortheCCGSupertaggingtaskprobabilitiesforeachSupertagformingthebestlabelsequenceofthenetwork.TheCRFlayerensuresmodelingtheoutputprobabilityofthecurrentinputgivenasequenceofneighboringlabelsasillustratedinFigure4-3.ThearchitecturerepresentationofourneuralBSLTM-CRFcombinedmodelfortheCCGsequencelabelingtaskisshowninFigure4-4.图4-4BLSTM-CRFnetworkmodelfortheCCGsupertagging.4.3ExperimentSettingsNeuralnetworksaredifficulttoconfigure,andtherearealotofparametersthatmajorlyinfluencethelearningandtheperformanceofthenetwork,andneedtobewelltunedtofindtheoptimumvaluestoimprovetheaccuracyofthemodel.Inthissection,weprovidedetailsabouthyper-parameterstuningtotraintheneuralnetwork.4.3.1DatasetsWeevaluatedtheeffectivenessofourmodelontheCCGsupertaggingtaskonindomainandout-of-domaindatasets.Section00(1913sentences)oftheCCGBankcorpus[39]isusedasadevelopmentsettoselectourhyper-parametersandSections02–21fortraining.-65- 哈尔滨工业大学工学博士学位论文Supertaggingperformancesarereportedbasedontheaccuracyonsection23(2407sentences)asindomaintestdata,Wikipedia(200sentences)fromHonnibaletal.,[100]andBio-Geniacorpus(1000sentences)fromPyysaloetal.,[101]asout-of-domaindatasets.Similartoourpreviousworks,somestepswereperformedbeforethesupertaggermodelcanbebuiltsuchasallwordswerelowercased,andallsequencesofdigitswereturnedintoasingledigit′0′.Forallsymbols(wordsornumbers)containing′n′,webacked-offtothesubstringafterthedelimiter.4.3.2WordembeddingsAlldatasetsentenceswererepresentedasasequenceofone-hotvectorswhichwerebeingtransformedintoasequenceofwordembeddingsbytheembeddingweights.Theseembeddingweightswereinitializedwithpre-trainedwordrepresentationsandmorespecificallywiththepubliclyavailablepre-trainedvectorscreatedusingword2vec;weused300-dimensionalvectorstrainedonGoogleNews[83].4.3.3OptimizationAlgorithmParameterswereoptimizedusingAdamoptimizer[111]totrainourmodelwithaninitiallearningrateof0.001.WehaveexploredothermoresophisticatedoptimizationalgorithmssuchasSGDandAdeDelta[84]withoutanyimprovementoverAdam.4.3.4DropoutTrainingDeepneuralnetworksaredifficulttotrain,andover-fittingtodataisamajorchallenge.Themostcommonregularizationtechniquetopreventover-fittingisDropout[102].Duringtraining,weapplieddropouttotheinputlayeroffixedrateto0.3thatwasquiteeffectivetoregularizeourmodelandreduceover-fittinggivingsignificantimprovementsinaccuracy.4.3.5Hyper-ParametersTuningImplementationwasdoneinTheano[112]usingtheversion1.2.2oftheKerasdeeplearninglibrary[82]andallmodelsweretrainedonTeslaK40mGPU.Westartbyevaluatingtheperformanceofourmodelonthedevelopmentsetateveryepoch,andthebest-performingmodelwasthenusedforevaluationonthetestset.Thelargerthenetwork,themorepowerfulbutitisalsoeasiertooverfit.Intheexperiments,wetestedtheaccuracyofourmodelonthedevelopmentsetwiththehiddendimensionvaluesrangeinthesetof{10
0,200,300,400,600,700}andfoundthatthe-66- 第4章BLSTM-CRFmodelfortheCCGSupertaggingtask表4-1Thefinalhyper-parameterssettingsforourmodel.Hyper-parameterValueWordembeddingsWord2VecHiddendimension400OptimizerAdamDropout0.3Learningrate0.001hiddendimensionwithsize400showsthehighestaccuracy.Forsuffixandcapitalization,wefollowedthestate-of-the-artandusedafixedembeddingofsizeequaltofive(5).Wetunedthehyper-parametersthentrainedthemodels.Theresultsofourexper-imentsarereportedwiththebestmodel,whichisselectedbytheperformanceonthedevelopmentset.ThefinalchosenparametersarereportedinTable4-1.4.4ResultsandAnalysisInthissection,wereporttheevaluationoftheperformanceofourproposedBLSTM-CRFmodelforCCGsupertaggingforbothin-domainandout-of-domaindatasets.Wealsoperformmultitaggingexperiments,theresultsarediscussedbelow.4.4.1SupertaggingResults表4-2Performancecomparisonwithstate-of-the-artmethodsonthedevelopmentset.ModelAccuracyC&C(goldPOS)92.60C&C(autoPOS)91.50NN91.10RNN93.07BLSTM94.1BLSTM+LM+Beam94.24BLSTM+Attention94.31Ours94.37Table4-2providestheaccuracyoftheproposedBLSTM-CRFmodelforCCG-67- 哈尔滨工业大学工学博士学位论文supertaggingonthedevelopmentset.AsshowninTable4-2,wecompareourmodelwithbaselineexistingmodelsincludingC&CmodelwithbothgoldPOSandautoPOSproposedbyClarkandCurran[14],thefeed-forwardNNmodelbyLewisandSteedman[23]andtheRNNsupertaggerbyXuetal.,[45]whereourmodelsignificantlyoutperformintermofaccuracy(1-bestpredictedlexicalcategory).ItconcludesthattheuseofBLSTMcanbringbetterperformancethansimplerecurrentnetworksandtheuseofCRFcanmodelmorestructuredependence.Withtheemergenceofdeeplearning,therearelotsofworkonCCGsupertagging.Wealsomakeacomparisonofourmodelwithsomerecentworks.WecomparedourmodelwiththerecentlyproposedmodelsbasedonBLSTMarchitecturesincludingLewis’setal.,[58]modelbasedondeepBLSTMnetwork,Vaswani’setal.,[59]modelbasedonBLSTMandenhancedwithaLanguageModel(LM)andbeamsearchwhileXu[49]usedBLSTMwithattentionmodel.Ourmodelgainmoreaccuracy(orcloseto)thandeeplearningmodelsbasedonBLSTMarchitectureswhichshowthatmodelingoutputwithastructuredmodelasCRFisveryimportantforCCGsupertaggingasasequencelabelingtask.SocombiningBLSTMwithCRFishelpfulinthistask.TheresultsreportedinTable4-3presenttheevaluationofourmodelwithexistingmodelsonthetestset(Section23oftheCCGBank).Furthermore,wealsoevaluateourmodelontwoout-of-domaindatasetsnamelyWikipediaandBio-Geniacorpus.表4-3Performancecomparisonwithstate-of-the-artmethodsonthetestset.ModelSection23WikiGeniaC&C(goldPOS)93.3288.8091.85C&C(autoPOS)92.0288.8089.08NN91.5789.0088.16RNN93.0090.0088.27BLSTM94.30––BLSTM+LM+Beam94.5––BLSTM+Attention94.46––Ours94.4990.388.51-68- 
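As a compact reference, the configuration summarized in Table 4-1 and Section 4.2.3 can be sketched as follows. This is a hedged illustration written against the modern tf.keras API rather than the Keras 1.2.2/Theano setup used in these experiments; vocabulary sizes, sequence length, the tag inventory and the embedding matrix are placeholders, and the CRF output layer of Section 4.2.3 is not included, since core Keras does not ship one. The sketch stops at the per-position supertag scores that such a layer would consume.

```python
# Hedged sketch of the BLSTM front end behind Table 4-1 (not the original code).
# Placeholder sizes; the real model initializes the word table from Word2Vec.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB, N_SUFFIX, N_TAGS, EMB_DIM, HIDDEN = 60, 50000, 500, 425, 300, 400
pretrained = np.random.normal(size=(VOCAB, EMB_DIM)).astype("float32")  # stand-in for Word2Vec

words = layers.Input(shape=(MAX_LEN,), dtype="int32", name="word_ids")
caps = layers.Input(shape=(MAX_LEN,), dtype="int32", name="capitalization")  # capitalized or not
sufs = layers.Input(shape=(MAX_LEN,), dtype="int32", name="suffix")          # 2-character suffix ids

w_emb = layers.Embedding(VOCAB, EMB_DIM,
                         embeddings_initializer=tf.keras.initializers.Constant(pretrained))(words)
c_emb = layers.Embedding(2, 5)(caps)          # fixed size-5 feature embeddings
s_emb = layers.Embedding(N_SUFFIX, 5)(sufs)

x = layers.Concatenate()([w_emb, c_emb, s_emb])
x = layers.Dropout(0.3)(x)                    # dropout on the input layer (Section 4.3.4)
x = layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True))(x)   # 2-layer deep BLSTM
scores = layers.TimeDistributed(layers.Dense(N_TAGS))(x)  # per-position supertag scores

model = Model([words, caps, sufs], scores)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
```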
第4章BLSTM-CRFmodelfortheCCGSupertaggingtaskTheresultsofthetestsetarecompetitive,evenwhencomparedtopreviousworkusingmanyfeatures,thenetworkachieves94.49%onSection23comparedto93.32%and90.3%onWikidatacomparedto88.80%byClarkandCurran[14]withgoldPOS.Insomecases,wearealsoabletobeat(orcloseto)thebestresultswhereweobtain94.49%onSection23comparedto94.46%byXu[49]andourmodelisnomorethan0.01%loweraccuracytothemodelproposedbyVaswanietal.,[59]astheirmodelisenhancedwithalanguagemodel.However,ClarkandCurran[14]reportaconsiderablyhigherresultof91.85%onBio-Geniadatasets,comparedtotheexperimentspresentedhere,theirmodelusedthegoldcategorieshoweverinourexperimentsnogoldcategoriesareavailableintheBio-Geniadata.4.4.2Multi-taggingResultsItisalsoimportanttocompareourmodelformulti-tagginginwhichweaimtoincreasethenumberoftheassignedlexicalcategoriestoeachword.Weperformmulti-taggingexperimentstomakeourBLSTM-CRFsupertaggermoreaccuratewherethesupertaggerisabletoassignmorethanonelexicalcategorytoeachwordwithinafactor.WeusedthelevelsdefinedbytheC&Cparser[14]todefinecut-offsformulti-taggingbasedontheprobabilitiesfromtheBLSTM-CRFmodel.表4-4Performancecomparisonofdifferentmodelsformulti-taggingaccuracyonSection00fordifferentlevels.OursRNNNNC&C(autopos)C&C(goldpos)WordSENTWordSENTWordSENTWordSENTWordSENT0.07597.3568.1297.3366.0796.8361.2796.3460.2797.3467.430.03098.1275.9298.1274.3997.8170.8397.0565.5097.9272.870.01098.7281.9598.7181.7098.5479.2597.6370.5298.3777.730.00599.0185.2399.0184.7998.8483.3897.8672.2498.5279.250.00199.4991.1999.4190.5499.2989.0798.2580.2499.1787.19Theperformanceformulti-taggingismeasuredforbothWORDaccuracywhereweconsiderthewordtobetaggedcorrectlyifthecorrectcategoryisincludedinthesetoftheassignedlexicalcategoriesandSENT(sentence)accuracywhichisthepercentageofsentenceswhosewordsarealltaggedcorrectly.Theresultsoftheseexperiments-69- 哈尔滨工业大学工学博士学位论文arepresentedinTable4-4whereTheWORDcolumngivesthewordaccuracies,andtheSENTcolumngivesthesentenceaccuracies.ItcanbeseenthatourmodelresultsimproveperformanceoneverylevelsthanthepreviouslyproposedmodelsforbothWORDandSENTaccuracy.4.5SummaryDevelopmentofdeeplearningmodelsforCCGsupertaggingtaskisapowerfulcomplementtoclassicalmachinelearningmodelsthatworkwellwithoutrequiringanylexicalorhand-craftedrepresentations.Whiledeepnetworksarepowerfulformodelinginputsequences,thesemodelsstillpredicttheoutputwithoutanyregardstothepreviouslypredictedlexicalcategories.Inthischapter,theproposedmethodemploysacombinationofbothdeeplearningandmachinelearningmethodstoperformrepresentationlearningjointlyoverbothinputsandoutputs.WehavedescribedacombinedBLSTMandCRFmodelsbasedapproachforautomaticCCGsupertagging.TheBLSTM-CRFcombinationbasedsupertaggerperformreasonablybettercom-paredtothemachinelearninganddeeplearningsupertaggers.AkeyaspectofourmodelisthatitmodelsoutputlabelsviaasimpleCRFarchitecture,andinputwordsviaBLSTMnetworkscapturingcomplexinteractionsbetweenwordsandmemorizinginformationforlonghistoricaltimefrombothpastandfutureinputdirections.Themodeldescribedhereissimpleandquiteeffectiveforsupertagging.ThebestperformanceisachievedfortheBLSTM-CRFmodelonin-domainandout-of-domaindatasetsshowingthatthecombinedmodelisefficientandpowerfultosupertaggingfortheCCGgrammar.-70- 
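To make the multi-tagging evaluation of Section 4.4.2 concrete, the sketch below applies a factor-based cut-off to per-word category distributions and computes the WORD and SENT accuracies of the kind reported in Table 4-4. It is an illustrative reading of the C&C-style levels, not the exact evaluation script used here: the toy probabilities are random, and the rule that a category is kept when its probability is within a factor beta of the best category is our assumption.

```python
# Sketch of factor-based multi-tagging and WORD/SENT accuracy (Section 4.4.2).
# Illustrative only; inputs are toy data.
import numpy as np

def multitag(probs, beta):
    """probs: (T, K) per-word category probabilities -> list of kept category-id sets."""
    return [set(np.where(p >= beta * p.max())[0]) for p in probs]

def word_sent_accuracy(sentences, gold, beta):
    """sentences: list of (T, K) probability arrays; gold: list of gold category-id sequences."""
    correct_words = total_words = correct_sents = 0
    for probs, tags in zip(sentences, gold):
        kept = multitag(probs, beta)
        hits = [t in s for t, s in zip(tags, kept)]
        correct_words += sum(hits)
        total_words += len(hits)
        correct_sents += all(hits)
    return correct_words / total_words, correct_sents / len(sentences)

# toy example using the beta levels listed in Table 4-4
rng = np.random.default_rng(1)
sents = [rng.dirichlet(np.ones(10), size=7) for _ in range(5)]
gold = [rng.integers(0, 10, size=7) for _ in range(5)]
for beta in (0.075, 0.030, 0.010, 0.005, 0.001):
    print(beta, word_sent_accuracy(sents, gold, beta))
```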
第5章Character-WordembeddingsfortheCCGSupertaggingtask第5章Character-WordembeddingsfortheCCGSupertaggingtask5.1IntroductionMachinelearninganddeeplearningmethodshaveallbeenprovedtobeeffectiveinsolvingtheCCGsupertaggingproblem.However,someexistingapproachesheavilyrelyonfeatureengineeringswhererecentworksarebasedonneuralnetworkarchitecturesthatareabletoachieveimprovedresults,whileonlyrequiringasequenceoftokensasinput[11].Theemergenceofdeepneuralnetworksaimatbuildingdeepandcomplexencoderstotransformasentenceintoencodedvectorsandhavereachedstate-of-the-artperformanceintheNLPfieldandrelyonlyonwordembeddingstocapturesimilaritybetweenwordsbyreplacingeachwordinalower-dimensionaldistributionandinitializetheweightsofembeddingslayerwithpre-trainedwordvectorssuchasTurian[41]andword2vec[83]embeddingswhichenablethemtolearnsimilaritybetweenwordswithoutrequiringanylexicalfeatures.However,theeffectivenessofwordembeddingsislimitedbyunseenandwordsofverylowfrequencyinthetrainingdatawhereembeddingsdonotexist.Inotherwords,themostobviousproblemofwordembeddingsbasedmodelsarerelatedwhendealingwithwordsthatdon’tappearinthepre-trainedwordembeddingvectors-ifasymbol(token)hasbeenseenrarely,ithasanembeddingsentry,however,itwillbeoflowquality,anotherimportantcaseiswhensymbolsdidn’tappearbefore,then,ithasnoentriestotheembeddingsandthemodelneedstoback-offtotheOutOfVocabulary(OOV)representation.Toaddressthisissue,weexploredeepneuralnetworkembeddingsbasedmodelsforhandlingrareandunseenwordsbycombiningCharacterandWordembeddings.WeimprovethatCharacter-basedmodelrevealssimilaritiesbetweenwordsandcanbeusedtomodelinfrequentandunknownwords.InthisChapter,weproposeaneuralsequencelabelingarchitectureforCCGsu-pertaggingwhereweapproachthechallengeofunseenandrarewords.WeusetheBLSTMneuralnetworkforword-levelrepresentation.Todealwiththedrawbacksof-71- 
哈尔滨工业大学工学博士学位论文thewordbasedmodel,aCharacterBLSTMrepresentationmodelisaddedtothewordmodel,thereby,thecombinedmodelcaninferrepresentationsforpreviouslyunseenandrarewords.Thechapterisorganizedasfollows.Section5.2introducesourneuralnetworkarchi-tectureusedforCCGsupertagging.Next,Section5.3givesdetailsaboutourexperimentsandSection5.4providesexperimentalresultswithacomparisonwithpreviousworksforbothsupertaggingandmulti-tagging.Finally,insection5.5weconcludethechapter.5.2Character-WordembeddingsproposedmodelfortheCCGSu-pertaggingtaskDespitetherelativelylargeamountofworkdoneonCCGsupertaggingproblem,therehasbeennoworkaddressingthedegradationoftheperformanceonout-of-domaindatasetswherethemainreasonisOOVeffects[113][114]asthemodelperformancesuffersbecauselexicalknowledgeisnotavailableinthepre-trainedwordembeddingsforthesewords.Thereby,toachievehigheraccuracyinCCGsupertagging,itisalsoimportanttohaveagoodmodeldealingwithunknownandrarewords.Inthissection,wediscussthemodelweemployedtopredictaCCGlexicalcategoryforeachwordinasequenceinput.Forourmodel,threestepsarerequiredtoassignCCGsupertagstoagivenstringorlistoftokens:•first,trainabasicword-levelneuralnetwork,•next,trainacharacter-levelrepresentationsneuralnetworkand,•finally,combinethetwoarchitecturesinordertopredictthefinaloutput.5.2.1Word-LevelNeuralNetworkGivenasequenceofwordsasinput,wefirstdescribethebasicword-levelneuralnetworktowhichtheinputlayerofinputvectorsisfed.Wordembeddingshavebeenprovedtobeusefulforvarioustasks,suchasPOSTagging[11],sentenceclassification[115],sentimentanalysis[116],sarcasmdetection[117]andCCGsupertagging[23].Themodelreceivesasinputasequenceofwords(W1;W2;:::;Wm),wheretokensaremappedtowordembeddingslayerinitializedwithpretrainedvectors,resultingina-72- 第5章Character-WordembeddingsfortheCCGSupertaggingtasksequenceofwordembeddings(eW1;eW2;:::;eWm).IthasbeenprovedthatBLSTMnetworksareverypowerfulfortheCCGsupertaggingtask[58][59]andourpreviousworksinChapters3and4.Tobettermemorizeinformation,theinputrepresentationsfromthepre-trainedwordembeddingsarethenfedintothewordlevelnetworkasapartialnetwork,whichconsistsoftwoLSTMRNNslayers—abackwardLSTMtobettermemorizeinformationfromthepastandaforwardfortheoppositedirectionperformingcomputationonbothprecedentandnextwordinputsasfollows: ! !ht=LSTM¹eWt;ht 1º;(5-1) ht=LSTM¹eWt;ht+1º:(5-2)Next,therespectiveLSTMrepresentations(backwardandforward)areconcatenatedforeachwordrepresentations(equation5-3)asdepictedinFigure5-1. ! 
ht=»ht;ht¼:(5-3)Ourbestmodelusesthepre-trainedGoogle’sWord2Vec300-dimensionalembed-dingsfromGooglenews[83].FollowingCollobertetal.,[11]allwordsarelowercasedbeforepassingthroughthelookuptablestoconvertthemintotheircorrespondingembeddingsandalsoallnumbersarereplacedbyasingledigit’0’.Forwordsthatdonothaveanentryinthepre-trainedwordembeddings,the’UNKNOWN’entryfromthepre-trainedembeddingsisused.FollowingLewisandSteedman[23]twosetsoffeaturesareusedinourexperimentsnamelycapitalizationthathasonlytwovaluesindicatingwhetheragivenwordiscapitalizedornotandsuffixesfeatureofsizetwo.图5-1Wordlevelneuralnetwork.-73- 哈尔滨工业大学工学博士学位论文Inourword-levelbasedmodel,theinputwordstotheBLSTMlayerateachtimesteparethesequenceofpre-trainedwordembeddingswherewordsthathavesimilarmeaningcanbemadetocorrespondtoclosevectorrepresentations.However,usingsuchembeddingsinaparticulardomainsuchasBio-GeniacorpusleadstotheOOVproblem:whereNoembeddingsfordomain-specificwords.Forexample,therearesomewordsfromtheBio-Geniadatasetthatarenotpresentinthepre-trainedvectorsreleasedbyGoogleandeveninthetrainingdataoftheCCGBankcorpus.Currentword-basedmodelsareweaktohandleOOVwords,theaimofthisworkistohandlethisweaknessoftheexistingmodels.ThischallengesustoapproachtheproblemofOOVbyaCharacter-levelneuralnetworkspecializedtodealwiththesewords.5.2.2Character-LevelNeuralNetworkSeveraltechniquesforreducingOOVeffectshavebeenintroducedintheliterature.Anadequatesolutionistooperateonindividualcharactersofeachtokenascharactersmayalsoplayanimportantroleinmodelingsemanticmeaningsofwords.Researchintocharacterembeddingsmodelsisstillinthefairlyearlylevelofdevelopment,andmodelsthatoperateexclusivelyoncharactersarenotyetbetterthanword-levelmodelsonmosttasks.WeproposetoaddresstherarewordsprobleminCCGsupertaggingtaskbytrainingcharacterembeddingsneuralnetworkbasedmodel;however,insteadoffullyreplacingwordembeddings,weareinterestedincombiningthetwoapproaches,therebyallowingthemodeltotakeadvantageofinformationfrombothinputrepresentations(wordsandcharacters).Inthecharacterlevelrepresentation,eachwordisdividedintoindividualchar-acters(C1;C2;:::;Cn)thataremappedtoalook-uptableofcharacterembeddings(eC1;eC2;:::;eCn)andthenfedintoBLSTMnetworktoperformcomputationsonbothpreviousandfutureinputsequenceasshownisFigure5-2.ThecharacterembeddingsaregeneratedbytakingthefinalhiddenstatesoftheBLSTMappliedtoembeddingsofcharactersforeachtoken.WethenusethelasthiddenvectorsfromeachoftheLSTMcomponentsandconcatenatethemtogetherasfollows:-74- 第5章Character-WordembeddingsfortheCCGSupertaggingtask ! !ht=LSTM¹eCt;ht 1º;(5-4) ht=LSTM¹eCt;ht+1º;(5-5) ! 
ht=»ht;ht¼:(5-6)图5-2Characterlevelneuralnetwork.5.2.3ConcatenationThegeneraloutlineofourapproachisshownintheFigure5-3.Theinputstotheword-levelnetworkarepre-trainedwordembeddingsrepresentations(seeSection.5.2.1),andindividualcharacterstothecharacter-levelnetworkdescribedinSection5.2.2.Now,wehavetwoalternativefeaturerepresentationsforeachword;oneistheembeddingslearnedonthewordlevel,andthesecondistherepresentationbuiltfromindividualcharactersonthet-thwordoftheinputtext.FollowingLampleetal.,[118],theapproachistoconcatenatethetworepresentationsanduseitasthenewrepresentationinordertogeneratetheprobabilitydistributionovertagsforeachwordinput,suchthatthemodelcanachievebetterperformance.TheoutputsfromeachrepresentationarefirstconcatenatedandusedasthenewinputrepresentationtoanotherBLSTMlayerasthefinalsequencelabelernetwork.Afterward,aSoftmax-75- 哈尔滨工业大学工学博士学位论文activationfunctionisusedtodecodetheoutputsfromBLSTMasprobabilitiesforeachlexicalcategory(CCGsupertags).图5-3Word-CharacterbasedembeddingsmodelfortheCCGsupertagging.5.3ExperimentssettingsInthissection,weprovidedetailsabouthyper-parameterstotraintheWord-Characterneuralnetwork.5.3.1DatasetsTobecomparablewiththeresultsreportedbypreviousworkonCCGsupertaggingmodels[14][23][45][58][59],wecarryoutthesimilardatasetformyexperiments:theCCGBankcorporawiththesameregulardivisionfortraining,developmentandtestsections.Sinceitisimportanttoprovetheeffectivenessofourapproach,wechosetotesttheperformanceofourmodelsonout-of-domaindatasetnamelyBio-Geniacorpus(1000sentences)fromPyysaloetal.,[101].Foreachdataset,thefollowingstepswerenecessarytopreparethedatafortheexperiments:-76- 第5章Character-WordembeddingsfortheCCGSupertaggingtask•Allwordscontaininguppercasewerelowercase.•Alldigitswereconvertedtoasingledigit:′0′.•Forwordsthatcontainthe′n′delimiter,weback-offtothestringbeforethedelimiter.Pre-trainedwordembeddingscapturedsemanticsbetweenwords.Thewordembed-dingswereinitializedwiththe300-dimensionalpubliclyavailablepre-trainedvectors,createdusingWord2Vec[83].Whileusingsuchpre-trainedwordembeddings,somewordsarenotpartoftheoriginaltextcorpusonwhichthewordembeddingswerepre-trainedwhicharecalledOOVwords.Sometimestheseunseenwordsmightbethoserarekeytermswhichareimportantforthesemanticsofthewholetext.Inourexperiments,suchwordswerereplacedbythegeneric’UNKNOWN’tokenforthepre-trainedwordembeddingsbutwerestillusedinthecharacter-levelcomponents.Withtheemergenceofdeeplearning,multiplesoftwarepackagesprovideimplemen-tationsofdeepnetworkmodels,TheimplementationofourneuralnetworksisconductedusingKeras[82]withTheanobackend.Kerasprovidesahigh-levelAPIforneuralnetworksenablingquickexperimentation.Weusetheversion2.0ofKerasonTeslaK40mGPU.Forevaluation,weadopttheofficialevaluationmetricfortheCCGsupertaggingtasktoevaluateourproposedmodelwhichis1-bestaccuracy(themostlikelypredictedsupertag).5.3.2Hyper-ParametersWetestedourneuralnetworkperformancewithvaryingparameters;thefinalchosenhyper-parameterswereselectedaccordingtotheperformanceonthedevelopmentsetgivingthebestaccuracy.TheLSTMhiddendimensionsweresetto256forbothwordandcharactercom-ponents.TheoptimizationalgorithmusedtotrainourmodelwastheAdamoptimizerwithafixedlearningrateof0.001.Performanceonthedevelopmentsetwasmeasuredateveryepoch,andthebest-performingmodelonthedevelopmentsetwasthenusedforevaluation.Anoftenencounteredproblemintrainingneuralnetworksisover-fitting.Ourmodelwasregularizedwithdropouttechniquewhichwasappliedtoeachlayeroftheinputembeddingswithafixedprobabilityof0.5.TheBLSTMneuralnetworkisusedthroughoutourmodel;weutilizeonelayerof-77
- 哈尔滨工业大学工学博士学位论文BLSTMtocomputecharacter-levelembeddingsandonelayerBLSTMtocomputewordlevelembeddingscombinedtogetherandthefedtoanotherBLSTM.Fortheoutputlayers,weusedtheSoftmaxactivationfunctionasthemostpopularactivationfunctionusedinsequencelabelingproblems,todecodeeachoutputateverytimestepintoprobabilitiesforeachsupertagandensurethatalltheoutputsrangefrom0to1andtheirsumis1.5.4ResultsandAnalysisInthissection,wewillcovertwosetsofexperimentsresultstoevaluatetheproposedapproachbasedonthecombinationofcharacterandwordlevelembeddingsonbothin-domainandout-of-domaindatasets,oneissupertaggingandthesecondismulti-tagging.5.4.1SupertaggingresultsWecomparedourWord-CharactercombinedBLSTMsupertaggeragainstmachinelearninganddeeplearningstate-of-the-artbasedmodelsincludingtheMEmodel[14],feedforwardNN[23],simpleRNN[45].WealsocomparedourresultswiththeBLSTMbasedproposedarchitectures:themodelproposedbyLewisetal.,[58]trainedwith2-layerdeepBLSTM,thearchitecturedevelopedbyVaswanietal.,[59]trainedonacombinedarchitectureofBLSTM,languagemodelandbeamsearchtogeneratethefinaloutputs.表5-1Accuracyresultsonthedevelopmentset.ModelAccuracyC&C(goldPOS)92.60C&C(autoPOS)91.50NN91.10RNN93.07BLSTM94.1BLSTM+LM+Beam94.24Ours94.35Table5-1lists1-bestaccuracyofthemodelspredictingthebestCCGlexicalcategoryonthedevelopmentset.Inthestate-of-the-artproposedmodels,allmethodsobtaingoodresultsonthistask,theBLSTMarchitecturesreachthehighestscores,withanaccuracyoutperformingfeedforwardNNandvanillaRNNsaswell,themainreasonisthatBLSTM-78- 第5章Character-WordembeddingsfortheCCGSupertaggingtasknetworksareverystronginmodelingsequentialdataandmemorizinginformationfrombothsidesofaninputforlongperiodsoftime.Beyondthis,ournetworkoutperformsallothernetworks,achievingstate-of-the-artperformancesdemonstratingthataddingmoreinformationwiththecharacter-levelasinputforcesthemodelandimprovetheresults.表5-2Accuracyresultsonthetestset.ModelSection23GeniaC&C(goldPOS)93.3291.85C&C(autoPOS)92.0289.08NN91.5788.16RNN93.0088.27BLSTM94.30–BLSTM+LM+Beam94.5–Ours94.4688.85Table5-2showsthefinalresultsoftheCCGBanktestdata.ToevaluatehowwelltheCharacter-levelcombinationwithWord-basedmodeldo,wealsotestourmodelontheBio-Geniacorpusasout-of-domaindatasets.AsreportedinTable5-2,allBLSTMbasedmodelsobtaingoodresultsonthistask.Itcanbeseenthattheaccuracyhasbeenimprovedsignificantlyinbothin-domainandout-of-domaindatasets.Someofourresultsonthetestsetmayseemveryclosetoothers,thisslightlackofgeneralizationonthetestsetmaysuggestthatmorefineparameteroptimizationsmayleadtoevenbetterresults.Vaswanietal.,[59]obtainthebestresultsonSection23(in-domaintestdata).Ourmodelusedasimplifiedarchitecturewith1-BLSTMlayerasthesequencetagger,whilethelatterusedacombinationofBLSTM,languagemodelandbeamsearchforoutputgen-erationmakingthemodelverystrong.DespitethatourmodelshowsgoodimprovementsonBio-Geniadata(+0.3%).TheonlyoneexceptionistheC&CsupertaggerasweusedBio-GENIAgold-standardCCGlexicalcategoriesdatafromRimellandClark[103]sincenogoldcategoriesareavailableintheBio-inferdataweunderperformtheirresults.TheabilityofWord-level,togetherwithCharacter-leveltoencodeinputrepresenta-tionsmakesourmodelaveryeffectivemodelfortheCCGsupertaggingasastructured-79- 
哈尔滨工业大学工学博士学位论文predictiontaskshowingthataddingcharacterinformationasinputforcesthemodeltohandleOOVwordsforbothin-domainandout-of-domaindata.5.4.2Multi-taggingResultsWeexaminetheeffectivenessoftheproposedarchitectureforotherexperiments,weconductedmulti-taggingexperimentswithdifferentlevelsasdescribedinChapter2.5.2.Bydoingso,wecanassignmorethanonelexicalcategorytoeachwordinaninputsentence.Wecomparedourresultsonsection00withthepreviouslyproposedmulti-taggersaslistedinTable5-3.Table5-3reportsexperimentresultsformulti-taggingwithSENTcolumnasthepercentageofsentenceswhosewordsarealltaggedcorrectlyandWORDcolumnastheaccuracyofwordstobetaggedcorrectlyifthecorrectcategoryisincludedinthesetoftheassignedcategories.Asshownintable5-3,mostofourresultsobtainedwiththecombinationofCharacterandWordembeddingsmodelarestate-of-the-artintermsofbothWORDandSENTaccuracyamongthedifferentlevels.Overall,theCharacterandWordembeddingsmodelalsodemonstrateitssuperiorityinmulti-taggingasinthesupertagging.表5-3Performancecomparisonofdifferentmodelsformulti-taggingaccuracyonSection00fordifferentlevels.OursRNNNNC&C(autopos)C&C(goldpos)WordSENTWordSENTWordSENTWordSENTWordSENT0.07597.3467.9997.3366.0796.8361.2796.3460.2797.3467.430.03098.1775.9998.1274.3997.8170.8397.0565.5097.9272.870.01098.7282.0698.7181.7098.5479.2597.6370.5298.3777.730.00599.0785.6199.0184.7998.8483.3897.8672.2498.5279.250.00199.4891.5799.4190.5499.2989.0798.2580.2499.1787.195.5SummaryInthischapter,weproposedanovelsequencelabelingframeworkforCCGsupertag-gingwithasecondaryobjective-overcomeOOVwordsinthetrainingandout-of-domaindatasets.OnebidirectionalLSTMistrainedforwordinputs,whereasanotheroneistrainedforindividualcharactersofeachword.Atthesametime,bothofthoseare-80- 第5章Character-WordembeddingsfortheCCGSupertaggingtaskcombined,inordertopredictthemostprobablelabelforeachword.ThemodelwehavedescribedisasimpleandeffectivebasedonCharactersandWordembeddingsapproachfortheCCGsupertaggingtask.Theobjectiveoflearningcharacterlevelembeddingsprovidesanadditionalsourceofinformationduringtrainingforunseenandinfrequentwordsinthetrainingdata.ThisadditionaltrainingobjectiveleadstomoreaccuratemodelonthesupertaggingfortheCCGgrammar.Ourmethodimprovesperformanceonin-domainandout-of-domaindatasetsonbothsupertaggingandmulti-taggingtasks.Theexperimentalresultsshowthatthemodelisefficientwhilestillachievingbetterperformancesthansomestate-of-the-artmethods.-81- 
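As a compact reference for the architecture in Figure 5-3, the following sketch wires the word-level and character-level BLSTMs together in modern tf.keras (the experiments above used Keras 2.0 with a Theano backend). All sizes and vocabularies are illustrative placeholders, the word table would in practice be initialized from the pre-trained Word2Vec vectors, and the capitalization and suffix features of Section 5.2.1 are omitted for brevity; only the overall wiring follows the model described in this chapter.

```python
# Hedged sketch of the Chapter 5 word + character model (not the original code).
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_WORDS, MAX_CHARS = 60, 20
WORD_VOCAB, CHAR_VOCAB, N_TAGS = 50000, 100, 425
W_DIM, C_DIM, HIDDEN = 300, 30, 256

word_in = layers.Input(shape=(MAX_WORDS,), dtype="int32")
char_in = layers.Input(shape=(MAX_WORDS, MAX_CHARS), dtype="int32")

# word-level component (Section 5.2.1); initialize from Word2Vec in practice
w = layers.Embedding(WORD_VOCAB, W_DIM)(word_in)
w = layers.Dropout(0.5)(w)
w = layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True))(w)

# character-level component (Section 5.2.2): a BLSTM over the characters of each
# word, keeping only the final forward/backward states as that word's representation
c = layers.Embedding(CHAR_VOCAB, C_DIM)(char_in)          # (batch, words, chars, C_DIM)
c = layers.Dropout(0.5)(c)
c = layers.TimeDistributed(layers.Bidirectional(layers.LSTM(HIDDEN)))(c)

# concatenation and final sequence labeler with Softmax outputs (Section 5.2.3)
x = layers.Concatenate()([w, c])
x = layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True))(x)
out = layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax"))(x)

model = Model([word_in, char_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy")
model.summary()
```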
结论

In this thesis, we investigated and developed different techniques and approaches for supertagging applied to the CCG grammar. In particular, we focused on deep learning based methods and worked on the English CCGBank corpus for the CCG supertagging problem. The major contributions of this thesis are summarized below:

(1) A deep learning method for CCG supertagging is proposed. We proposed the use of GRU models, which can memorize and represent input sequences over long periods of time. The experimental results show that BGRUs are effective for supertagging the CCG grammar, while still achieving better performance than the current state-of-the-art methods. We also introduced the multi-tagging strategy for predicting CCG supertags, where our supertagger can select more than one CCG category per word; we measure both WORD and SENTENCE accuracy, and in both cases we obtain state-of-the-art results.

(2) Next, a new approach for CCG supertagging based on LSTM networks is presented. The proposed method is based on BLSTMs: a backward LSTM is introduced to combine the input lookup tables, then the current de facto standard for sequence labeling tasks, a BLSTM-based neural network, is applied, and a Softmax activation function predicts the final outputs. We tested the efficiency of the proposed method for both supertagging and multi-tagging. The experimental results on three different datasets show that the Backward-BLSTM technique is effective for the task and still achieves better performance than the current state-of-the-art methods.

(3) In Chapter 4, we proposed a simple and effective CCG supertagging method based on the combination of BLSTM and CRF models. Compared with state-of-the-art models, the combined model obtains state-of-the-art results. This is achieved by taking advantage of BLSTM models and strengthening the prediction layer with a CRF model. Evaluations on in-domain and out-of-domain datasets against the state of the art demonstrate the effectiveness of the proposed method.

(4) In Chapter 5, we proposed a novel sequence labeling framework for CCG supertagging with a secondary objective: overcoming OOV words in the training and out-of-domain datasets. One BLSTM is trained over word inputs, whereas another is trained over the individual characters of each word; the two are then combined in order to predict the most probable label for each word. Learning character-level embeddings provides an additional source of information during training for unseen and infrequent words in the training data, and this additional training objective leads to a more accurate model on the CCG supertagging task.

In summary, all the models described in this dissertation are simple and efficient for automatic CCG supertagging of English text, even in the presence of rare and unseen words. The models achieve much higher accuracy than the machine learning models and are state-of-the-art. Although our proposed models prove their effectiveness for the CCG supertagging task, a basic limitation of GRU and LSTM networks is that they rely on an internal memory with a gating mechanism that deletes and updates information over the input sequence. Predicting CCG supertags by making multiple computation steps over an input story may be very beneficial to our task, integrating previously learned information from multiple sentences as a global memory, which can be done with end-to-end memory networks via multiple hops over the memory. We leave this as future work.

Further work in this area could be done in several directions; some can be taken up as immediate goals, and others can be considered long-term goals. Regarding BLSTM-based CCG supertagging models, there are possible extensions that should be taken into consideration as immediate goals and that we think deserve study, such as multidimensional LSTMs. We also plan to explore other deep learning algorithms such as Convolutional Neural Networks (CNNs), which have proven to be very beneficial for composing word representations from characters and for encoding context information [119]. Moreover, applying reinforcement learning, which aims to automatically determine the ideal behavior within a specific context so as to maximize performance, together with self-training, would be very advantageous for building more accurate supertaggers. As one long-term goal, it would be beneficial and useful to apply our supertaggers to NLP tasks such as parsing, MT, and QA systems. It would also be interesting to integrate our supertaggers with existing parsers such as the C&C parser and to test them on several languages with different datasets.
哈尔滨工业大学工学博士学位论文参考文献[1]MarcusMP,MarcinkiewiczMA,SantoriniB.BuildingalargeannotatedcorpusofEnglish:ThePennTreebank[J].Computationallinguistics,1993,19(2):313–330.[2]NadeauD,SekineS.Asurveyofnamedentityrecognitionandclassification[J].LingvisticaeInvestigationes,2007,30(1):3–26.[3]PiskorskiJ,YangarberR.InformationExtraction:Past,PresentandFuture[J].Multi-source,MultilingualInformationExtractionandSummarization,2013:23–49.[4]BangaloreS.Complexityoflexicaldescriptionsanditsrelevancetopartialpars-ing[D].[S.l.]:UniversityofPennsylvania,1997:77–79.[5]SrinivasB."Almostparsing"techniqueforlanguagemodeling[C]//SpokenLan-guage,1996.ICSLP96.Proceedings.,FourthInternationalConferenceon:Vol2.1996:1173–1176.[6]ChandrasekarR,DoranC,SrinivasB.Motivationsandmethodsfortextsimplifica-tion[C]//Proceedingsofthe16thconferenceonComputationallinguistics-Volume2.1996:1041–1044.[7]BangaloreS,JoshiAK.Supertagging:Anapproachtoalmostparsing[J].Compu-tationallinguistics,1999,25(2):237–265.[8]MatsuzakiT,MiyaoY,TsujiiJ.ProbabilisticCFGwithlatentannotations[C]//Proceedingsofthe43rdAnnualMeetingoftheAssociationforComputationalLinguistics(ACL’05).2005:75–82.[9]ClarkS.Supertaggingforcombinatorycategorialgrammar[C]//ProceedingsoftheSixthInternationalWorkshoponTreeAdjoiningGrammarandRelatedFrameworks(TAG+6).2002:19–24.[10]SteedmanM,BaldridgeJ.Combinatorycategorialgrammar[J].Encyclopediaoflanguageandlinguistics,2006,2:610–622.[11]CollobertR,WestonJ,BottouL,etal.Naturallanguageprocessing(almost)fromscratch[J].JournalofMachineLearningResearch,2011,12(Aug):2493–2537.-84- 参考文献[12]DandapatS.Part-of-speechtaggingforBengali[D].[S.l.]:DepartmentofComputerScienceandEngineeringIndianInstituteofTechnology,KharagpurJanuary,2009:3–7.[13]JoshiAK,SrinivasB.Disambiguationofsuperpartsofspeech(orsupertags):Al-mostparsing[C]//Proceedingsofthe15thconferenceonComputationallinguistics-Volume1.1994:154–160.[14]ClarkS,CurranJR.Wide-coverageefficientstatisticalparsingwithCCGandlog-linearmodels[J].ComputationalLinguistics,2007,33(4):493–552.[15]AuliM.CCG-basedmodelsforstatisticalmachinetranslation[D].[S.l.]:Ph.D.Proposal,UniversityofEdinburgh,2009:11–15.[16]NadejdeM,ReddyS,SennrichR,etal.PredictingTargetLanguageCCGSupertagsImprovesNeuralMachineTranslation[C]//ProceedingsoftheSecondConferenceonMachineTranslation.2017:68–79.[17]ClarkS,SteedmanM,CurranJR.Object-extractionandquestion-parsingusingCCG[C]//Proceedingsofthe2004ConferenceonEmpiricalMethodsinNaturalLanguageProcessing.2004:111–118.[18]Bar-HillelY.Aquasi-arithmeticalnotationforsyntacticdescription[J].Language,1953,29(1):47–58.[19]SteedmanM.Categorialgrammar[J].TechnicalReports(CIS),1992:466.[20]AdesAE,SteedmanMJ.Ontheorderofwords[J].Linguisticsandphilosophy,1982,4(4):517–558.[21]NakornTN.CombinatoryCategorialGrammarParserinNaturalLanguageToolkit[J],2009:1–19.[22]ZhangY,ClarkS.Shift-reduceCCGparsing[C]//Proceedingsofthe49thAn-nualMeetingoftheAssociationforComputationalLinguistics:HumanLanguageTechnologies-Volume1.2011:683–692.[23]LewisM,SteedmanM.ImprovedCCGparsingwithsemi-supervisedsupertag-ging[J].TransactionsoftheAssociationforComputationalLinguistics,2014,2:327–338.[24]LewisM,HeL,ZettlemoyerL.Jointa*ccgparsingandsemanticrolelabelling[C]//Proceedingsofthe2015ConferenceonEmpiricalMethodsinNaturalLanguageProcessing.2015:1444–1454.-85- 
哈尔滨工业大学工学博士学位论文[25]ZettlemoyerLS,CollinsM.OnlineLearningofRelaxedCCGGrammarsforParsingtoLogicalForm[J].EMNLP-CoNLL2007,2007:678.[26]BaralC,DzifcakJ,SonTC.Usinganswersetprogrammingandlambdacalcu-lustocharacterizenaturallanguagesentenceswithnormativesandexceptions[C]//Proceedingsofthe23rdnationalconferenceonArtificialintelligence-Volume2.2008:818–823.[27]BirchA,OsborneM,KoehnP.CCGsupertagsinfactoredstatisticalmachinetrans-lation[C]//ProceedingsoftheSecondWorkshoponStatisticalMachineTransla-tion.2007:9–16.[28]SteedmanM.Thesyntacticprocess[M].[S.l.]:TheMITPress,2000:31–34.[29]BangaloreS,JoshiAK.Supertagging:UsingComplexLexicalDescriptionsinNaturalLanguageProcessing[M].[S.l.]:TheMITPress,2010:219–354.[30]AmbatiBR,DeoskarT,SteedmanM.UsingCCGcategoriestoimproveHindide-pendencyparsing[C]//Proceedingsofthe51stAnnualMeetingoftheAssociationforComputationalLinguistics(Volume2:ShortPapers):Vol2.2013:604–609.[31]JinlongZ,XipengQ.CHINESECCGPARSINGBASEDONA*SEARCHANDSUPERTAGGING[J].ComputerApplicationsandSoftware,2014,9:059.[32]ChenJ,BangaloreS,CollinsM,etal.Rerankingann-gramsupertagger[C]//ProceedingsoftheSixthInternationalWorkshoponTreeAdjoiningGrammarandRelatedFrameworks(TAG+6).2002:259–268.[33]SrinivasB.Performanceevaluationofsupertaggingforpartialparsing[C]//ProceedingsoftheFifthInternationalWorkshoponParsingTechnologies.1997:187–198.[34]ChenJ.Towardsefficientstatisticalparsingusinglexicalizedgrammaticalinfor-mation[D].[S.l.]:UniversityofDelaware,2001:7–54.[35]RATNAPARKHIA.MaximumEntropyModelforPart-Of-SpeechTagging[J].Proc.EmpiricalMethodforNaturalLanguageProcessings,1996:133–142.[36]BrantsT.TnT:astatisticalpart-of-speechtagger[C]//ProceedingsofthesixthconferenceonAppliednaturallanguageprocessing.2000:224–231.[37]RatnaparkhiA.Maximumentropymodelsfornaturallanguageambiguityresolu-tion[J].PhDthesis.UniversityofPennsylvania,1998:32–36.-86- 
参考文献[38]HockenmaierJ.Dataandmodelsforstatisticalparsingwithcombinatorycategorialgrammar[D].[S.l.]:UniversityofEdinburgh,2003:41–107.[39]HockenmaierJ,SteedmanM.CCGbank:acorpusofCCGderivationsanddepen-dencystructuresextractedfromthePennTreebank[J].ComputationalLinguistics,2007,33(3):355–396.[40]ClarkS,CurranJR.Theimportanceofsupertaggingforwide-coverageCCGparsing[C]//Proceedingsofthe20thinternationalconferenceonComputationalLinguistics:Vol282.2004.[41]TurianJ,RatinovL,BengioY.Wordrepresentations:asimpleandgeneralmethodforsemi-supervisedlearning[C]//Proceedingsofthe48thannualmeetingoftheassociationforcomputationallinguistics.2010:384–394.[42]CurranJR,ClarkS,VadasD.Multi-taggingforlexicalized-grammarparsing[C]//Proceedingsofthe21stInternationalConferenceonComputationalLinguisticsandthe44thannualmeetingoftheAssociationforComputationalLinguistics.2006:697–704.[43]RumelhartDE,HintonGE,WilliamsRJ.Learningrepresentationsbyback-propagatingerrors[J].nature,1986,323(6088):533.[44]ElmanJL.Findingstructureintime[J].Cognitivescience,1990,14(2):179–211.[45]XuW,AuliM,ClarkS.CCGsupertaggingwitharecurrentneuralnetwork[C]//Proceedingsofthe53rdAnnualMeetingoftheAssociationforComputationalLinguisticsandthe7thInternationalJointConferenceonNaturalLanguagePro-cessing(Volume2:ShortPapers):Vol2.2015:250–255.[46]SchusterM,PaliwalKK.Bidirectionalrecurrentneuralnetworks[J].IEEETrans-actionsonSignalProcessing,1997,45(11):2673–2681.[47]SchusterM.Onsupervisedlearningfromsequentialdatawithapplicationsforspeechrecognition[J].Daktarodisertacija,NaraInstituteofScienceandTechnol-ogy,1999:37–39.[48]BaldiP,BrunakS,FrasconiP,etal.Exploitingthepastandthefutureinproteinsecondarystructureprediction[J].Bioinformatics,1999,15(11):937–946.[49]XuW.LSTMshift-reduceCCGparsing[C]//Proceedingsofthe2016ConferenceonEmpiricalMethodsinNaturalLanguageProcessing.2016:1754–1764.-87- 
哈尔滨工业大学工学博士学位论文[50]BengioY,SimardP,FrasconiP.Learninglong-termdependencieswithgradientdescentisdifficult[J].IEEEtransactionsonneuralnetworks,1994,5(2):157–166.[51]WaibelA,HanazawaT,HintonG,etal.Phonemerecognitionusingtime-delayneuralnetworks[C]//Readingsinspeechrecognition.1990:393–404.[52]LinT,HorneBG,TinoP,etal.Learninglong-termdependenciesinNARXrecurrentneuralnetworks[J].IEEETransactionsonNeuralNetworks,1996,7(6):1329–1338.[53]ElHihiS,BengioY.Hierarchicalrecurrentneuralnetworksforlong-termdepen-dencies[C]//Advancesinneuralinformationprocessingsystems.1996:493–499.[54]JaegerH,LukoševičiusM,PopoviciD,etal.Optimizationandapplicationsofechostatenetworkswithleaky-integratorneurons[J].Neuralnetworks,2007,20(3):335–352.[55]MartensJ,SutskeverI.Learningrecurrentneuralnetworkswithhessian-freeopti-mization[C]//Proceedingsofthe28thInternationalConferenceonMachineLearn-ing(ICML-11).2011:1033–1040.[56]PascanuR,MikolovT,BengioY.Onthedifficultyoftrainingrecurrentneuralnetworks[C]//InternationalConferenceonMachineLearning.2013:1310–1318.[57]HochreiterS,SchmidhuberJ.Longshort-termmemory[J].Neuralcomputation,1997,9(8):1735–1780.[58]LewisM,LeeK,ZettlemoyerL.Lstmccgparsing[C]//Proceedingsofthe2016ConferenceoftheNorthAmericanChapteroftheAssociationforComputationalLinguistics:HumanLanguageTechnologies.2016:221–231.[59]VaswaniA,BiskY,SagaeK,etal.Supertaggingwithlstms[C]//Proceedingsofthe2016ConferenceoftheNorthAmericanChapteroftheAssociationforComputationalLinguistics:HumanLanguageTechnologies.2016:232–237.[60]ChenJ,ShankerVK.AutomatedextractionofTAGsfromthePennTreebank[G]//Newdevelopmentsinparsingtechnology.[S.l.]:Springer,2004:73–89.[61]XiaF,PalmerM,JoshiA.Auniformmethodofgrammarextractionanditsapplications[C]//Proceedingsofthe2000JointSIGDATconferenceonEmpiricalmethodsinnaturallanguageprocessingandverylargecorpora:heldinconjunctionwiththe38thAnnualMeetingoftheAssociationforComputationalLinguistics-Volume13.2000:53–62.-88- 
参考文献[62]BurkeM,LamO,CahillA,etal.Treebank-basedacquisitionofaChineselexical-functionalgrammar[C]//Proceedingsofthe18thPacificAsiaConferenceonLan-guage,InformationandComputation.2004:161–172.[63]MiyaoY,NinomiyaT,TsujiiJ.Corpus-orientedgrammardevelopmentforac-quiringahead-drivenphrasestructuregrammarfromthepenntreebank[C]//InternationalConferenceonNaturalLanguageProcessing.2004:684–693.[64]ChoK,vanMerrienboerB,GulcehreC,etal.LearningPhraseRepresentationsusingRNNEncoder–DecoderforStatisticalMachineTranslation[C]//Proceedingsofthe2014ConferenceonEmpiricalMethodsinNaturalLanguageProcessing(EMNLP).2014:1724–1734.[65]HintonGE,OsinderoS,TehY-W.Afastlearningalgorithmfordeepbeliefnets[J].Neuralcomputation,2006,18(7):1527–1554.[66]BengioY,others.LearningdeeparchitecturesforAI[J].Foundationsandtrends®inMachineLearning,2009,2(1):1–127.[67]RanzatoM,SzummerM.Semi-supervisedlearningofcompactdocumentrepresen-tationswithdeepnetworks[C]//Proceedingsofthe25thinternationalconferenceonMachinelearning.2008:792–799.[68]CollobertR,WestonJ.Aunifiedarchitecturefornaturallanguageprocessing:Deepneuralnetworkswithmultitasklearning[C]//Proceedingsofthe25thinternationalconferenceonMachinelearning.2008:160–167.[69]MnihA,HintonGE.Ascalablehierarchicaldistributedlanguagemodel[C]//Advancesinneuralinformationprocessingsystems.2009:1081–1088.[70]WestonJ,RatleF,MobahiH,etal.Deeplearningviasemi-supervisedembed-ding[G]//NeuralNetworks:TricksoftheTrade.[S.l.]:Springer,2012:639–655.[71]ArelI,RoseDC,KarnowskiTP.Deepmachinelearning-anewfrontierinarti-ficialintelligenceresearch[researchfrontier][J].IEEEcomputationalintelligencemagazine,2010,5(4):13–18.[72]YuD,DengL.Deeplearninganditsapplicationstosignalandinformationprocess-ing[exploratorydsp][J].IEEESignalProcessingMagazine,2011,28(1):145–154.[73]HintonG,DengL,YuD,etal.Deepneuralnetworksforacousticmodelinginspeechrecognition:Thesharedviewsoffourresearchgroups[J].IEEESignalProcessingMagazine,2012,29(6):82–97.-89- 
哈尔滨工业大学工学博士学位论文[74]BengioY,CourvilleA,VincentP.Representationlearning:Areviewandnewperspectives[J].IEEEtransactionsonpatternanalysisandmachineintelligence,2013,35(8):1798–1828.[75]HammerB.Ontheapproximationcapabilityofrecurrentneuralnetworks[J].Neu-rocomputing,2000,31(1-4):107–123.[76]JordanM.Attractordynamicsandparallelisminaconnectionistsequentialma-chine[C]//EighthAnnualConferenceoftheCognitiveScienceSociety,1986.1986:513–546.[77]LangKJ,WaibelAH,HintonGE.Atime-delayneuralnetworkarchitectureforisolatedwordrecognition[J].Neuralnetworks,1990,3(1):23–43.[78]JaegerH.The“echostate”approachtoanalysingandtrainingrecurrentneuralnetworks-withanerratumnote[J].Bonn,Germany:GermanNationalResearchCenterforInformationTechnologyGMDTechnicalReport,2001,148(34):13.[79]HochreiterS.UntersuchungenzudynamischenneuronalenNetzen[J].Diploma,TechnischeUniversitätMünchen,1991,91:1.[80]HochreiterS,BengioY,FrasconiP,etal.Gradientflowinrecurrentnets:thedifficultyoflearninglong-termdependencies.(2001)[J].Citedon,2001:114.[81]BengioY,DucharmeR,VincentP,etal.Aneuralprobabilisticlanguagemodel[J].Journalofmachinelearningresearch,2003,3(Feb):1137–1155.[82]CholletF.Keras:Theano-baseddeeplearninglibrary[J].Code:https://github.com/fchollet.Documentation:http://keras.io,2015.[83]MikolovT,SutskeverI,ChenK,etal.Distributedrepresentationsofwordsandphrasesandtheircompositionality[C]//Advancesinneuralinformationprocessingsystems.2013:3111–3119.[84]ZeilerMD.ADADELTA:AnAdaptiveLearningRateMethod[J].CoRR,2012,abs/1212.5701.[85]HintonGE,SrivastavaN,KrizhevskyA,etal.Improvingneuralnetworksbypreventingco-adaptationoffeaturedetectors[J].arXivpreprintarXiv:1207.0580,2012.[86]ChenJ,BangaloreS,CollinsM,etal.Rerankingann-gramsupertagger[C]//ProceedingsoftheSixthInternationalWorkshoponTreeAdjoiningGrammarandRelatedFrameworks(TAG+6).2002:259–268.-90- 
参考文献[87]GersF.Longshort-termmemoryinrecurrentneuralnetworks[D].[S.l.]:Unpub-lishedPhDdissertation,EcolePolytechniqueFédéraledeLausanne,Lausanne,Switzerland,2001:15–20.[88]PetersM,AmmarW,BhagavatulaC,etal.Semi-supervisedsequencetaggingwithbidirectionallanguagemodels[C]//Proceedingsofthe55thAnnualMeetingoftheAssociationforComputationalLinguistics(Volume1:LongPapers):Vol1.2017:1756–1765.[89]PlankB,SøgaardA,GoldbergY.MultilingualPart-of-SpeechTaggingwithBidi-rectionalLongShort-TermMemoryModelsandAuxiliaryLoss[C]//Proceedingsofthe54thAnnualMeetingoftheAssociationforComputationalLinguistics(Vol-ume2:ShortPapers):Vol2.2016:412–418.[90]LimsopathamN,CollierN.BidirectionalLSTMforNamedEntityRecognitioninTwitterMessages[C]//Proceedingsofthe2ndWorkshoponNoisyUser-generatedText(WNUT).2016:145–152.[91]YanS,HardmeierC,NivreJ.MultilingualNamedEntityRecognitionusingHy-bridNeuralNetworks[C]//TheSixthSwedishLanguageTechnologyConference(SLTC).2016.[92]TangD,QinB,FengX,etal.EffectiveLSTMsforTarget-DependentSentimentClassification[C]//ProceedingsofCOLING2016,the26thInternationalConfer-enceonComputationalLinguistics:TechnicalPapers.2016:3298–3307.[93]WangY,HuangM,ZhaoL,etal.Attention-basedlstmforaspect-levelsentimentclassification[C]//Proceedingsofthe2016ConferenceonEmpiricalMethodsinNaturalLanguageProcessing.2016:606–615.[94]YangM,TuW,WangJ,etal.AttentionBasedLSTMforTargetDependentSenti-mentClassification.[C]//AAAI.2017:5013–5014.[95]GravesA,MohamedA-r,HintonG.Speechrecognitionwithdeeprecurrentneuralnetworks[C]//Acoustics,speechandsignalprocessing(icassp),2013ieeeinterna-tionalconferenceon.2013:6645–6649.[96]WangP,QianY,SoongFK,etal.Part-of-speechtaggingwithbidirectionallongshort-termmemoryrecurrentneuralnetwork[J].arXivpreprintarXiv:1510.06168,2015.-91- 哈尔滨工业大学工学博士学位论文[97]ChiuJP,NicholsE.NamedEntityRecognitionwithBidirectionalLSTM-CNNs[J].TransactionsoftheAssociationforComputationalLinguistics,2016,4:357–370.[98]GravesA,SchmidhuberJ.FramewisephonemeclassificationwithbidirectionalLSTMandotherneuralnetworkarchitectures[J].NeuralNetworks,2005,18(5-6):602–610.[99]LingW,DyerC,BlackAW,etal.Two/toosimpleadaptationsofword2vecforsyntaxproblems[C]//Proceedingsofthe2015ConferenceoftheNorthAmeri-canChapteroftheAssociationforComputationalLinguistics:HumanLanguageTechnologies.2015:1299–1304.[100]HonnibalM,NothmanJ,CurranJR.EvaluatingastatisticalCCGparseronWikipedia[C]//Proceedingsofthe2009WorkshoponThePeople’sWebMeetsNLP:CollaborativelyConstructedSemanticResources.2009:38–41.[101]PyysaloS,GinterF,HeimonenJ,etal.BioInfer:acorpusforinformationextractioninthebiomedicaldomain[J].BMCbioinformatics,2007,8(1):50.[102]PhamV,BlucheT,KermorvantC,etal.Dropoutimprovesrecurrentneuralnetworksforhandwritingrecognition[C]//FrontiersinHandwritingRecognition(ICFHR),201414thInternationalConferenceon.2014:285–290.[103]RimellL,ClarkS.Adaptingalexicalized-grammarparsertocontrastingdo-mains[C]//ProceedingsoftheConferenceonEmpiricalMethodsinNaturalLan-guageProcessing.2008:475–484.[104]CharniakE,CarrollG,AdcockJ,etal.Taggersforparsers[J].ArtificialIntelligence,1996,85(1-2):45–57.[105]LaffertyJ,McCallumA,PereiraFC.ConditionalRandomFields:ProbabilisticModelsforSegmentingandLabelingSequenceData[C]//Proceedingsofthe18thInternationalConferenceonMachineLearning:Vol951.2001:282–289.[106]McCallumA,FreitagD,PereiraFC.MaximumEntropyMarkovModelsforInformationExtractionandSegmentation.[C]//Icml:Vol17.2000:591–598.[107]PintoD,McCallumA,WeiX,etal.Tableextractionusingconditionalrandomfields[C]//Proceedingsofthe26thannualinternationalACMSIGIRconferenceonResearchandd
evelopmentininformaionretrieval.2003:235–242.-92- 参考文献[108]ShaF,PereiraF.Shallowparsingwithconditionalrandomfields[C]//Proceedingsofthe2003ConferenceoftheNorthAmericanChapteroftheAssociationforCom-putationalLinguisticsonHumanLanguageTechnology-Volume1.2003:134–141.[109]SuttonC,McCallumA.Anintroductiontoconditionalrandomfieldsforrelationallearning:Vol2[M].[S.l.]:Introductiontostatisticalrelationallearning.MITPress,2006:9–21.[110]HuangZ,XuW,YuK.BidirectionalLSTM-CRFmodelsforsequencetagging[J].arXivpreprintarXiv:1508.01991,2015.[111]KingmaDP,BaJ.Adam:Amethodforstochasticoptimization[J].arXivpreprintarXiv:1412.6980,2015.[112]BastienF,LamblinP,PascanuR,etal.Theano:newfeaturesandspeedimprove-ments[J].arXivpreprintarXiv:1211.5590,2012.[113]BlitzerJ,DredzeM,PereiraF.Biographies,bollywood,boom-boxesandblenders:Domainadaptationforsentimentclassification[C]//Proceedingsofthe45thannualmeetingoftheassociationofcomputationallinguistics.2007:440–447.[114]DauméIIIH,JagarlamudiJ.Domainadaptationformachinetranslationbyminingunseenwords[C]//Proceedingsofthe49thAnnualMeetingoftheAssociationforComputationalLinguistics:HumanLanguageTechnologies:shortpapers-Volume2.2011:407–412.[115]KimY.ConvolutionalNeuralNetworksforSentenceClassification[C]//Proceedingsofthe2014ConferenceonEmpiricalMethodsinNaturalLanguageProcessing(EMNLP).2014:1746–1751.[116]TangD,WeiF,YangN,etal.Learningsentiment-specificwordembeddingfortwittersentimentclassification[C]//Proceedingsofthe52ndAnnualMeetingoftheAssociationforComputationalLinguistics(Volume1:LongPapers):Vol1.2014:1555–1565.[117]JoshiA,TripathiV,PatelK,etal.AreWordEmbedding-basedFeaturesUsefulforSarcasmDetection?[C]//Proceedingsofthe2016ConferenceonEmpiricalMethodsinNaturalLanguageProcessing.2016:1006–1011.-93- 哈尔滨工业大学工学博士学位论文[118]LampleG,BallesterosM,SubramanianS,etal.NeuralArchitecturesforNamedEntityRecognition[C]//ProceedingsofNAACL-HLT.2016:260–270.[119]YuX,FalenskaA,VuNT.AGeneral-PurposeTaggerwithConvolutionalNeuralNetworks[C]//ProceedingsoftheFirstWorkshoponSubwordandCharacterLevelModelsinNLP.2017:124–129.-94- 攻读博士学位期间发表的论文及其他成果攻读博士学位期间发表的论文及其他成果(一)发表的学术论文[1]RekiaKadari,YuZhang,WeinanZhang,TingLiu.CCGsu-pertaggingwithbidirectionallongshort-termmemorynetworks[J].NaturalLanguageEngineering,2018,24(1):77-90.[Online].Available:https://www.cambridge.org/core/journals/natural-language-engineering/article/ccg-supertagging-with-bidirectional-long-shortterm-memory-networks/8C06FF6F717744B29C9BD330CABACD16.(SCI,IF=1.065).[2]RekiaKadari,YuZhang,WeinanZhang,TingLiu.CCGSu-pertaggingviaBidirectionalLSTM-CRFNeuralArchitecture.[J].Neurocomputing,2018,283:31-37.[Online].Available:https://www.sciencedirect.com/science/article/pii/S0925231217319124.(SCI,IF=3.317).[3]RekiaKadari,YuZhang,WeinanZhang,TingLiu.GatedRecurrentUnitmodelforaSequenceTaggingproblem.[J].HighTechnologyLetters,2018.(EI-index,Accepted).-95- 
致谢致谢MyheartfeltgratitudegoestotheAlmightyGod,ALLAHforthewisdom,knowl-edge,abilityandthestrengthgivenmefromthebeginningofmystudiestothecompletionofthiswork.Secondly,IamgratefultomysupervisorProf.LiuTingforgivingmetheopportunitytojointheSCIRlaboratory.HisguidanceandmotivationinspiredmethroughtheentiredurationofmyPh.D.studies.IamindebtedtomyassociatesupervisorProf.ZhangYuforhispatience,advice,andsupervisionwithcontinuoussupportandhelpfuldiscussionsthroughoutallthework.Ihavegreatlybenefitedfromhisideasandrecommendations.IamforevergratefulSIR!Ioweaparticulardebttomyparentsfortheirpatience,support,andencouragementsduringthisresearchandallmylife.Myheartythanksalsogotoallmyfamilymemberswhoencouragedmeandprayedformethroughoutthetimeofmyresearch.IalsoextendmyappreciationtoProf.QinBing,Prof.Chewanxiang,Zhangwei-nanandZhaoYanyanfortheirguidanceandhelp.SpecialthanksgotoLiuYijia,GuoMaosheng,QingyuYin,WangXuxiang,JiangGuo,WangBinghao,QiLeandallmylab-matesfromtheSCIRlaboratory,especiallyQAgroup.Thankyouverymuch.TomyfriendLydia,thankyouforlistening,offeringmeadvice,andsupportingmethroughthisentireprocess.Ithankallwhoinonewayoranothercontributedinthecompletionofthisthesis.IwouldliketodedicatethisworktomymotherMrs.BelgacemKheirawhosedreamsformehaveresultedinthisachievementandwithoutherlove,support,andblessings;IwouldnothavebeenwhereIamtodayandwhatIamtoday.Youhavealwaysbeenpresentforme,youaremyBestfriend.Thankyouwithallmyheart.Thisoneisforyoumom!RekiaKADARI-97- 哈尔滨工业大学工学博士学位论文个人简历•Name:RekiaKadari•Nationality:Algerian•Languages:English&Arabic&French•DateofBirth:27-May-1990•Sex:Female•Maritalstatus:Single•PresentAddress:HarbinInstituteofTechnology,Harbin,Heilongjiang,150001•Telephone:15776462745•Email:rekia@ir.hit.edu.cnProfessionalqualifications:•2014-2018:(Ph.D.inSocialComputingandInformationRetrievallaboratory),GraduateStudent,SchoolofComputerScienceandTechnology,HarbinInstituteofTechnology,Harbin,China.•2011-2013:(M.Sc.inComputerScience),M.Sc.inComputerScienceFacultyofScienceandTechnology,June2013,UniversityDr.TaharMoulay,Saida,Algeria.•2008-2011:(B.Sc.inComputerScience),b.Sc.inComputerScienceFacultyofScienceandTechnology,June2011,UniversityDr.TaharMoulay,Saida,Algeria.Subjectstaught:1.NaturalLanguageProcessing2.ArtificialIntelligence3.MachineLearning4.DeepLearning5.Sequencelabeling6.CCGsupertagging-98-