Structured Topic Models for Language

Hanna M. Wallach
B.A., University of Cambridge (2001); M.Sc., University of Edinburgh (2002)
Newnham College
University of Cambridge

THESIS
Submitted for the degree of Doctor of Philosophy, University of Cambridge
2008

Abstract

This thesis introduces new methods for statistically modelling text using topic models. Topic models have seen many successes in recent years, and are used in a variety of applications, including analysis of news articles, topic-based search interfaces and navigation tools for digital libraries. Despite these recent successes, the field of topic modelling is still relatively new and there remains much to be explored. One noticeable absence from most of the previous work on topic modelling is consideration of language and document structure, from low-level structures, including word order and syntax, to higher-level structures, such as relationships between documents.

The focus of this thesis is therefore structured topic models: models that combine latent topics with information about document structure, ranging from local sentence structure to inter-document relationships. These models draw on techniques from Bayesian statistics, including hierarchical Dirichlet distributions and processes, Pitman-Yor processes, and Markov chain Monte Carlo methods. Several methods for estimating the parameters of Dirichlet-multinomial distributions are also compared.

The main contribution of this thesis is the introduction of three structured topic models. The first is a topic-based language model. This model captures both word order and latent topics by extending a Bayesian topic model to incorporate n-gram statistics. A bigram version of the new model does better at predicting future words than either a topic model or a trigram language model. It also provides interpretable topics.

The second model arises from a Bayesian reinterpretation of a classic generative dependency parsing model. The new model demonstrates that parsing performance can be substantially improved by a careful choice of prior and by sampling hyperparameters. Additionally, the generative nature of the model facilitates the inclusion of latent state variables, which act as specialised part-of-speech tags or “syntactic topics”.

The third is a model that captures high-leve