Journal of Machine Learning Research 3 (2003) 993-1022    Submitted 2/02; Published 1/03

Latent Dirichlet Allocation

David M. Blei    BLEI@CS.BERKELEY.EDU
Computer Science Division
University of California
Berkeley, CA 94720, USA

Andrew Y. Ng    ANG@CS.STANFORD.EDU
Computer Science Department
Stanford University
Stanford, CA 94305, USA

Michael I. Jordan    JORDAN@CS.BERKELEY.EDU
Computer Science Division and Department of Statistics
University of California
Berkeley, CA 94720, USA

Editor: John Lafferty

Abstract

We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.

1. Introduction

In this paper we consider the problem of modeling text corpora and other collections of discrete data. The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments.

Significant progress has been made on this problem by researchers in the field of information retrieval (IR) (Baeza-Yates and Ribeiro-Neto, 1999). The basic methodology proposed by IR researchers for text corpora, a methodology successfully deployed in modern Internet search engines, reduces each document in the corpus to a vector of real numbers, each of which represents ratios of counts. In the popular tf-idf scheme (Salton and McGill, 1983), a basic vocabulary of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word. After suitabl