Machine Learning, , 1-34 ()
© Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Text Classification from Labeled and Unlabeled Documents using EM

KAMAL NIGAM† knigam@cs.cmu.edu
ANDREW KACHITES MCCALLUM‡† mccallum@justresearch.com
SEBASTIAN THRUN† thrun@cs.cmu.edu
TOM MITCHELL† tom.mitchell@cmu.edu

†School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
‡Just Research, 4616 Henry Street, Pittsburgh, PA 15213

Received March 15, 1998; Revised February 20, 1999

Editor: William W. Cohen

Abstract. This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.

Keywords: text classification, Expectation-Maximization, integrating supervised and unsupervised learning, combining labeled and unlabeled data, Bayesian learning

1. Introduction

Consider the problem of automatically classifying text documents. This problem is of great practical importance given the massive volume of online text available through the World Wide Web, Internet news feeds, electronic mail, corporate databases, medical patient records a