Paraphrase Acquisition via Crowdsourcing and Machine Learning

STEVEN BURROWS, MARTIN POTTHAST, and BENNO STEIN
Web Technology and Information Systems, Bauhaus-Universität Weimar

To paraphrase means to rewrite content whilst preserving the original meaning. Paraphrasing is important in fields such as text reuse in journalism, anonymising work, and improving the quality of customer-written reviews. This paper contributes to paraphrase acquisition and focuses on two aspects that are not addressed by current research: (1) acquisition via crowdsourcing, and (2) acquisition of passage-level samples. The challenge of the first aspect is automatic quality assurance; without such a means the crowdsourcing paradigm is not effective, and without crowdsourcing the creation of test corpora is unacceptably expensive for realistic orders of magnitude. The second aspect addresses the deficit that most of the previous work in generating and evaluating paraphrases has been conducted using sentence-level paraphrases or shorter; these short-sample analyses are limited in terms of application to plagiarism detection, for example. We present the Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11), which recently formed part of the PAN 2010 international plagiarism detection competition. This corpus comprises passage-level paraphrases with 4067 positive samples and 3792 negative samples that failed our criteria, using Amazon’s Mechanical Turk for crowdsourcing. In this paper, we review the lessons learned at PAN 2010, and explain in detail the method used to construct the corpus. The empirical contributions include machine learning experiments to explore if passage-level paraphrases can be identified in a two-class classification problem using paraphrase similarity features, and we find that a k-nearest-neighbor classifier can correctly distinguish between paraphrased and non-paraphrased samples with 0.980 precision at 0.523 recall. This result implies that just under half of our samples must be discarded (remaining 0.477 fraction), but our cost analysis shows that the automation we introduce results in an 18% financial saving and over 100 hours of time returned to the researchers when repeating a similar corpus design. On the other hand, when building an unrelated corpus requiring, say, 25% training data for the auto
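
The relationship between the reported precision, recall, and discarded fraction can be illustrated with a minimal sketch. The confusion counts below are hypothetical and chosen only so that the ratios match the figures quoted in the abstract; they are not taken from the paper. The function names are likewise illustrative. The sketch assumes the discarded fraction refers to the true paraphrases that the classifier fails to accept, that is, 1 minus recall.

```python
def precision(tp: int, fp: int) -> float:
    """Share of automatically accepted samples that are genuine paraphrases."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of genuine paraphrases that the classifier accepts."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def discarded_fraction(tp: int, fn: int) -> float:
    """Share of genuine paraphrases lost when only accepted samples are kept
    (assumed here to be 1 - recall)."""
    return 1.0 - recall(tp, fn)


# Hypothetical counts (NOT from the paper), picked to reproduce the ratios.
TP, FP, FN = 1046, 21, 954

if __name__ == "__main__":
    print(f"precision = {precision(TP, FP):.3f}")
    print(f"recall    = {recall(TP, FN):.3f}")
    print(f"discarded = {discarded_fraction(TP, FN):.3f}")
```

A classifier tuned this way trades coverage for reliability: almost every accepted sample is a genuine paraphrase, at the cost of rejecting close to half of the genuine ones.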