Organizing Encyclopedic Knowledge based on the Web and its Application to Question Answering

Atsushi Fujii
University of Library and Information Science
1-2 Kasuga, Tsukuba 305-8550, Japan
CREST, Japan Science and Technology Corporation
fujii@ulis.ac.jp

Tetsuya Ishikawa
University of Library and Information Science
1-2 Kasuga, Tsukuba 305-8550, Japan
ishikawa@ulis.ac.jp

Abstract

We propose a method to generate large-scale encyclopedic knowledge, which is valuable for much NLP research, based on the Web. We first search the Web for pages containing a term in question. Then we use linguistic patterns and HTML structures to extract text fragments describing the term. Finally, we organize extracted term descriptions based on word senses and domains. In addition, we apply an automatically generated encyclopedia to a question answering system targeting the Japanese Information-Technology Engineers Examination.

1 Introduction

Reflecting the growth in utilization of the World Wide Web, a number of Web-based language processing methods have been proposed within the natural language processing (NLP), information retrieval (IR) and artificial intelligence (AI) communities. A sample of these includes methods to extract linguistic resources (Fujii and Ishikawa, 2000; Resnik, 1999; Soderland, 1997), retrieve useful information in response to user queries (Etzioni, 1997; McCallum et al., 1999) and mine/discover knowledge latent in the Web (Inokuchi et al., 1999).

In this paper, mainly from an NLP point of view, we explore a method to produce linguistic resources. Specifically, we enhance the method proposed by Fujii and Ishikawa (2000), which extracts encyclopedic knowledge (i.e., term descriptions) from the Web.

In brief, their method searches the Web for pages containing a term in question, and uses linguistic expressions and HTML layouts to extract fragments describing the term. They also use a language model to discard non-linguistic fragments. In addition, a clustering method is used to divide descriptions into a specific number of groups.

On the one hand, their method is expected to enhance existing encyclopedias, where vocabulary size is relatively limited, and therefore the quantity problem has been resolved.

On the other hand, encyclopedias extracted from the Web are not comparable with existing ones in terms of quality. In hand-crafted encyclopedias, term descriptions are carefully organized based on domains and word senses, which are especially effective for human usage. However, the output of Fujii's method is simply a set of unorganized term descriptions. Although clustering is optionally performed, resultant clusters are not necessarily related to explicit criteria, such as word senses and domains.

To sum up, our belief is that by combining extraction and organization methods, we can enhance both quantity and quality of Web-based encyclopedias.

Motivated by this background, we introduce an organization model to Fujii's method and reformalize the whole framework. In other words, our proposed method is not only extraction but generation of encyclopedic knowledge.

Section 2 explains the overall design of our encyclopedia generation system, and Section 3 elaborates on our organization model. Section 4 then explores a method for applying our resultant encyclopedia to NLP research, specifically, question answering. Section 5 performs a number of experiments to evaluate our methods.

2 System Design

2.1 Overview

Figure 1 depicts the overall design of our system, which generates an encyclopedia for input terms.

Our system, which is currently implemented for Japanese, consists of three modules: “retrieval,” “extraction” and “organization,” among which the organization module is newly introduced in this paper. In principle, the remaining two modules (“retrieval” and
“extraction”) are the same as proposed by Fujii and Ishikawa (2000).

In Figure 1, terms can be submitted either on-line or off-line. A reasonable method is that while the system periodically updates the encyclopedia off-line, terms unindexed in the encyclopedia are dynamically processed in real-time usage. In either case, our system processes input terms one by one.

We briefly explain each module in the following three sections, respectively.

[Figure 1 (diagram) labels: term(s), Web, retrieval, extraction, extraction rules, organization, domain model, description model, encyclopedia]

Figure 1: The overall design of our Web-based encyclopedia generation system.

2.2 Retrieval

The retrieval module searches the Web for pages containing an input term, for which existing Web search engines can be used, and those with broad coverage are desirable.

However, search engines performing query expansion are not always desirable, because they usually retrieve a number of pages which do not contain an input keyword. Since the extraction module (see Section 2.3) analyzes the usage of the input term in retrieved pages, pages not containing the term are of no use for our purpose.

Thus, we use as the retrieval module “Google,” which is one of the major search engines and does not conduct query expansion [1].

[1] http://www.google.com/

2.3 Extraction

In the extraction module, given Web pages containing an input term, newline codes, redundant white spaces and HTML tags that are not used in the following processes are discarded to standardize the page format.

Second, we approximately identify a region describing the term in the page, for which two rules are used.

The first rule is based on Japanese linguistic patterns typically used for term descriptions, such as “X toha Y dearu (X is Y).” Following the method proposed by Fujii and Ishikawa (2000), we semi-automatically produced 20 patterns based on the Japanese CD-ROM World Encyclopedia (Heibonsha, 1998), which includes approximately 80,000 entries related to various fields. It is expected that a region including the sentence that matched with one of those patterns can be a term description.

The second rule is based on HTML layout. In a typical case, a term in question is highlighted as a heading with tags such as <DT>, <B> and <Hx> (“x” denotes a digit), followed by its description. In some cases, terms are marked with the anchor tag, providing hyperlinks to pages where they are described.

Finally, based on the region briefly identified by the above method, we extract a page fragment as a term description. Since term descriptions usually consist of a logical segment (such as a paragraph) rather than a single sentence, we extract a fragment that matched with one of the following patterns, which are sorted according to preference in descending order:

1. description tagged with <DD> in the case where the term is tagged with <DT> [2],
2. paragraph tagged with <P>,
3. itemization tagged with <UL>,
4. N sentences, where we empirically set N = 3.

[2] <DT> and <DD> are inherently provided to describe terms in HTML.

2.4 Organization

As discussed in Section 1, organizing information extracted from the Web is crucial in our framework. For this purpose, we classify extracted term descriptions based on word senses and domains.

Although a number of methods have been proposed to generate word senses (for example, one based on the vector space model (Schütze, 1998)), it is still difficult to accurately identify word senses without explicit dictionaries that define sense candidates.

In addition, since word senses are often associated with domains (Yarowsky, 1995), word senses can be consequently distinguished by way of determining the domain of each description. For example, different senses for “pipeline (processing method/transportation pipe)” are associated with the computer and construction domains (fields), respectively.

To sum up, the organization module classifies term descriptions based on domains, for which we use domain and description models. In Section 3, we elaborate on our organization model.

3 Statistical Organization Model

3.1 Overview

Given one or more (in most cases more than one) descriptions for a single input term, the organization module selects appropriate description(s) for each domain related to the term.

We do not need all the extracted descriptions as final outputs, because they are usually similar to one another, and thus are redundant.

For the moment, we assume that we know a priori which domains are related to the input term.

From the viewpoint of probability theory, our task here is to select descriptions with greater probability for given domains. The probability for description d given domain c, P(d|c), is commonly transformed as in Equation (1), through use of the Bayesian theorem.

$$P(d|c) = \frac{P(c|d) \cdot P(d)}{P(c)} \tag{1}$$

In practice, P(c) can be omitted because this factor is a constant, and thus does not affect the relative probability for different descriptions.

In Equation (1), P(c|d) models a probability that d corresponds to domain c. P(d) models a probability that d can be a description for the term in question, disregarding the domain. We shall call them domain and description models, respectively.

To sum up, in principle we select d's that are strongly associated with a specific domain, and are likely to be descriptions themselves.

Extracted descriptions are not linguistically understandable in the case where the extraction process is unsuccessful and retrieved pages inherently contain non-linguistic information (such as special characters and e-mail addresses).

To resolve this problem, Fujii and Ishikawa (2000) used a language model to filter out descriptions with low perplexity. However, in this paper we integrated a description model, which is practically the same as a language model, with an organization model. The new framework is more understandable with respect to probability theory.

In practice, we first use Equation (1) to compute P(d|c) for all the c's predefined in the domain model. Then we discard such c's whose P(d|c) is below a specific threshold. As a result, for the input term, related domains and descriptions are simultaneously selected. Thus, we do not have to know a priori which domains are related to each term.

In the following two sections, we explain methods to realize the domain and description models, respectively.

3.2 Domain Model

The domain model quantifies the extent to which description d is associated with domain c, which is fundamentally a categorization task. Among a number of existing categorization methods, we experimentally used one proposed by Iwayama and Tokunaga (1994), which formulates P(c|d) as in Equation (2).

$$P(c|d) = P(c) \cdot \sum_{t} \frac{P(t|c) \cdot P(t|d)}{P(t)} \tag{2}$$

Here, P(t|d), P(t|c) and P(t) denote probabilities that word t appears in d, c and all the domains, respectively. We regard P(c) as a constant. While P(t|d) is simply a relative frequency of t in d, we need predefined domains to compute P(t|c) and P(t). For this purpose, the use of large-scale corpora annotated with domains is desirable.

However, since those resources are prohibitively expensive, we used the “Nova” dictionary for Japanese/English machine translation systems [3], which includes approximately one million entries related to 19 technical fields as listed below:

aeronautics, biotechnology, business, chemistry, computers, construction, defense, ecology, electricity, energy, finance, law, mathematics, mechanics, medicine, metals, oceanography, plants, trade.

[3] Produced by NOVA, Inc.

We extracted words from dictionary entries to estimate P(t|c) and P(t), which are relative frequencies of t in c and all the domains, respectively. We used the ChaSen morphological analyzer (Matsumoto et al., 1997) to extract words from Japanese entries. We also used English entries because Japanese descriptions often contain English words.

It may be argued that statistics extracted from dictionaries are unreliable, because word frequencies in real word usage are missing. However, words that are representative for a domain tend to be frequently used in compound word entries associated with the domain, and thus our method is a practical approximation.

3.3 Description Model

The description model quantifies the extent to which a given page fragment is feasible as a description for the input term. In principle, we decompose the description model into language and quality properties, as shown in Equation (3).

$$P(d) = P_L(d) \cdot P_Q(d) \tag{3}$$

Here, P_L(d) and P_Q(d) denote language and quality models, respectively.
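The ranking of Equation (1) with the domain model of Equation (2) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function names, the toy word lists standing in for the Nova dictionary entries, and the caller-supplied values `p_d` standing in for the description model P(d). The constant P(c) is set to 1, as the text allows.

```python
from collections import Counter

def relative_freq(words):
    """Relative frequency of each word in a list of words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def domain_score(desc_words, domain_words, all_words):
    """Equation (2) with the constant P(c) set to 1:
    the sum over words t of P(t|c) * P(t|d) / P(t)."""
    p_t_d = relative_freq(desc_words)    # P(t|d)
    p_t_c = relative_freq(domain_words)  # P(t|c)
    p_t = relative_freq(all_words)       # P(t)
    # Words unseen in the domain word list contribute nothing.
    return sum(p_t_c[t] * p_td / p_t[t]
               for t, p_td in p_t_d.items()
               if t in p_t_c and t in p_t)

def rank_descriptions(descriptions, domain_words, all_words, p_d):
    """Order description indices by P(c|d) * P(d), i.e. Equation (1)
    without the constant denominator P(c)."""
    scored = [(domain_score(d, domain_words, all_words) * p_d[i], i)
              for i, d in enumerate(descriptions)]
    return [i for _, i in sorted(scored, reverse=True)]

# Hypothetical toy data: two domains and two descriptions of "pipeline".
DOMAINS = {
    "computers": ["pipeline", "processor", "instruction", "cache"],
    "construction": ["pipeline", "pipe", "steel", "oil"],
}
ALL_WORDS = [w for words in DOMAINS.values() for w in words]
D0 = ["pipeline", "processor", "instruction"]  # processing-method sense
D1 = ["pipeline", "pipe", "oil"]               # transportation-pipe sense

best_for_computers = rank_descriptions([D0, D1], DOMAINS["computers"], ALL_WORDS, [1.0, 1.0])
best_for_construction = rank_descriptions([D0, D1], DOMAINS["construction"], ALL_WORDS, [1.0, 1.0])
```

With equal description-model values, the processing-method description ranks first for the computer domain and the transportation-pipe description ranks first for the construction domain, which mirrors the “pipeline” example of Section 2.4.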
It is expected that the quality model discards incorrect or misleading information contained in Web pages. For this purpose, a number of quality rating methods for Web pages (Amento et al., 2000; Zhu and Gauch, 2000) can be used.

However, since Google (i.e., the search engine used in our system) rates the quality of pages based on hyperlink information, and selectively retrieves those with higher quality (Brin and Page, 1998), we tentatively regarded P_Q(d) as a constant. Thus, in practice the description model is approximated solely with the language model as in Equation (4).

$$P(d) \approx P_L(d) \tag{4}$$

Statistical approaches to language modeling have been used in much NLP research, such as machine translation (Brown et al., 1993) and speech recognition (Bahl et al., 1983). Our model is almost the same as existing models, but is different in two respects.

First, while general language models quantify the extent to which a given word sequence is linguistically acceptable, our model also quantifies the extent to which the input is acceptable as a term description. Thus, we trained the model based on an existing machine readable encyclopedia.

We used the ChaSen morphological analyzer to segment the Japanese CD-ROM World Encyclopedia (Heibonsha, 1998) into words (we replaced headwords with a common symbol), and then used the CMU-Cambridge toolkit (Clarkson and Rosenfeld, 1997) to model a word-based trigram.

Consequently, descriptions in which word sequences are more similar to those in the World Encyclopedia are assigned greater probability scores through our language model.

Second, P(d), which is a product of probabilities for N-grams in d, is quite sensitive to the length of d. In the cases of machine translation and speech recognition, this problem is less crucial because multiple candidates compared based on the language model are almost equivalent in terms of length.

However, since in our case the lengths of descriptions are significantly different, shorter descriptions are more likely to be selected, regardless of the quality. To avoid this problem, we normalize P(d) by the number of words contained in d.

4 Application

4.1 Overview

Encyclopedias generated through our Web-based method can be used in a number of applications, including human usage, thesaurus production (Hearst, 1992; Nakamura and Nagao, 1988) and natural language understanding in general.

Among the above applications, natural language understanding (NLU) is the most challenging from a scientific point of view. Current practical NLU research includes dialogue, information extraction and question answering, among which we focus solely on question answering (QA) in this paper.

A straightforward application is to answer interrogative questions like “What is X?” in which a QA system searches the encyclopedia database for one or more descriptions related to X (this application is also effective for dialog systems).

In general, the performance of QA systems is evaluated based on coverage and accuracy. Coverage is the ratio between the number of questions answered (disregarding their correctness) and the total number of questions. Accuracy is the ratio between the number of correct answers and the total number of answers made by the system.

While coverage can be estimated objectively and systematically, estimating accuracy relies on human subjects (because there is no absolute description for term X), and thus is expensive.

In view of this problem, we targeted Information Technology Engineers Examinations [4], which are biannual (spring and autumn) examinations necessary for candidates to qualify to be IT engineers in Japan.

[4] Japan Information-Technology Engineers Examination Center. http://www.jitec.jipdec.or.jp/

Among a number of classes, we focused on the “Class II” examination, which requires fundamental and general knowledge related to information technology. Approximately half of questions are associated with IT technical terms.

Since past examinations and answers are open to the public, we can evaluate the performance of our QA system with minimal cost.

4.2 Analyzing IT Engineers Examinations

The Class II examination consists of quadruple-choice questions, among which technical term questions can be subdivided into two types.

In the first type of question, examinees choose the most appropriate description for a given technical term, such as “memory interleave” and “router.”

In the second type of question, examinees choose the most appropriate term for a given question, for which we show examples collected from the examination in the autumn of 1999 (translated into English by one of the authors) as follows:

1. Which data structure is most appropriate for FIFO (First-In First-Out)?
   a) binary trees, b) queues, c) stacks, d) heaps

2. Choose the LAN access method in which multiple terminals transmit data simultaneously and thus they potentially collide.
   a) ATM, b) CSMA/CD, c) FDDI, d) token ring

In the autumn of 1999, out of 80 questions, the number of the first and second types were 22 and 18, respectively.

4.3 Implementing a QA system

For the first type of question, human examinees would search their knowledge base (i.e., memory) for the description of a given term, and compare that description with four candidates. Then they would choose the candidate that is most similar to the description.

For the second type of question, human examinees would search their knowledge base for the description of each of four candidate terms. Then they would choose the candidate term whose description is most similar to the question description.

The mechanism of our QA system is analogous to the above human methods. However, unlike human examinees, our system uses an encyclopedia generated from the Web as a knowledge base.

In addition, our system selectively uses term descriptions categorized into domains related to information technology. In other words, the description of “pipeline (transportation pipe)” is irrelevant or misleading to answer questions associated with “pipeline (processing method).”

To compute the similarity between two descriptions, we used techniques developed in IR research, in which the similarity between a user query and each document in a collection is usually quantified based on word frequencies. In our case, a question and four possible answers correspond to query and document collection, respectively. We used a probabilistic method (Robertson and Walker, 1994), which is one of the major IR methods.

To sum up, given a question, its type and four choices, our QA system chooses one of four candidates as the answer, in which the resolution algorithm varies depending on the question type.

4.4 Related Work

Motivated partially by the TREC-8 QA collection (Voorhees and Tice, 2000), question answering has of late become one of the major topics within the NLP/IR communities.

In fact, a number of QA systems targeting the TREC QA collection have recently been proposed (Harabagiu et al., 2000; Moldovan and Harabagiu, 2000; Prager et al., 2000). Those systems are commonly termed “open-domain” systems, because questions expressed in natural language are not necessarily limited to explicit axes, including who, what, when, where, how and why.

However, Moldovan and Harabagiu (2000) found that each of the TREC questions can be recast as either a single axis or a combination of axes. They also found that out of the 200 TREC questions, 64 questions (approximately one third) were associated with the what axis, for which the Web-based encyclopedia is expected to improve the quality of answers.

Although Harabagiu et al. (2000) proposed a knowledge-based QA system, most existing systems rely on conventional IR and shallow NLP methods. The use of encyclopedic knowledge for QA systems, as we demonstrated, needs to be further explored.

5 Experimentation

5.1 Methodology

We conducted a number of experiments to investigate the effectiveness of our methods.

First, we generated an encyclopedia by way of our Web-based method (see Sections 2 and 3), and evaluated the quality of the encyclopedia itself.

Second, we applied the generated encyclopedia to our QA system (see Section 4), and evaluated its performance. The second experiment can be seen as a task-oriented evaluation for our encyclopedia generation method.

In the first experiment, we collected 96 terms from technical term questions in the Class II examination (the autumn of 1999). We used as test inputs those 96 terms and generated an encyclopedia, which was used in the second experiment.

For all the 96 test terms, Google (see Section 2.2) retrieved a positive number of pages, and the average number of pages for one term was 196,503. Since Google practically outputs contents of the top 1,000 pages, the remaining pages were not used in our experiments.

In the following two sections, we explain the first and second experiments, respectively.

5.2 Evaluating Encyclopedia Generation

For each test term, our method first computed P(d|c) using Equation (1) and discarded domains whose P(d|c) was below 0.05. Then, for each remaining domain, descriptions with higher P(d|c) were selected as the final outputs.

We selected the top three (not one) descriptions for each domain, because reading a couple of descriptions, which are short paragraphs, is not laborious for human users in real-world usage. As a result, at least one description was generated for 85 test terms, disregarding the correctness. The number of resultant descriptions was 326 (3.8 per term). We analyzed those descriptions from different perspectives.

First, we analyzed the distribution of the Google ranks for the Web pages from which the top three descriptions were eventually retained. Figure 2 shows the result, where we have combined the pages in groups of 50, so that the leftmost bar, for example, denotes the number of used pages whose original Google ranks ranged from 1 to 50.

Although the first group includes the largest number of pages, other groups are also related to a relatively large number of pages. In other words, our method exploited a number of low ranking pages, which are not browsed or utilized by most Web users.

[Figure 2 (bar chart): x-axis: ranking (0 to 1000); y-axis: # of pages (0 to 70)]

Figure 2: Distribution of rankings for original pages in Google.

Second, we analyzed the distribution of domains assigned to the 326 resultant descriptions. Figure 3 shows the result, in which, as expected, most descriptions were associated with the computer domain.

However, the law domain was unexpectedly associated with a relatively great number of descriptions. We manually analyzed the resultant descriptions and found that descriptions for which appropriate domains are not defined in our domain model, such as sports, tended to be categorized into the law domain.

computers (200), law (41), electricity (28), plants (15), medicine (10), finance (8), mathematics (8), mechanics (5), biotechnology (4), construction (2), ecology (2), chemistry (1), energy (1), oceanography (1)

Figure 3: Distribution of domains related to the 326 resultant descriptions.

Third, we evaluated the accuracy of our method, that is, the quality of an encyclopedia our method generated. For this purpose, each of the resultant descriptions was judged as to whether or not it is a correct description for a term in question. Each domain assigned to descriptions was also judged correct or incorrect.

We analyzed the result on a description-by-description basis, that is, all the generated descriptions were considered independent of one another. The ratio of correct descriptions, disregarding the domain correctness, was 58.0% (189/326), and the ratio of correct descriptions categorized into the correct domain was 47.9% (156/326).

However, since all the test terms are inherently related to the IT field, we focused solely on descriptions categorized into the computer domain. In this case, the ratio of correct descriptions, disregarding the domain correctness, was 62.0% (124/200), and the ratio of correct descriptions categorized into the correct domain was 61.5% (123/200).

In addition, we analyzed the result on a term-by-term basis, because reading only a couple of descriptions is not crucial. In other words, we evaluated each term (not description), and in the case where at least one correct description categorized into the correct domain was generated for a term in question, we judged it correct. The ratio of correct terms was 89.4% (76/85), and in the case where we focused solely on the computer domain, the ratio was 84.8% (67/79).

In other words, by reading a couple of descriptions (3.8 descriptions per term), human users can obtain knowledge of approximately 90% of input terms.

Finally, we compared the resultant descriptions with an existing dictionary. For this purpose, we used the “Nichigai” computer dictionary (Nichigai Associates, 1996), which lists approximately 30,000 Japanese technical terms related to the computer field, and contains descriptions for 13,588 terms. In the Nichigai dictionary, 42 out of the 96 test terms were described. Our method, which generated correct descriptions associated with the computer domain for 67 input terms, enhanced the Nichigai dictionary in terms of quantity.

These results indicate that our method for generating encyclopedias is of operational quality.

5.3 Evaluating Question Answering

We used as test inputs 40 questions, which are related to technical terms collected from the Class II examination in the autumn of 1999.

The objective here is not only to evaluate the performance of our QA system itself, but also to evaluate the quality of the encyclopedia generated by our method. Thus, as performed in the first experiment (Section 5.2), we used the Nichigai computer dictionary as a baseline encyclopedia. We compared the following three different resources as a knowledge base:

• the Nichigai dictionary (“Nichigai”),
• the descriptions generated in the first experiment (“Web”),
• combination of both resources (“Nichigai + Web”).
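The answer-selection procedure of Section 4.3 and the coverage and accuracy measures of Section 4.1, on which the comparison of these knowledge bases relies, can be sketched as follows. This is an illustration under stated assumptions, not the system described in the paper: cosine similarity over raw word counts stands in for the probabilistic IR method of Robertson and Walker (1994), whitespace splitting stands in for morphological analysis, and the function names and the toy knowledge base are hypothetical.

```python
import math
from collections import Counter

def similarity(a, b):
    """Cosine similarity over word counts (a simple stand-in measure)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def answer_type1(term, candidates, kb):
    """Type 1: pick the candidate description most similar to a stored
    description of the term. Returns None if the term is not covered."""
    descs = kb.get(term, [])
    if not descs:
        return None
    scores = [max(similarity(d, c) for d in descs) for c in candidates]
    return scores.index(max(scores))

def answer_type2(question, candidate_terms, kb):
    """Type 2: pick the candidate term whose stored description is most
    similar to the question. Returns None if nothing matches at all."""
    scores = [max((similarity(d, question) for d in kb.get(t, [])), default=0.0)
              for t in candidate_terms]
    return scores.index(max(scores)) if max(scores) > 0.0 else None

def coverage_accuracy(answers, gold):
    """Coverage: answered / all questions. Accuracy: correct / answered."""
    answered = [(a, g) for a, g in zip(answers, gold) if a is not None]
    coverage = len(answered) / len(gold)
    accuracy = sum(a == g for a, g in answered) / len(answered) if answered else 0.0
    return coverage, accuracy

# Hypothetical toy knowledge base (term -> list of descriptions).
KB = {
    "router": ["a device that forwards packets between networks"],
    "queue": ["a data structure where the first item added is removed first"],
}
CANDIDATES = [
    "a structure holding items in arrival order",
    "equipment that forwards packets between networks",
    "a program that translates source code",
]
type1_choice = answer_type1("router", CANDIDATES, KB)
type2_choice = answer_type2("Which device forwards packets between networks?",
                            ["queue", "router"], KB)
```

A term missing from the knowledge base yields `None`, which lowers coverage without affecting accuracy, matching the two definitions in Section 4.1.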
Table 1 shows the result of our comparative experiment, in which “C” and “A” denote coverage and accuracy, respectively, for variations of our QA system.

Since all the questions we used are quadruple-choice, in case the system cannot answer the question, random choice can be performed to improve the coverage to 100%. Thus, for each knowledge resource we compared cases without/with random choice, which are denoted “w/o Random” and “w/ Random” in Table 1, respectively.

Table 1: Coverage and accuracy (%) for different question answering methods.

                    w/o Random       w/ Random
Resource            C       A        C      A
Nichigai            50.0    65.0     100    45.0
Web                 92.5    48.6     100    46.9
Nichigai + Web      95.0    63.2     100    61.3

In the case where random choice was not performed, the Web-based encyclopedia noticeably improved the coverage for the Nichigai dictionary, but decreased the accuracy. However, by combining both resources, the accuracy was noticeably improved, and the coverage was comparable with that for the Nichigai dictionary.

On the other hand, in the case where random choice was performed, the Nichigai dictionary and the Web-based encyclopedia were comparable in terms of both the coverage and accuracy. Additionally, by combining both resources, the accuracy was further improved.

We also investigated the performance of our QA system where descriptions related to the computer domain are solely used. However, coverage/accuracy did not significantly change, because as shown in Figure 3, most of the descriptions were inherently related to the computer domain.

6 Conclusion

The World Wide Web has been an unprecedentedly enormous information source, from which a number of language processing methods have been explored to extract, retrieve and discover various types of information.

In this paper, we aimed at generating encyclopedic knowledge, which is valuable for many applications including human usage and natural language understanding. For this purpose, we reformalized an existing Web-based extraction method, and proposed a new statistical organization model to improve the quality of extracted data.

Given a term for which encyclopedic knowledge (i.e., descriptions) is to be generated, our method sequentially performs a) retrieval of Web pages containing the term, b) extraction of page fragments describing the term, and c) organizing extracted descriptions based on domains (and consequently word senses).

In addition, we proposed a question answering system, which answers interrogative questions associated with what, by using a Web-based encyclopedia as a knowledge base. For the purpose of evaluation, we used as test inputs technical terms collected from the Class II IT engineers examination, and found that the encyclopedia generated through our method was of operational quality and quantity.

We also used test questions from the Class II examination, and evaluated the Web-based encyclopedia in terms of question answering. We found that our Web-based encyclopedia improved the system coverage obtained solely with an existing dictionary. In addition, when we used both resources, the performance was further improved.

Future work would include generating information associated with more complex interrogations, such as ones related to how and why, so as to enhance Web-based natural language understanding.

Acknowledgments

The authors would like to thank NOVA, Inc. for their support with the Nova dictionary and Katunobu Itou (The National Institute of Advanced Industrial Science and Technology, Japan) for his insightful comments on this paper.

References

Brian Amento, Loren Terveen, and Will Hill. 2000. Does “authority” mean quality? Predicting expert quality ratings of Web documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 296–303.

Lalit R. Bahl, Frederick Jelinek, and Robert L. Mercer. 1983. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2):179–190.

Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):107–117.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Philip Clarkson and Ronald Rosenfeld. 1997. Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings of EuroSpeech'97, pages 2707–2710.

Oren Etzioni. 1997. Moving up the information food chain. AI Magazine, 18(2):11–18.

Atsushi Fujii and Tetsuya Ishikawa. 2000. Utilizing the World Wide Web as an encyclopedia: Extracting term descriptions from semi-structured texts. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 488–495.

Sanda M. Harabagiu, Marius A. Paşca, and Steven J. Maiorano. 2000. Experiments with open-domain textual question answering. In Proceedings of the 18th International Conference on Computational Linguistics, pages 292–298.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics, pages 539–545.

Hitachi Digital Heibonsha. 1998. CD-ROM World Encyclopedia. (In Japanese).

Akihiro Inokuchi, Takashi Washio, Hiroshi Motoda, Kouhei Kumasawa, and Naohide Arai. 1999. Basket analysis for graph structured data. In Proceedings of the 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 420–431.

Makoto Iwayama and Takenobu Tokunaga. 1994. A probabilistic model for text categorization: Based on a single random variable with multiple values. In Proceedings of the 4th Conference on Applied Natural Language Processing, pages 162–167.

Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, Yoshitaka Hirano, Osamu Imaichi, and Tomoaki Imamura. 1997. Japanese morphological analysis system ChaSen manual. Technical Report NAIST-IS-TR97007, NAIST. (In Japanese).

Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. 1999. A machine learning approach to building domain-specific search engines. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pages 662–667.

Dan Moldovan and Sanda Harabagiu. 2000. The structure and performance of an open-domain question answering system. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 563–570.

Jun'ichi Nakamura and Makoto Nagao. 1988. Extraction of semantic information from an ordinary English dictionary and its evaluation. In Proceedings of the 10th International Conference on Computational Linguistics, pages 459–464.

Nichigai Associates. 1996. English-Japanese computer terminology dictionary. (In Japanese).

John Prager, Eric Brown, and Anni Coden. 2000. Question-answering by predictive annotation. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 184–191.

Philip Resnik. 1999. Mining the Web for bilingual texts. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 527–534.

S. E. Robertson and S. Walker. 1994. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 232–241.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123.

Stephen Soderland. 1997. Learning to extract text-based information from the World Wide Web. In Proceedings of 3rd International Conference on Knowledge Discovery and Data Mining.

Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200–207.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196.

Xiaolan Zhu and Susan Gauch. 2000. Incorporating quality metrics in centralized/distributed information retrieval on the World Wide Web. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 288–295.
