DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.
Online edition (c) 2009 Cambridge UP

2 The term vocabulary and postings lists

Recall the major steps in inverted index construction:

1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.

In this chapter we first briefly mention how the basic unit of a document can be defined and how the character sequence that it comprises is determined (Section 2.1). We then examine in detail some of the substantive linguistic issues of tokenization and linguistic preprocessing, which determine the vocabulary of terms which a system uses (Section 2.2). Tokenization is the process of chopping character streams into tokens, while linguistic preprocessing then deals with building equivalence classes of tokens which are the set of terms that are indexed. Indexing itself is covered in Chapters 1 and 4. Then we return to the implementation of postings lists. In Section 2.3, we examine an extended postings list data structure that supports faster querying, while Section 2.4 covers building postings data structures suitable for handling phrase and proximity queries, of the sort that commonly appear in both extended Boolean models and on the web.

2.1 Document delineation and character sequence decoding

2.1.1 Obtaining the character sequence in a document

Digital documents that are the input to an indexing process are typically bytes in a file or on a web server. The first step of processing is to convert this byte sequence into a linear sequence of characters. For the case of plain English text in ASCII encoding, this is trivial. But often things get much more complex. The sequence of characters may be encoded by one of various single-byte or multibyte encoding schemes, such as Unicode UTF-8, or various national or vendor-specific standards. We need to determine the correct encoding. This can be regarded as a machine learning classification problem, as discussed in Chapter 13, but is often handled by heuristic methods, user selection, or by using provided document metadata. Once the encoding is determined, we decode the byte sequence to a character sequence. We might save the choice of encoding because it gives some evidence about what language the docum
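The heuristic route to decoding described in Section 2.1.1 can be illustrated with a minimal Python sketch. It is not a method given in the text: the function name `decode_document`, the order of the checks (declared metadata, then byte-order mark, then strict UTF-8), and the final fallback to Latin-1 are all assumptions chosen for illustration. A production system might instead use a trained classifier, as the text notes.

```python
import codecs
from typing import Optional, Tuple

# Byte-order marks, with UTF-32 checked before UTF-16 because the
# UTF-32-LE mark begins with the same two bytes as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_document(raw: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Decode a byte sequence into characters; return (text, encoding_used).

    Hypothetical helper: the heuristics and their order are assumptions.
    """
    # 1. Provided document metadata (for example a charset label), if usable.
    if declared:
        try:
            return raw.decode(declared), declared
        except (LookupError, UnicodeDecodeError):
            pass  # unknown codec name or wrong label: try the other heuristics

    # 2. A byte-order mark at the start of the data.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            try:
                return raw.decode(name), name
            except UnicodeDecodeError:
                break  # the mark was misleading: fall through

    # 3. Strict UTF-8, which also covers plain ASCII.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 4. Assumed last resort: Latin-1 maps every byte to a character,
    #    so this step cannot fail (though the result may be wrong).
    return raw.decode("latin-1"), "latin-1"
```

For example, `decode_document(b"plain ASCII")` returns the text together with `"utf-8"`. The encoding name is returned alongside the text so that it can be stored, since the choice of encoding may later serve as evidence about the document's language.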