Modeling the Internet and the Web
Probabilistic Methods and Algorithms

Pierre Baldi
School of Information and Computer Science, University of California, Irvine, USA

Paolo Frasconi
Department of Systems and Computer Science, University of Florence, Italy

Padhraic Smyth
School of Information and Computer Science, University of California, Irvine, USA

4 Text Analysis

Having focused in earlier chapters on the general structure of the Web, in this chapter we will discuss in some detail techniques for analyzing the textual content of individual Web pages. The techniques presented here have been developed within the fields of information retrieval (IR) and machine learning and include indexing, scoring, and categorization of textual documents.

The focus of IR is that of accessing as efficiently as possible and as accurately as possible a small subset of documents that is maximally related to some user interest. User interest can be expressed, for example, by a query specified by the user. Retrieval includes two separate subproblems: indexing the collection of documents in order to improve the computational efficiency of access, and ranking documents according to some importance criterion in order to improve accuracy.

Categorization or classification of documents is another useful technique, somewhat related to information retrieval, that consists of assigning a document to one or more predefined categories. A classifier can be used, for example, to distinguish between relevant and irrelevant documents (where the relevance can be personalized for a particular user or group of users), or to help in the semiautomatic construction of large Web-based knowledge bases or hierarchical directories of topics like the Open Directory (http://dmoz.org/).

A vast portion of the Web consists of text documents; thus, methods for automatically analyzing text have great importance in the context of the Web. Of course, retrieval and classification methods for text, such as those reviewed in this chapter, can be specialized or modified for other types of Web documents such as images, audio or video (see, for example, Del Bimbo 1999), but our focus in this chapter will be on text.

4.1 Indexing

4.1.1 Basic concepts

In order to retrieve text documents efficiently it is necessary to enrich the collection with specialized data structures that facilitate acc