Pages: 54
Date: 2019-05-15
《主题型网页的信息抽取技术研究》, uploaded and shared by a member in the academic papers section of 天天文库.
RESEARCH ON THE TECHNOLOGY OF EXTRACTING INFORMATION FROM THEME-BASED WEB PAGES

ABSTRACT

With the development of Internet technology, the World Wide Web, as a rising information medium, has influenced all aspects of social activity, including the information economy, culture, education, and entertainment. It has become an important part of daily life. Theme-based pages, one of the most important kinds of web page, include news pages, BBS, and blogs. They form a huge information bank of public opinion and knowledge, especially BBS and blogs, which have attracted public concern. Studying techniques that extract information from theme-based pages is therefore valuable for sociological research, public opinion collection, and data mining.

The main contributions of this paper are as follows:

(1) A method for estimating the image information and the effective image information of a web page is proposed. A novel algorithm for locating the main text of a web page, based on the effective information of both images and text, is also presented. Locating the main text reduces noise, and the experiments show that the method reduces noise more effectively.

(2) To address the shortcomings of earlier algorithms for discovering web page reviews, a new suffix-tree-based review discovery algorithm is proposed. It automatically extracts review content without labeled instances, without computing sub-tree similarity, and without a manually set threshold. The experiments show that the algorithm achieves better accuracy and recall.

KEYWORDS: information extraction; effective information; theme-based page discovery; image and text repeat pattern; web page reviews

Contents

Abstract (Chinese)
Abstract
Chapter 1 Introduction
1.1 Research background
1.2 Current state of research on information extraction algorithms for theme-based web pages
1.2.1 Information extraction algorithms based on natural language
1.2.2 Information extraction algorithms based on machine learning
Chapter 2 Background knowledge and related …
3.3.1 Related definitions
3.3.1 Computing the effective information of web page images
3.3.2 Algorithm description
3.4 Experimental results and analysis
3.5 Chapter summary
Chapter 4 Suffix-tree-based data region discovery and extraction for theme-based web pages
4.1 Introduction
4.2 Related algorithms
4.3 Structural characteristics of theme-based web pages
4.3.1 Visual features
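
The abstract describes review discovery as finding repeated structure in a page without labeled instances or a similarity threshold. The sketch below illustrates only that general idea: it flattens a page into its sequence of opening tags and looks for the repeated pattern that covers the most tags. It is not the thesis's algorithm. A naive substring search stands in for the suffix tree, and the names TagSequence, count_non_overlapping, and best_repeated_pattern, as well as the length-times-count score, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collect opening tag names in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def count_non_overlapping(seq, pattern):
    """Count left-to-right, non-overlapping occurrences of pattern in seq."""
    count, i, n = 0, 0, len(pattern)
    while i + n <= len(seq):
        if tuple(seq[i:i + n]) == pattern:
            count += 1
            i += n  # skip past this match so occurrences do not overlap
        else:
            i += 1
    return count


def best_repeated_pattern(seq, min_len=2):
    """Return (pattern, count) maximizing len(pattern) * count with count >= 2.

    Naive enumeration of all substrings; a suffix tree would find the
    repeated substrings far more efficiently. The score is an assumption
    chosen for this sketch, not taken from the source.
    """
    best, best_score = (None, 0), 0
    seen = set()
    for start in range(len(seq)):
        for end in range(start + min_len, len(seq) + 1):
            pattern = tuple(seq[start:end])
            if pattern in seen:
                continue
            seen.add(pattern)
            count = count_non_overlapping(seq, pattern)
            score = len(pattern) * count
            if count >= 2 and score > best_score:
                best, best_score = (pattern, count), score
    return best


# Toy page: one heading followed by three structurally identical "reviews".
page = (
    "<html><body><h1>Topic</h1>"
    + "<div><span>user</span><p>comment</p></div>" * 3
    + "</body></html>"
)
parser = TagSequence()
parser.feed(page)
pattern, count = best_repeated_pattern(parser.tags)
print(pattern, count)  # ('div', 'span', 'p') 3
```

On the toy page the three review blocks share the tag pattern div, span, p, so that pattern wins. Real pages have optional and nested elements that break exact repetition, which is presumably where a more careful method such as the one in Chapter 4 is needed.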