Zhejiang University, College of Computer Science and Technology
Master's Thesis

Research and Implementation of Web Information Extraction Technology Based on Structural and Visual Features

Name: Zhu Kai (朱凯)
Degree applied for: Master
Major: Computer Application Technology
Advisor: Chen Gang (陈刚)
Date: 2008-05-14

Abstract

The web is perhaps the single largest data source in the world, and more and more organizations release their data through the Internet. Vertical search engines, also known as domain-specific search engines, send out spiders to crawl a particular type of information from websites into a refined database, then make that information available for users to query after integration and post-processing. Web information extraction technology is the foundation of a vertical search engine, and it is also the kernel module of the search engine's back end. Developing an extraction system by hand may be simple, but it has well-known shortcomings: such systems are difficult to maintain, because websites change constantly in order to survive, and each new data source needs one more program, which wastes resources.

This thesis presents a vision-based technique for extracting structured information from web pages, which makes use of both the structural information of HTML pages and their visual information. It consists of two steps: (1) identifying individual data records in a page, and (2) aligning and extracting data items from the identified data records. In the first step, visual information helps filter out most of the noise in the web page, which both accelerates the HTML-structure-based algorithm and makes it more accurate. In the second step, an improved tree alignment algorithm, which is efficient and robust, is used to align attributes. When multiple trees are aligned, introducing a seed tree reduces the computation required and so improves performance on large web pages. The experiments show that the extraction method is highly automated, needing almost no manual intervention, and that it is also efficient and accurate.

Keywords: vertical search, information extraction, vision-based

List of Tables

Table 2.1 Comparison of tools
Table 6.1 Experimental results
Table 6.2 Cross-comparison results

List of Figures

Figure 1.1 Architecture of a vertical search engine
Figure 2.1 Web information extraction model
Figure 2.2 Formal representation
Figure 2.3 Data format of source website 1
Figure 2.4 Data format of source website 2
Figure 2.5 互联…
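
The abstract's second step, aligning data records represented as trees, can be illustrated with a small sketch. The code below is not the thesis's improved algorithm, which the abstract does not specify; it is the classic Simple Tree Matching recurrence, shown only as an assumed baseline for what "tree alignment" computes. The `Node` class, the `similarity` normalization, and the `pick_seed` rule (take the largest record as the seed tree) are all hypothetical names and choices introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of a maximum order-preserving matching between two trees."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them is matched
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best score using the first i subtrees of a and first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 counts the matched roots


def size(t: Node) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in t.children)


def similarity(a: Node, b: Node) -> float:
    """Matching score normalized by the average tree size (an assumed choice)."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


def pick_seed(records: List[Node]) -> Node:
    """Assumed seed rule: start alignment from the record with the most nodes."""
    return max(records, key=size)


# Two toy data records: table rows with two and three cells.
rec1 = Node("tr", [Node("td", [Node("a")]), Node("td")])
rec2 = Node("tr", [Node("td", [Node("a")]), Node("td"), Node("td")])
print(simple_tree_matching(rec1, rec2), similarity(rec1, rec2))
print(pick_seed([rec1, rec2]) is rec2)
```

Starting from a single seed and aligning each remaining record against it is one plausible reading of how a seed tree reduces work: each record is compared with one growing tree rather than with every other record.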