ID:31853562
Size: 2.35 MB
Pages: 75
Date: 2019-01-21
Analysis of HTML-Based Web Employment Information Extraction Technology
ABSTRACT

With the increasing ubiquity of computers and the Internet, the Web has become an important channel through which people search for information. Because the Web is an enormous data source, retrieving information from it is currently one of the hot topics in information research. College enrollment in our country has been expanding every year, which puts considerable pressure on student education and employment. We hope to obtain a large amount of employment information from the Internet, which can provide significant guidance for specialty construction and student employment.

Most of this mass of Web data is in the semi-structured HTML format. HTML-based text is not strictly structured and its semantics are unclear, so people cannot find the data they need quickly and accurately. How to obtain these data quickly and accurately is an urgent problem.

This thesis therefore presents a new model, based on HTML structure, for extracting information from Web employment pages. It consists of an HTML structure preprocessing module, a table positioning module, and an information extraction module. First, JTidy is used to clean the Web page code, which is converted into an XML document. Then the DOM tree of the Web information is built by parsing the XML. Finally, through a large number of observations, we derive heuristic rules for locating the genuine table, and the corresponding algorithms are designed and implemented. The thesis also considers layouts in which cells span rows or columns, which breaks the correspondence between each data cell and its attribute; tables are therefore normalized so that every row and every column is aligned with the same number of cells.

Experimental results on multiple Web sites show that the approach can extract employment information from the Web. It performs well and can be applied to extracting employment information from the Web and to further study.

Key Words: Information Extraction, HTML, DOM tree, Web table

Contents
Abstract (Chinese)
ABSTRACT
Chapter 1 Introduction
1.1 Research Background and Significance
1.1.1 Research Background
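
The table normalization step mentioned in the abstract (expanding cells that span rows or columns so that every row has the same number of cells) can be illustrated with a minimal sketch. This is not the thesis's implementation, which is built on JTidy and a DOM tree; it is a Python approximation using the standard library's xml.etree.ElementTree, and it assumes the page has already been cleaned into well-formed XML, that the table contains no nested tables, and that span attributes are plain integers. The function name normalize_table and the sample table (company, title, openings) are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def normalize_table(table):
    """Expand rowspan/colspan so every row has the same number of cells.

    Assumes `table` is a well-formed <table> element with no nested tables.
    Returns a list of rows, each a list of cell texts.
    """
    grid = {}  # (row index, column index) -> cell text
    rows = table.findall(".//tr")
    for r, tr in enumerate(rows):
        c = 0
        for cell in tr:
            if cell.tag not in ("td", "th"):
                continue
            # Skip slots already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            text = "".join(cell.itertext()).strip()
            # Clamp rowspan so a cell never extends past the last row.
            rowspan = min(int(cell.get("rowspan", "1")), len(rows) - r)
            colspan = int(cell.get("colspan", "1"))
            # Copy the cell text into every slot the cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Slots that no cell covers are filled with an empty string.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Hypothetical sample: a header spanning two rows and one spanning two columns.
SAMPLE = """
<table>
  <tr><th rowspan="2">Company</th><th colspan="2">Position</th></tr>
  <tr><th>Title</th><th>Openings</th></tr>
  <tr><td>Acme</td><td>Engineer</td><td>3</td></tr>
</table>
""".strip()

grid = normalize_table(ET.fromstring(SAMPLE))
for row in grid:
    print(row)
```

Duplicating the text of a spanning cell into every slot it covers is one simple way to restore the correspondence between each data cell and its attribute; the thesis may use a different policy.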