ISSN 1000-9825, CODEN RUXUEW
E-mail: jos@iscas.ac.cn
Journal of Software, Vol.19, No.2, February 2008, pp.267−274
http://www.jos.org.cn
DOI: 10.3724/SP.J.1001.2008.00267
Tel/Fax: +86-10-62562563
© 2008 by Journal of Software. All rights reserved.

Classification of Deep Web Databases Based on the Context of Web Pages∗

MA Jun+, SONG Ling, HAN Xiao-Hui, YAN Po

(School of Computer Science and Technology, Shandong University, Ji’nan 250101, China)

+ Corresponding author: Phn: +86-531-88391528, Fax: +86-531-88392498, E-mail: majun@sdu.edu.cn, http://ir.sdu.edu.cn

Ma J, Song L, Han XH, Yan P. Classification of deep Web databases based on the context of Web pages. Journal of Software, 2008,19(2):267−274. http://www.jos.org.cn/1000-9825/19/267.htm

Abstract: New techniques are discussed for enhancing the classification precision of deep Web databases. They include using the content text of the HTML pages that contain the database entry forms as context, and a unification process for the database attribute labels. An algorithm for locating the content text in HTML pages is developed based on multiple statistical characteristics of the text blocks in those pages. The unification process replaces attribute labels that are semantically close with a representative label (a delegate). The domain and language knowledge found in the learning samples is represented as hierarchical fuzzy sets, and an algorithm for the unification process is proposed based on this representation. Building on this pre-computing, a k-NN (k nearest neighbors) algorithm is given for deep Web database classification, in which the semantic distance between two databases is calculated from both the distance between the content texts of the HTML pages and the distance between the database forms embedded in the pages. Classification experiments are carried out to compare the algorithm with pre-computing against the one without pre-computing in terms of classification precision, recall and F1 values.

Key words: deep Web; hidden Web; database classification; content text extraction; semantic classification

Abstract (Chinese version): Several new techniques for improving the classification accuracy of Deep Web databases are discussed, including using the content text of HTML pages as the context for understanding database content, and a process that unifies the attribute labels of database tables. 其中对网页中的内容文本的发