Automatically Extracting Structured Data from Web Pages of Similar Structure or Layout


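Cite this article:	ZHAO Jing, WANG Qiao-wen, GUAN Ma-zhou, SHAN Chuan-jia. Automatically Extracting Structured Data from Web Pages of Similar Structure or Layout[J]. Journal of Anhui Science and Technology University, 2010, 24(6). DOI: 10.3969/j.issn.1673-8772.2010.06.009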

Authors:	ZHAO Jing, WANG Qiao-wen, GUAN Ma-zhou, SHAN Chuan-jia

Affiliation:	Science College, Anhui Science and Technology University, Fengyang, Anhui 233100 (all four authors)

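Funding:	Anhui Science and Technology University Talent Introduction Fund (ZRC2008191); Key Natural Science Project Fund of the Anhui Provincial Department of Education (KJ2008A112); Anhui Science and Technology University Undergraduate Innovation Project Fund (10XSZ58)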

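Abstract:	The Web pages that database-driven Web sites generate in response to queries are extremely similar in structure and layout. Existing Web extraction methods ignore or overlook this similarity, which considerably limits their extraction efficiency, performance, and generality. This paper proposes a method that automatically learns a template based on tag-tree similarity and then uses the template to extract data from such pages. The algorithm was implemented using Eclipse and the open-source HTML Parser. Experimental results show that the algorithm achieves fast extraction speed and good accuracy.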
Keywords:	Deep Web; tag tree-similarity model; structured information extraction
Automatically Extracting Structured Data from Web Pages of Similar Structure or Layout

ZHAO Jing, WANG Qiao-wen, GUAN Ma-zhou, SHAN Chuan-jia. Automatically Extracting Structured Data from Web Pages of Similar Structure or Layout[J]. Journal of Anhui Science and Technology University, 2010, 24(6). DOI: 10.3969/j.issn.1673-8772.2010.06.009

Authors:	ZHAO Jing, WANG Qiao-wen, GUAN Ma-zhou, SHAN Chuan-jia

Affiliation:	ZHAO Jing, WANG Qiao-wen, GUAN Ma-zhou, SHAN Chuan-jia (Science College, Anhui Science and Technology University, Fengyang 233100, China)

Abstract:	Database-driven web sites generate HTML pages with similar structure or layout. Traditional web information extraction methods often neglect or fail to use this similarity directly, so their efficiency and precision are generally poor. To extract structured data from structurally similar web pages, we present a new model, the Tag-Tree similarity model, which extends the concepts and algebra of the traditional embedded set, and a Tag-Tree-based template method for extraction. The result of our experime...

Keywords:	Deep Web; Tag Tree-similarity model; Structural information extraction
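
The abstract names a tag-tree similarity model and a template learned from structurally similar pages, but it does not give the model's definition. The sketch below is therefore only an illustration under stated assumptions, not the authors' algorithm: it stands in a multiset Jaccard score over root-to-node tag paths for the similarity model, and it treats text that differs between two aligned sample pages as a data slot. It uses Python's standard `html.parser` rather than the Java HTML Parser library mentioned in the abstract, and every name in it (`TagPathCollector`, `similarity`, `learn_template`, `extract`) is hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Flattens a page's tag tree into root-to-node tag paths."""

    def __init__(self):
        super().__init__()
        self.stack = []         # currently open tags, root first
        self.paths = Counter()  # multiset of tag paths (structure only)
        self.texts = []         # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:    # void elements never get pushed
            self.paths["/".join(self.stack + [tag])] += 1
            return
        self.stack.append(tag)
        self.paths["/".join(self.stack)] += 1

    def handle_endtag(self, tag):
        # Pop up to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(("/".join(self.stack), text))


def parse(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector


def similarity(a, b):
    """Assumed stand-in measure: multiset Jaccard over tag paths, in [0, 1]."""
    union = sum((a.paths | b.paths).values())
    return sum((a.paths & b.paths).values()) / union if union else 1.0


def learn_template(a, b):
    """Keep text shared by both pages as template; mark differing text as None."""
    if [p for p, _ in a.texts] != [p for p, _ in b.texts]:
        raise ValueError("pages are not structurally aligned")
    return [(pa, ta if ta == tb else None)
            for (pa, ta), (_, tb) in zip(a.texts, b.texts)]


def extract(template, page):
    """Return the text found in the template's data slots, in document order."""
    return [text for (_, fixed), (_, text) in zip(template, page.texts)
            if fixed is None]
```

In this simplified form, two sample pages whose similarity is high would be passed to `learn_template`, and the resulting template applied to further pages with `extract`. A value that happens to be identical in both sample pages would be mistaken for template text, so a practical version would learn from more than two pages.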
This article is indexed in CNKI, Wanfang Data, and other databases.
