Study on Web Structural Similarities Based on HTML Tree

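Cite this article:	Song Mingqiu, Zhang Ruixue. Study on Web Structural Similarities Based on HTML Tree [J]. Journal of the China Society for Scientific and Technical Information, 2011, 30(2).

Authors:	Song Mingqiu, Zhang Ruixue

Institution:	Institute of System Engineering, Dalian University of Technology, Dalian 116023

Funding:	National Natural Science Foundation of China (Grant No. 70671016)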

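Abstract:	HTML web pages are a form of semi-structured data, and different pages share a degree of similarity in their structural features. This paper studies the similarity between information blocks of different web pages from the standpoint of information structure. It proposes a structural similarity measure based on an optimal free matching rule for subtrees, together with a method for extracting web information using the structural similarity of pages. All computational methods in the paper are implemented in Python. The similarity between different web pages was computed and analyzed experimentally. The experimental data indicate that the tree structural similarity measure based on the optimal free matching rule for subtrees is reasonably systematic and applicable. Using tree structural similarity to determine the relationships among elements within a page and between two pages also compensates for the reliance of traditional methods on plain text comparison alone, making web information extraction more accurate and faster.
Keywords:	HTML tree; structural similarity; free matching; information extraction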
Study on Web Structural Similarities Based on HTML Tree

Song Mingqiu, Zhang Ruixue. Study on Web Structural Similarities Based on HTML Tree [J]. Journal of the China Society for Scientific and Technical Information, 2011, 30(2).

Authors:	Song Mingqiu, Zhang Ruixue

Institution:	Song Mingqiu and Zhang Ruixue (Institute of System Engineering, Dalian University of Technology, Dalian 116023)

Abstract:	HTML web information is a kind of semi-structured data, and different web pages always have some degree of similarity in structure. From the perspective of information structure, this paper studies the similarity between two different blocks of web information, and proposes a new model for calculating structural similarity based on optimal free matching of subtrees, together with a method for extracting web information using structural similarity. All algorithms in this paper are implemented in Python. We have calc...

Keywords:	HTML tree; structural similarity; free matching; information extraction
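
The abstract names the technique (structural similarity via optimal free matching of subtrees) but this record does not give the paper's formula. The following is therefore only a minimal sketch under stated assumptions, not the authors' implementation: two nodes score 0 when their tags differ, otherwise 1 plus the best one-to-one, order-free pairing of their child subtrees, normalized by the larger child count. The names `Node`, `TreeBuilder`, `best_matching`, `similarity`, and `page_similarity` are hypothetical, and only the Python standard library is used.

```python
from functools import lru_cache
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree; text and attributes are ignored."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag-only tree from reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")  # synthetic root holding top-level elements
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def best_matching(a_children, b_children):
    """Maximum total similarity over one-to-one, order-free child pairings.

    Assumed scoring rule. Uses a bitmask search that is exponential in the
    smaller child count, so it only suits nodes with few children.
    """
    # Keep the shorter list as the one tracked by the bitmask.
    if len(a_children) < len(b_children):
        a_children, b_children = b_children, a_children
    scores = [[similarity(a, b) for b in b_children] for a in a_children]

    @lru_cache(maxsize=None)
    def solve(i, used):
        if i == len(a_children):
            return 0.0
        best = solve(i + 1, used)  # leave child i unmatched
        for j in range(len(b_children)):
            if not used & (1 << j):
                best = max(best, scores[i][j] + solve(i + 1, used | (1 << j)))
        return best

    return solve(0, 0)


def similarity(a, b):
    """Structural similarity in [0, 1]; 1.0 means identical tag trees."""
    if a.tag != b.tag:
        return 0.0
    widest = max(len(a.children), len(b.children))
    return (1.0 + best_matching(a.children, b.children)) / (1.0 + widest)


def page_similarity(html_a, html_b):
    """Similarity of two HTML fragments, ignoring the synthetic root node."""
    a, b = parse(html_a), parse(html_b)
    widest = max(len(a.children), len(b.children))
    return best_matching(a.children, b.children) / widest if widest else 1.0


if __name__ == "__main__":
    page_a = "<div><p></p><ul><li></li><li></li></ul></div>"
    page_b = "<div><ul><li></li><li></li></ul><p></p></div>"  # children reordered
    print(page_similarity(page_a, page_b))  # prints 1.0: order does not matter
```

Because the matching is order-free, reordering sibling blocks leaves the score unchanged, which is the property that distinguishes it from an ordered tree comparison. For pages with many siblings per node, the bitmask search could be swapped for a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment`.
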
This article is indexed in databases including CNKI and Wanfang Data.
