
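Design and Implementation of the Topic Information Crawler for Intelligence Acquisition
Cite this article: Gu Jun, Weng Jia, Xu Xin. Design and Implementation of the Topic Information Crawler for Intelligence Acquisition[J]. Library and Information Service (图书情报工作), 2014, 58(20): 91-99.
Authors: Gu Jun (谷俊), Weng Jia (翁佳), Xu Xin (许鑫)
Affiliations: 1. Baoshan Iron and Steel Co., Ltd., Shanghai; 2. Library, University of Shanghai for Science and Technology; 3. Department of Informatics, Business School, East China Normal University
Funding: This article is one of the research outputs of the Shanghai Science and Technology Development Fund soft science research project "Research on Intelligence Processing and Analysis Methods Based on Domain Ontology in the Big Data Environment: The Steel Industry as a Case" (project no. 14692107100).
Abstract: Topic-focused crawling of the Internet is an important means of acquiring intelligence. In response to the explosive growth of Internet information resources, the authors design and implement a topic crawling tool made up of several stages: crawl preparation, URL analysis and extraction, template learning, and body-text extraction. The URL analysis and extraction stage uses a URL filtering method based on link type to select the URLs of body-text pages. The template learning and body-text extraction stages use a node comparison method based on the DOM tree to build templates and extract body text. Experimental results show that the proposed tool achieves relatively high crawling accuracy and can meet current needs for intelligence information collection.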

Keywords: web crawler; topic crawling; link filtering; DOM tree
Received: 2014-08-04
Revised: 2014-09-04

Design and Implementation of the Topic Information Crawler for Intelligence Acquisition
Gu Jun, Weng Jia, Xu Xin. Design and Implementation of the Topic Information Crawler for Intelligence Acquisition[J]. Library and Information Service, 2014, 58(20): 91-99.
Authors: Gu Jun, Weng Jia, Xu Xin
Institution: 1. Baoshan Iron and Steel Co. Ltd., Shanghai 201900; 2. University of Shanghai for Science and Technology, Shanghai 200093; 3. Department of Informatics, East China Normal University Business School, Shanghai 200241
Abstract: Topic information collection from the Internet is an important means of acquiring intelligence. A topic information crawler is designed and implemented to cope with the explosive growth of Internet information resources. The crawler comprises four stages: acquisition preparation, URL analysis and extraction, template learning, and text extraction. The URL analysis and extraction stage uses a URL filtering method based on link types to select the URLs of Web pages that contain body text. The template learning and text extraction stages use a node comparison method based on the DOM tree to construct templates and extract text. Experimental results show that the crawler gathers information with high accuracy and can therefore meet the current need for intelligence acquisition.
Keywords: Web crawler; topic information acquisition; link filtering; DOM tree
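
The abstract names two techniques without giving their details: URL filtering by link type, and DOM-tree node comparison for template learning and text extraction. The Python sketch below illustrates one plausible reading of each and is not the authors' implementation. The link types (`content`, `list`, `other`), the regular expressions, the function names, the `example.com` URLs and the sample pages are all assumptions made for illustration. Template learning is simplified here to comparing text nodes, identified by their tag path, across two pages from the same site.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


def classify_link(url: str) -> str:
    """Assign a hypothetical link type to a URL from its path alone."""
    path = urlparse(url).path
    if re.search(r"\.(jpg|png|gif|css|js|pdf|zip)$", path, re.I):
        return "other"  # static resources are never body-text pages
    # Assumed heuristic: article URLs end in a numeric id, optionally + .htm(l).
    if re.search(r"\d{4,}(\.s?html?)?$", path):
        return "content"
    return "list"  # everything else is treated as a hub or list page


class _TextCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []  # open tags from the root down to the current node
        self.nodes = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        in_code = bool(self.stack) and self.stack[-1] in ("script", "style")
        if text and not in_code:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html: str) -> list:
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_template(page_a: str, page_b: str) -> set:
    """Nodes with the same path and text on both pages count as template."""
    return set(text_nodes(page_a)) & set(text_nodes(page_b))


def extract_text(page: str, template: set) -> str:
    """Keep only the text nodes that the learned template does not cover."""
    kept = [text for path, text in text_nodes(page) if (path, text) not in template]
    return "\n".join(kept)


# Two made-up pages that share navigation and footer but differ in the article.
PAGE_A = ("<html><body><div id='nav'>Home | News</div>"
          "<div><h1>Steel output rises</h1><p>Output grew in August.</p></div>"
          "<div>Copyright</div></body></html>")
PAGE_B = ("<html><body><div id='nav'>Home | News</div>"
          "<div><h1>Ore prices fall</h1><p>Prices dropped this week.</p></div>"
          "<div>Copyright</div></body></html>")

template = learn_template(PAGE_A, PAGE_B)
print(classify_link("http://example.com/news/2014/0804/123456.html"))
print(extract_text(PAGE_A, template))
```

In this reading, only URLs classified as `content` would be passed to the extraction stage, and a template learned from a pair of pages would be reused for other pages of the same site. A production crawler would likely compare full DOM subtrees and use more than two sample pages; the pairwise text-node comparison is only the smallest version of the idea.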