Targeted Websites Harvest System Based on Nutch

Cite this article:	Xu Jian, Zhang Zhixiong. Targeted Websites Harvest System Based on Nutch[J]. New Technology of Library and Information Service, 2009, 25(4): 1-6.

Authors:	Xu Jian, Zhang Zhixiong

Institution:	1. National Science Library, Chinese Academy of Sciences, Beijing 100190, China; Department of Information Management, Sun Yat-Sen University, Guangzhou 510275, China 2. National Science Library, Chinese Academy of Sciences, Beijing 100190, China

Funding:	Sub-project of the National Key Technology R&D Program of China

Abstract:	Based on a comparative analysis of representative open-source web crawling software (Nutch, Heritrix, WCT, and Web-Harvest), the paper proposes a targeted website harvest system built on Nutch. It then examines four key issues for such a system: selecting the initial seed websites, managing the harvest process, removing noise from web page content, and discovering new seed websites.

Keywords:	targeted website harvest system; Nutch; website crawling; web page denoising

Received:	2009-02-17

Revised:	2009-04-01
This article is indexed in Wanfang Data and other databases.