Research on Incremental Web Crawler Based on Heritrix |
| |
Cite this article: | Zhang Hao, Zhou Xueguang. Research on Incremental Web Crawler Based on Heritrix[J]. 人天科学研究, 2013(11): 135-137. |
| |
Authors: | Zhang Hao, Zhou Xueguang |
| |
Affiliation: | Department of Information Security, Naval University of Engineering, Wuhan, Hubei 430033, China |
| |
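Abstract: | The working principle and architecture of the open-source web crawler Heritrix are analyzed. Because stock Heritrix can only crawl an entire website exhaustively, Heritrix is modified by adding an incremental crawling module based on a Hash algorithm. Experiments show that the improved Heritrix can effectively crawl web pages incrementally.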
|
Keywords: | Heritrix; Hash; web crawler; incremental crawling |
Research on Incremental Web Crawler Based on Heritrix |
| |
Abstract: | This paper analyzes the working principle and architecture of the open-source web crawler Heritrix. Because Heritrix can only perform a full crawl of an entire site, it is improved by adding an incremental crawling module based on a Hash algorithm. Experiments show that the improved Heritrix can crawl web pages incrementally and effectively. |
| |
Keywords: | Heritrix Hash Web Crawler Incremental Crawling |
This article is indexed in 维普 (VIP) and other databases. |
|
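The abstract describes a Hash-based incremental crawling module but does not say which hash function is used or what exactly is hashed. The following is a minimal sketch of the general idea only, not the authors' Heritrix module: it assumes a SHA-1 digest over the page body, keyed by URL, and the names `ContentHashStore` and `incremental_crawl` are hypothetical.

```python
import hashlib


class ContentHashStore:
    """Remembers the last content digest seen for each URL (hypothetical helper)."""

    def __init__(self):
        self._digests = {}  # url -> hex digest of the last stored body

    def has_changed(self, url: str, body: bytes) -> bool:
        """Return True if the page is new or its body differs from the last visit."""
        # Assumption: SHA-1 over the raw body; the paper does not name the hash.
        digest = hashlib.sha1(body).hexdigest()
        if self._digests.get(url) == digest:
            return False  # unchanged since the previous crawl, skip it
        self._digests[url] = digest
        return True


def incremental_crawl(pages, store):
    """Return the URLs whose content is new or changed since the last pass.

    `pages` is an iterable of (url, body) pairs standing in for fetched pages.
    """
    return [url for url, body in pages if store.has_changed(url, body)]
```

On a second pass over the same site, only pages whose digest differs from the stored one would be written out again, which is what distinguishes an incremental crawl from a full re-crawl.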