Focused web crawling in the acquisition of comparable corpora

Authors:	Tuomas Talvensaari, Ari Pirkola, Kalervo Järvelin, Martti Juhola, Jorma Laurikkala

Affiliation:	(1) Department of Computer Sciences, University of Tampere, Kanslerinrinne 1, Tampere, 33014, Finland; (2) Department of Information Studies, University of Tampere, Kanslerinrinne 1, Tampere, 33014, Finland

Abstract:	Cross-Language Information Retrieval (CLIR) resources, such as dictionaries and parallel corpora, are scarce for special domains. Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes of data, offers a natural source for this. We experimented with focused crawling as a means to acquire comparable corpora in the genomics domain. The acquired corpora were used to statistically translate domain-specific words. The same words were also translated using a high-quality but non-genomics-related parallel corpus, which fared considerably worse. We also evaluated our system with standard information retrieval (IR) experiments, combining statistical translation using the Web corpora with dictionary-based translation. The results showed improvement over pure dictionary-based translation. Therefore, mining the Web for comparable corpora seems promising.

Keywords:	Cross-language information retrieval; Focused crawling; Comparable corpora
This article is indexed in SpringerLink and other databases.