Focused web crawling in the acquisition of comparable corpora
Authors: Tuomas Talvensaari, Ari Pirkola, Kalervo Järvelin, Martti Juhola, Jorma Laurikkala
Affiliation: (1) Department of Computer Sciences, University of Tampere, Kanslerinrinne 1, Tampere, 33014, Finland; (2) Department of Information Studies, University of Tampere, Kanslerinrinne 1, Tampere, 33014, Finland
Abstract: Cross-Language Information Retrieval (CLIR) resources, such as dictionaries and parallel corpora, are scarce for special domains. Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes of data, offers a natural source for this. We experimented with focused crawling as a means to acquire comparable corpora in the genomics domain. The acquired corpora were used to statistically translate domain-specific words. The same words were also translated using a high-quality but non-genomics-related parallel corpus, which fared considerably worse. We also evaluated our system with standard information retrieval (IR) experiments, combining statistical translation using the Web corpora with dictionary-based translation. The results showed improvement over pure dictionary-based translation. Therefore, mining the Web for comparable corpora seems promising.
Keywords: Cross-language information retrieval, Focused crawling, Comparable corpora
This article is indexed in SpringerLink and other databases.