Found 18 similar documents; search took 250 ms.
1.
Toshiaki Nakazawa Manabu Yaguchi Kiyotaka Uchimoto Masao Utiyama Eiichiro Sumita Sadao Kurohashi Hitoshi Isahara 何彦青 刘建辉 《情报工程》2017,3(3):040-046
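This paper describes ASPEC (Asian Scientific Paper Excerpt Corpus) in detail. The first large-scale parallel corpus in the scientific paper domain, ASPEC was built between 2006 and 2010 by the Japanese-Chinese machine translation project with support from the Special Coordination Funds for Promoting Science and Technology. It consists of a Japanese-English scientific paper abstract corpus of about 3 million parallel sentences (ASPEC-JE) and a Chinese-Japanese scientific paper excerpt corpus of about 680,000 parallel sentences (ASPEC-JC). ASPEC serves as the official dataset of the machine translation evaluation workshop WAT (Workshop on Asian Translation).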
2.
Building an Information Service Platform Based on Bilingual Parallel Corpora  Cited by: 2 (self-citations: 0, citations by others: 2)
王传英 《图书馆工作与研究》2010,(12)
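Public libraries are major providers of public information, and foreign-language literature resources are an important part of that information. The language barrier has become the biggest bottleneck limiting the use of foreign-language literature in Chinese libraries. Corpus construction in China, which began in the 1980s, has laid a solid foundation for translation education and the translation industry, and bilingual parallel corpora are widely used in teaching institutions and translation companies. To overcome the language barrier, translation companies and public libraries should draw on their respective resource and technical strengths and jointly build an information service platform based on bilingual parallel corpora, improving the quality and functionality of public information services.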
3.
4.
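Targeting bilingual term extraction, this paper proposes a scheme for building domain-specific comparable corpora and validates it experimentally. For a given subject domain, Chinese and English specialized texts are collected separately and keywords are extracted from each; further related keywords for the domain are obtained from word co-occurrence statistics. These keywords serve as queries to an academic search engine, which retrieves candidate comparable texts from the Web. The candidates are evaluated quantitatively to filter out texts that do not meet the requirements, yielding a comparable corpus for the specific subject domain.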
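The keyword-expansion step in this abstract (finding further related keywords from co-occurrence statistics) could look roughly like the sketch below. The abstract does not give the actual statistic, so document-level co-occurrence counting is an assumption, and the function name `expand_keywords` and the toy documents are hypothetical.

```python
from collections import Counter

def expand_keywords(docs, seeds, top_n=3):
    """Return the top_n non-seed terms that co-occur most often with a seed.

    Assumption: co-occurrence is counted at document level; the paper's
    actual statistic is not stated in the abstract.
    """
    seeds = set(seeds)
    counts = Counter()
    for doc in docs:
        terms = set(doc)
        if terms & seeds:                  # document mentions a seed keyword
            counts.update(terms - seeds)   # count every other term once
    # Sort by count (descending), then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]

# Hypothetical, already-tokenized domain documents.
docs = [
    ["gene", "genome", "sequencing"],
    ["gene", "expression", "genome"],
    ["football", "league"],
    ["genome", "sequencing", "assembly"],
]
related = expand_keywords(docs, ["gene"], top_n=2)
print(related)
```

The expanded keywords would then be sent as queries to a search engine to collect candidate texts; that step is omitted here because it depends on an external service.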
5.
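To meet users' pressing need to express and access information in multiple languages, research on and development of comparable corpora has become a focus for information retrieval researchers and system developers. From the perspective of cross-language information retrieval, the main methods of building comparable corpora are query translation, feature filtering, intermediate-language translation, text translation, and same-source matching. In building comparable corpora in China, construction methods should be chosen according to user needs with full consideration for overall system performance, text translation and term extraction techniques should be improved, and text alignment should be optimized.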
6.
7.
Research on a Full-Text Retrieval System for Large-Scale Corpora  Cited by: 1 (self-citations: 0, citations by others: 1)
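As corpora grow and corpus-based applied research expands, full-text retrieval has become an indispensable component of corpus systems. This article studies the index model, retrieval algorithm, construction of query expressions, automatic word segmentation, and system architecture of a full-text retrieval system for large-scale corpora. To meet the needs of language information processing and applied research on large-scale corpora, it also develops a Chinese information processing system called "CIPP". The system currently supports full-text retrieval, automatic word segmentation, and language statistics; on a corpus on the order of ten million characters, its average full-text retrieval time is under 1 second.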
8.
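This paper surveys an important branch of bilingual term extraction: extraction based on comparable corpora. Current methods rest on the "context similarity" hypothesis: if two words co-occur in the source language, their translations also co-occur in the target language. Current techniques mainly consist of building context-feature models for candidate words and optimizing those models. The paper gives a preliminary evaluation standard for existing work, analyzes and summarizes both lines of research by level of method complexity, and points out open problems. It closes with an outlook on the future of bilingual term extraction based on comparable corpora.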
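The "context similarity" idea can be illustrated with a small sketch: build a context vector for a source word, map it into the target language through a seed dictionary, and rank target candidates by cosine similarity. This is a generic illustration, not the method of any specific paper covered by the survey; the window size, the seed dictionary, the toy sentences, and all function names are assumptions.

```python
import math
from collections import Counter

def context_vector(sentences, word, window=2):
    """Count the words appearing within `window` tokens of `word`."""
    vec = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok == word:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vec[sent[j]] += 1
    return vec

def translate_vector(vec, seed_dict):
    """Map source-language context words to the target language; drop unknowns."""
    out = Counter()
    for w, c in vec.items():
        if w in seed_dict:
            out[seed_dict[w]] += c
    return out

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(src_sents, tgt_sents, src_word, candidates, seed_dict):
    """Rank target candidates by similarity to the translated source context."""
    src_vec = translate_vector(context_vector(src_sents, src_word), seed_dict)
    scored = [(cosine(src_vec, context_vector(tgt_sents, c)), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda sc: (-sc[0], sc[1]))]

# Hypothetical toy comparable corpus (pre-tokenized) and seed dictionary.
src_sents = [["基因", "表达", "调控"], ["基因", "突变", "表达"]]
tgt_sents = [["gene", "expression", "regulation"], ["protein", "folding", "structure"]]
seed_dict = {"表达": "expression", "调控": "regulation"}

ranking = rank_candidates(src_sents, tgt_sents, "基因", ["gene", "protein"], seed_dict)
print(ranking)
```

Raw co-occurrence counts are used for brevity; the model-optimization work the survey mentions would typically replace them with a weighted association measure.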
9.
10.
11.
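Bilingual corpora are increasingly important in natural language processing areas such as machine translation, cross-language information retrieval, and translation dictionary compilation. This study uses patent family documents as a source of bilingual text and explores the feasibility of acquiring bilingual corpora from patent families. Taking Chinese-English as an example, it proposes a workflow for acquiring bilingual text and studies alignment rules for the mutually translated portions, thereby building a parallel bilingual corpus for science and technology. Finally, it discusses caveats and application prospects of the method.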
12.
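Because sentences in patent documents tend to be long, the training corpus for statistical machine translation is split into clauses to obtain bilingual clause sequences, and clause alignments are then generated with a combination of statistical and rule-based strategies. A bilingual corpus of simple clauses is built and used to retrain the statistical machine translation system. This improves, to some extent, the phrase and word alignments of the original bilingual training corpus and exploits more of the translation information contained in the parallel corpus. Applied to patent statistical machine translation and evaluated on the NTCIR-9 test set, the approach achieves fairly satisfactory translation results.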
13.
《Journal of Broadcasting & Electronic Media》2013,57(3):537-554
This study employs a critical historical approach to situate a corpus of 106 post-9/11 anti-Arab Web cartoons as populist wartime narrative that remediates U.S. racist animation and racist wartime cartoons produced during World War II. Analysis of the production, distribution, and exhibition circumstances, as well as general narrative strategies deployed in the animations, demonstrates that these amateur texts resurrect and reproduce racist narrative strategies employed historically in professionally produced government-sanctioned animation. These cartoons illustrate how animators can use the Web as a folk venue for racist wartime animations that are currently unrepresentable by dominant mass media.
14.
15.
Anchor texts complement Web page content and have been used extensively in commercial Web search engines. Existing methods for anchor text weighting rely on the hyperlink information created by page content editors. Since anchor texts are created to help users browse the Web, the browsing behavior of Web users may also provide useful or complementary information for anchor text weighting. In this paper, we discuss the possibility and effectiveness of incorporating the browsing activities of Web users into anchor texts for Web search. We first analyze the effectiveness of anchor texts with browsing activities. We then propose two new anchor models that incorporate browsing activities. To deal with the data sparseness problem of user-clicked anchor texts, two features of users' browsing behavior are explored and analyzed. Based on these features, a smoothing method for the new anchor models is proposed. Experimental results show that by incorporating browsing activities the new anchor models outperform the state-of-the-art anchor models that use only the hyperlink information. This study demonstrates the benefits of using Web browsing activities in anchor text weighting.
16.
The need to cluster small text corpora composed of a few hundred short texts arises in various applications, e.g., clustering top-retrieved documents based on their snippets. This clustering task is challenging due to the vocabulary mismatch between short texts and the insufficient corpus-based statistics (e.g., term co-occurrence statistics) due to the corpus size. We address this clustering challenge using a framework that utilizes a set of external knowledge resources that provide information about term relations. Specifically, we use information induced from the resources to estimate similarity between terms and produce term clusters. We also utilize the resources to expand the vocabulary used in the given corpus and thus enhance term clustering. We then project the texts in the corpus onto the term clusters to cluster the texts. We evaluate various instantiations of the proposed framework by varying the term clustering method used, the approach of projecting the texts onto the term clusters, and the way of applying external knowledge resources. Extensive empirical evaluation demonstrates the merits of our approach with respect to applying clustering algorithms directly on the text corpus, and using state-of-the-art co-clustering and topic modeling methods.
17.
Focused web crawling in the acquisition of comparable corpora  Cited by: 2 (self-citations: 0, citations by others: 2)
Tuomas Talvensaari Ari Pirkola Kalervo Järvelin Martti Juhola Jorma Laurikkala 《Information Retrieval》2008,11(5):427-445
Cross-Language Information Retrieval (CLIR) resources, such as dictionaries and parallel corpora, are scarce for special domains. Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes of data, offers a natural source for this. We experimented with focused crawling as a means to acquire comparable corpora in the genomics domain. The acquired corpora were used to statistically translate domain-specific words. The same words were also translated using a high-quality, but non-genomics-related parallel corpus, which fared considerably worse. We also evaluated our system with standard information retrieval (IR) experiments, combining statistical translation using the Web corpora with dictionary-based translation. The results showed improvement over pure dictionary-based translation. Therefore, mining the Web for comparable corpora seems promising.
18.
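Faced with ever-growing multilingual information resources, cross-language information retrieval has become a key technology for global knowledge access and sharing. This work builds a practical query translation interface for cross-language retrieval that can be embedded easily into any information retrieval platform, extending the platform's multilingual processing capability. The interface uses several translation disambiguation strategies, including longest-phrase matching, query classification, and a probabilistic dictionary. It is evaluated in terms of both query translation accuracy and runtime efficiency, and the experimental results confirm that the approach is feasible.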
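Two of the disambiguation strategies named in this abstract, longest-phrase matching and a probabilistic dictionary, can be combined as in the sketch below. It is only an illustration under stated assumptions: the dictionary entries, the probabilities, the translation direction, and the name `translate_query` are hypothetical, and the query-classification strategy is not modeled.

```python
def translate_query(tokens, prob_dict, max_len=4):
    """Translate a tokenized query, preferring the longest dictionary phrase.

    prob_dict maps a source phrase (tokens joined by spaces) to a dict of
    {translation: probability}. Both the structure and its contents are
    assumptions made for this sketch.
    """
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest span starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in prob_dict:
                # Pick the most probable translation for the matched phrase.
                best = max(prob_dict[phrase].items(), key=lambda kv: kv[1])[0]
                out.append(best)
                i += n
                break
        else:
            # No dictionary entry covers this token: keep it untranslated.
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical probabilistic dictionary.
prob_dict = {
    "cross language": {"跨语言": 1.0},
    "information retrieval": {"信息检索": 0.9, "情报检索": 0.1},
    "information": {"信息": 0.8, "情报": 0.2},
    "retrieval": {"检索": 1.0},
}

translated = translate_query(
    ["cross", "language", "information", "retrieval", "system"], prob_dict
)
print(translated)
```

Matching the longest phrase first keeps multiword terms such as "information retrieval" from being translated word by word, which is where much of the ambiguity in short queries comes from.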