Similar Documents
20 similar documents found (search time: 156 ms)
1.
袁煜 《人天科学研究》2011,10(1):186-188
Regular expressions are used throughout corpus-based foreign-language research and teaching: corpus preparation, annotation, corpus building, corpus-file management, and corpus retrieval all depend on this powerful formal language. Through worked examples, this study introduces the application of regular expressions in three main areas, namely corpus processing, corpus-based classroom teaching, and corpus-based individualized research, and offers writing advice for foreign-language researchers and teachers who are new to regular expressions.
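To illustrate the kind of regular-expression use the abstract describes, here is a minimal, hedged Python sketch of a KWIC (keyword-in-context) search over a plain-text corpus file; the pattern, the file name "corpus.txt", and the context width are illustrative assumptions, not taken from the article.

```python
import re

# Minimal KWIC (keyword-in-context) concordance sketch.
# "corpus.txt" and the search pattern are placeholder assumptions.
pattern = re.compile(r"\b\w+ing\b", re.IGNORECASE)  # e.g. all -ing forms

with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()

for m in pattern.finditer(text):
    left = text[max(0, m.start() - 30):m.start()]
    right = text[m.end():m.end() + 30]
    # One concordance line: left context | hit | right context
    print(f"{left:>30} | {m.group(0):^12} | {right:<30}")
```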

2.
In social tagging systems, the semantic vagueness and inconsistent form of tags make resource management and sharing increasingly difficult. To pin down tag semantics, this article reviews the state of research on tag semantic retrieval from two angles, extending tag semantics and letting tag semantics emerge, analyses current trends and shortcomings in social tagging systems, and concludes that future work should pursue approaches that are highly computable, easy to operationalise, and able to acquire semantic relations between tags automatically.

3.
This paper sets out a construction scheme for an English-Chinese bilingual corpus platform in the computing domain, a ten-step pipeline that includes noise reduction, feature-chunk extraction, keyword annotation, Chinese word segmentation, word-frequency statistics, paragraph alignment annotation, and sentence alignment annotation; on this basis we propose a definition of the feature chunk.
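As a rough illustration of two of the listed steps, Chinese word segmentation and word-frequency statistics, the following hedged Python sketch uses the open-source jieba segmenter and a frequency counter; the sample sentences and the punctuation filter are assumptions, not the platform described in the abstract.

```python
from collections import Counter

import jieba  # a widely used open-source Chinese segmenter; assumed here

# Placeholder Chinese sentences standing in for raw corpus text.
sentences = ["正则表达式是一种强大的形式语言。", "语料库检索离不开分词。"]

freq = Counter()
for sent in sentences:
    tokens = jieba.lcut(sent)  # Chinese word segmentation
    freq.update(t for t in tokens if t.strip() and t not in "。，、")

# Word-frequency statistics, highest first.
for word, count in freq.most_common(10):
    print(word, count)
```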

4.
Based on a large-scale annotated corpus of modern Chinese, this study constructs an instrumental sentence-pattern system and a sentence-model system for modern Chinese, examines the correspondence between the two, and identifies a number of high-frequency sentence stems.

5.
In response to the national call for a "digital campus" and to give students a convenient, engaging information-technology environment, this paper presents the design of a remote multimedia classroom. The classroom supports multiple broadcast modes, remote control of DVD playback, interactive video intercom, live recording of lectures, and synchronized download and playback. The design covers the overall system, the structure of the branch classrooms, and the software system; it draws on a range of video and communication technologies, provides schematic and wiring diagrams for the main and branch classrooms, estimates the required cost, and offers suggestions for further optimization.

6.
This paper discusses methods for controlling background music (sound) in multimedia courseware built with Authorware. Three approaches are covered: the Sound icon, the DirectMediaXtra plug-in, and external functions. They make playing and controlling sound in courseware more convenient and efficient, and the courseware itself more flexible and lively.

7.
The legal FrameNet corpus system is built to process legal language data and to give forensic linguists and legal-information seekers a powerful retrieval tool. This paper proposes design principles for the system and principles for corpus selection, and discusses the system design model, the database design, and the implemented functions; in particular, the corpus-statistics and knowledge-discovery functions make the system considerably more capable than an ordinary corpus system.

8.
Current Status, Bottlenecks, and Trends of Web Information Retrieval Technology (cited by 25: 0 self-citations, 25 by others)
龚蛟腾 《情报杂志》2004,23(5):75-77
Current web information retrieval technology mainly comprises resource-location retrieval, hyperlink search, web search engines, and general-purpose information retrieval. The bottlenecks holding back its development are image, audio, and video retrieval, automatic Chinese word segmentation, and the shortcomings of search engines. Intelligent retrieval, knowledge retrieval, multimedia retrieval, next-generation search engines, natural-language retrieval, and content-based retrieval are the core of its future development.

9.
Annotating the semantic information of sentences in a corpus requires a complete, well-specified scheme. Annotating with Chinese FrameNet frames and their frame-element inventory, this article first introduces the idea of frame-based semantic annotation, then sets out the concrete annotation rules from three aspects: syntactic function, phrase type, and frame-element labelling; finally, by comparing it with other semantic annotation schemes, it identifies the distinctive features of frame-semantic annotation.
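As a sketch of what a frame-semantic annotation record covering the three aspects above (frame elements with phrase type and syntactic function) might look like, here is a hypothetical Python data structure; the sentence, frame name, and labels are illustrative assumptions and are not taken from the Chinese FrameNet specification.

```python
# Hypothetical frame-semantic annotation of one Chinese sentence.
annotation = {
    "sentence": "他用钥匙打开了门。",  # "He opened the door with a key."
    "target": "打开",                 # frame-evoking word
    "frame": "Opening",               # assumed frame name
    "frame_elements": [
        {"text": "他",     "element": "Agent",      "phrase_type": "NP", "syntactic_function": "Subject"},
        {"text": "用钥匙", "element": "Instrument", "phrase_type": "PP", "syntactic_function": "Adverbial"},
        {"text": "门",     "element": "Theme",      "phrase_type": "NP", "syntactic_function": "Object"},
    ],
}

for fe in annotation["frame_elements"]:
    print(fe["element"], "=", fe["text"], f"({fe['phrase_type']}/{fe['syntactic_function']})")
```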

10.
刘芳 《科教文汇》2014,(26):123-124
Using corpus methods, this paper aligns and annotates the bilingual texts, builds a database from them, retrieves the various categories of culture-loaded words with corpus retrieval software, and analyses and tallies the translation strategies applied to them. The data show that Howard Goldblatt's translation of the culture-loaded words in Red Sorghum (《红高粱家族》) relies mainly on domestication, a conclusion consistent both with the mainstream model of translating Chinese literature into English and with the translator's own views on translation. Looking across the categories, the translator prefers domestication for material culture-loaded words that lack ready equivalents and for social and religious culture-loaded words that easily cause cultural clashes, while using foreignization extensively for linguistic culture-loaded words so as to preserve the language features and exotic flavour of the original.

11.
In this paper, we compile and review several experiments measuring cross-lingual information retrieval (CLIR) performance as a function of the following resources: bilingual term lists, parallel corpora, machine translation (MT), and stemmers. Our CLIR system uses a simple probabilistic language model; the studies used TREC test corpora over Chinese, Spanish and Arabic. Our findings include:
  • One can achieve acceptable CLIR performance using only a bilingual term list (70–80% of monolingual performance on Chinese and Arabic corpora).
  • However, if a bilingual term list and parallel corpora are available, CLIR performance can rival monolingual performance.
  • If no parallel corpus is available, pseudo-parallel texts produced by an MT system can partially overcome the lack of parallel text.
  • While stemming is normally useful, with a very large Arabic–English parallel corpus, stemming hurt performance in our empirical studies with Arabic, a highly inflected language.
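The first finding, that a bilingual term list alone gives reasonable CLIR performance, can be illustrated with a hedged sketch of dictionary-based query translation; the tiny term list, query, and uniform translation weights below are assumptions, not the probabilistic language model used in the paper.

```python
# Toy bilingual term list: source-language term -> candidate English translations.
term_list = {
    "computadora": ["computer"],
    "red": ["network", "net", "web"],
    "búsqueda": ["search", "retrieval"],
}

def translate_query(query_terms):
    """Replace each source term by all listed translations, weighted uniformly."""
    weighted = {}
    for term in query_terms:
        translations = term_list.get(term, [term])  # keep untranslatable terms as-is
        w = 1.0 / len(translations)
        for t in translations:
            weighted[t] = weighted.get(t, 0.0) + w
    return weighted

# The weighted translated query would then be passed to the retrieval model.
print(translate_query(["búsqueda", "red", "computadora"]))
```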

12.
Pseudo-relevance feedback (PRF) is a classical technique to improve search engine retrieval effectiveness, by closing the vocabulary gap between users’ query formulations and the relevant documents. While PRF is typically applied on the same target corpus as the final retrieval, in the past, external expansion techniques have sometimes been applied to obtain a high-quality pseudo-relevant feedback set using an external corpus. However, such external expansion approaches have only been studied for sparse (BoW) retrieval methods, and their effectiveness for recent dense retrieval methods remains under-investigated. Indeed, dense retrieval approaches such as ANCE and ColBERT, which conduct similarity search based on encoded contextualised query and document embeddings, are of increasing importance. Moreover, pseudo-relevance feedback mechanisms have been proposed to further enhance dense retrieval effectiveness. In particular, in this work, we examine the application of dense external expansion to improve zero-shot retrieval effectiveness, i.e. evaluation on corpora without further training. Zero-shot retrieval experiments with six datasets, including two TREC datasets and four BEIR datasets, when applying the MSMARCO passage collection as the external corpus, indicate that obtaining external feedback documents using ColBERT can significantly improve NDCG@10 for sparse retrieval (by up to 28%) and dense retrieval (by up to 12%). In addition, using ANCE on the external corpus brings up to 30% NDCG@10 improvements for sparse retrieval and up to 29% for dense retrieval.
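To make the idea of external expansion concrete, here is a heavily simplified, hedged Python sketch of pseudo-relevance feedback against an external corpus using plain term overlap and frequency; it is a sparse stand-in for the dense ColBERT/ANCE feedback studied in the paper, and the toy corpus, k, and number of expansion terms are assumptions.

```python
from collections import Counter

# Toy "external corpus" standing in for MSMARCO passages.
external_corpus = [
    "dense retrieval encodes queries and documents into embeddings",
    "pseudo relevance feedback expands the query with terms from top documents",
    "colbert performs late interaction over contextualised token embeddings",
]

def external_prf(query, corpus, k=2, n_expansion_terms=3):
    """Score external docs by term overlap, then add frequent terms from the top-k."""
    q_terms = set(query.split())
    ranked = sorted(corpus, key=lambda d: len(q_terms & set(d.split())), reverse=True)
    feedback_terms = Counter()
    for doc in ranked[:k]:
        feedback_terms.update(t for t in doc.split() if t not in q_terms)
    expansion = [t for t, _ in feedback_terms.most_common(n_expansion_terms)]
    return query.split() + expansion

print(external_prf("pseudo relevance feedback", external_corpus))
```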

13.
曲琳琳 《情报科学》2021,39(8):132-138
[Purpose/Significance] The goal of cross-language information retrieval (CLIR) research is to remove the difficulties that language differences create for information search, to improve the efficiency of locating specific information in large and heterogeneous collections, and to give users a convenient way to retrieve documents written in another language using the language they are familiar with. [Method/Process] Based on an analysis of the state of CLIR research at home and abroad, this paper introduces the current query-translation approaches, namely direct query translation, document translation, pivot-language translation, and combined query-document translation, and compares their effectiveness; it then describes the key technologies of CLIR and proposes solutions to, and evaluations of, the ambiguity that arises when bilingual dictionaries, corpora, and machine translation are used. [Result/Conclusion] Natural-language processing, co-occurrence analysis, relevance feedback, query expansion, bidirectional translation, and ontology-based retrieval can ensure the coverage of the knowledge dictionary and the handling of ambiguity; experiments on cross-language retrieval show that combining a knowledge dictionary, corpora, and a search engine improves retrieval efficiency. [Innovation/Limitation] To address the shortage of terms in dictionaries and corpora for CLIR, the paper proposes enriching the corpus with sentence pairs obtained from web pages through search engines. The discussion focuses on Chinese-English retrieval; the proposed solutions need further study, as problems such as the difficulty of Chinese word segmentation and low dictionary coverage still seriously affect retrieval efficiency.

14.
A main challenge in Cross-Language Information Retrieval (CLIR) is to estimate a proper translation model from available translation resources, since translation quality directly affects the retrieval performance. Among different translation resources, we focus on obtaining translation models from comparable corpora, because they provide appropriate translations for both languages and domains with limited linguistic resources. In this paper, we employ a two-step approach to build an effective translation model from comparable corpora, without requiring any additional linguistic resources, for the CLIR task. In the first step, translations are extracted by deriving correlations between source–target word pairs. These correlations are used to estimate word translation probabilities in the second step. We propose a language modeling approach for the first step, where modeling based on probability distribution provides two key advantages. First, our approach can be tuned more easily than the heuristically adjusted previous work. Second, it provides a principled basis for integrating additional lexical and translational relations to improve the accuracy of translations from comparable corpora. As an indication, we integrate monolingual relations of word co-occurrences into the process of translation extraction, which helps to extract more reliable translations for low-frequency words in a comparable corpus. Experimental results on an English–Persian comparable corpus show that our method outperforms the previous approaches in terms of both translation quality and the performance of CLIR. Indeed, the proposed method is naturally applicable to any comparable corpus, regardless of its languages. In addition, we demonstrate the significant impact of word translation probabilities, estimated in the second step of our approach, on the performance of CLIR.
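A much-simplified, hedged sketch of the second step, turning source-target correlation scores into word translation probabilities by normalisation, is shown below; the toy correlation values are assumptions, and the paper's actual first-step language-modelling estimation is not reproduced.

```python
# Toy correlation scores between a source word and candidate target words,
# e.g. derived from aligned comparable document pairs (values are assumptions).
correlations = {
    "کتاب": {"book": 8.0, "library": 2.0, "paper": 1.0},  # Persian "ketab"
}

def translation_probabilities(corr):
    """Normalise per-source correlations into P(target | source)."""
    probs = {}
    for src, cands in corr.items():
        total = sum(cands.values())
        probs[src] = {tgt: score / total for tgt, score in cands.items()}
    return probs

print(translation_probabilities(correlations))
# e.g. {'کتاب': {'book': 0.73, 'library': 0.18, 'paper': 0.09}} (rounded)
```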

15.
Optical Character Recognition (OCR) errors in text harm information retrieval performance. Much research has been reported on modelling and correcting OCR errors. Most of the prior work employs language-dependent resources or training texts to study the nature of the errors. However, not much research has been reported that focuses on improving retrieval performance from erroneous text in the absence of training data. We propose a novel approach for detecting OCR errors and improving retrieval performance from an erroneous corpus in a situation where training samples are not available to model errors. In this paper we propose a method that automatically identifies erroneous term variants in the noisy corpus, which are used for query expansion, in the absence of clean text. We employ an effective combination of contextual information and string matching techniques. Our proposed approach automatically identifies the erroneous variants of query terms and consequently leads to improvement in retrieval performance through query expansion. Our proposed approach does not use any training data or any language-specific resources, such as a thesaurus, for identification of error variants. It also does not exploit any knowledge about the language except that the word delimiter is blank space. We have tested our approach on erroneous Bangla (Bengali) and Hindi FIRE collections, and also on the TREC Legal IIT CDIP and TREC 5 Confusion track English corpora. Our proposed approach has achieved statistically significant improvements over the state-of-the-art baselines on most of the datasets.
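A hedged Python sketch of the general idea, finding likely OCR variants of a query term by string similarity over the noisy corpus vocabulary and using them for query expansion, follows; the toy vocabulary, the similarity cutoff, and the expansion strategy are illustrative assumptions, not the paper's combined context-plus-string-matching method.

```python
import difflib

# Toy vocabulary extracted from a noisy OCR'd corpus (values are assumptions).
corpus_vocabulary = ["government", "govermnent", "gov3rnment", "governance", "judgment"]

def ocr_variants(term, vocabulary, cutoff=0.85):
    """Return vocabulary words that are close string matches to the query term."""
    return [w for w in difflib.get_close_matches(term, vocabulary, n=10, cutoff=cutoff)
            if w != term]

def expand_query(query_terms, vocabulary):
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(ocr_variants(term, vocabulary))  # add likely OCR variants
    return expanded

print(expand_query(["government"], corpus_vocabulary))
# e.g. ['government', 'govermnent', 'gov3rnment']
```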

16.
The generation of stereotypes allows us to simplify the cognitive complexity we have to deal with in everyday life. Stereotypes are extensively used to describe people who belong to a different ethnic group, particularly in racial hoaxes and hateful content against immigrants. This paper addresses the study of stereotypes from a novel perspective that involves both psychology and computational linguistics. On the one hand, it describes an Italian social media corpus built within a social psychology study, where stereotypes and related forms of discredit were made explicit through annotation. On the other hand, it provides some lexical analysis to bring out the linguistic features of the messages collected in the corpus, and experiments for validating this annotation scheme and its automatic application to other corpora in the future. The main expected outcome is to shed some light on the usefulness of this scheme for training tools that automatically detect and label stereotypes in Italian.

17.
钟小丹  冯宗祥 《科教文汇》2013,(19):125-127
Drawing on a self-built micro-corpus of speeches by US First Lady Michelle Obama and a VOA news corpus, on Li Wenzhong's keyword-extraction method, and on the public-speaking theory of Fang Le and Diana, this paper uses the lexical retrieval software WordSmith to analyse the functions of different types of keywords in Michelle Obama's speeches: first- and second-person pronouns build a direct connection with the audience; nouns articulate shared values and outlooks on life; and positive adjectives inspire and encourage the audience. The paper offers a point of reference for research on English public speaking.
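The keyword ("keyness") comparison that WordSmith performs against a reference corpus can be approximated with the standard log-likelihood statistic; the sketch below, with made-up frequency counts, only illustrates that calculation and is not Li Wenzhong's method or WordSmith's exact implementation.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness of a word in a target corpus vs. a reference corpus."""
    e1 = size_target * (freq_target + freq_ref) / (size_target + size_ref)
    e2 = size_ref * (freq_target + freq_ref) / (size_target + size_ref)
    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / e1)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / e2)
    return 2 * ll

# Made-up counts: "we" in a 10,000-token speech corpus vs. a 1,000,000-token news corpus.
print(round(log_likelihood(120, 10_000, 3_000, 1_000_000), 2))
```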

18.
司莉  何依 《现代情报》2016,36(6):165-170
A corpus is an electronic database of naturally occurring language collected according to defined methods. Since 2000, research on multilingual corpora in China has been rising rapidly. Based on a comprehensive literature survey, this paper reviews and organises the state of multilingual-corpus research in China. Domestic work on multilingual corpora is concentrated in linguistics, followed by computer science, and the research topics fall mainly into two areas: key technologies for multilingual corpora and their applications.

19.
The paper presents new annotated corpora for performing stance detection on Spanish Twitter data, most notably health-related tweets. The objectives of this research are threefold: (1) to develop a manually annotated benchmark corpus for emotion recognition taking into account different variants of Spanish in social posts; (2) to evaluate the efficiency of semi-supervised models for extending such a corpus with unlabelled posts; and (3) to describe such short-text corpora via specialised topic modelling. A corpus of 2,801 tweets about COVID-19 vaccination was annotated by three native speakers as in favour (904), against (674) or neither (1,223), with a Fleiss' kappa of 0.725. Results show that the self-training method with an SVM base estimator can alleviate annotation work while ensuring high model performance. The self-training model outperformed the other approaches and produced a corpus of 11,204 tweets with a macro-averaged F1 score of 0.94. The combination of sentence-level deep learning embeddings and density-based clustering was applied to explore the contents of both corpora. Topic quality was measured in terms of trustworthiness and the validation index.
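The semi-supervised setup the abstract describes, self-training with an SVM base estimator over tweet text, can be sketched with scikit-learn as below; the tiny example tweets, the labels, and the threshold are placeholders (label -1 marks unlabelled posts), not the paper's data or configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Placeholder tweets; label -1 marks unlabelled posts to be pseudo-labelled.
texts = [
    "vaccines save lives", "getting my booster tomorrow", "proud to be vaccinated",
    "I will never get vaccinated", "vaccine mandates are wrong", "do not trust this vaccine",
    "got an appointment at the clinic", "thinking about the new vaccine",
]
labels = [1, 1, 1, 0, 0, 0, -1, -1]  # 1 = in favour, 0 = against, -1 = unlabelled

model = make_pipeline(
    TfidfVectorizer(),
    SelfTrainingClassifier(SVC(probability=True), threshold=0.8),
)
model.fit(texts, labels)          # unlabelled posts get pseudo-labels above the threshold
print(model.predict(["I am in favour of vaccination"]))
```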

20.
The paper describes OntoNotes, a multilingual (English, Chinese and Arabic) corpus with large-scale semantic annotations, including predicate-argument structure, word senses, ontology linking, and coreference. The underlying semantic model of OntoNotes involves word senses that are grouped into so-called sense pools, i.e., sets of near-synonymous senses of words. Such information is useful for many applications, including query expansion for information retrieval (IR) systems, (near-)duplicate detection for text summarization systems, and alternative word selection for writing support systems. Although a sense pool provides a set of near-synonymous senses of words, there is still no knowledge about whether two words in a pool are interchangeable in practical use. Therefore, this paper devises an unsupervised algorithm that incorporates Google n-grams and a statistical test to determine whether a word in a pool can be substituted by other words in the same pool. The n-gram features are used to measure the degree of context mismatch for a substitution. The statistical test is then applied to determine whether the substitution is adequate based on the degree of mismatch. The proposed method is compared with a supervised method, namely Linear Discriminant Analysis (LDA). Experimental results show that the proposed unsupervised method can achieve comparable performance with the supervised method.
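A hedged sketch of the core idea, scoring a candidate substitution by how well it preserves n-gram context counts, is given below; the tiny trigram table and the simple ratio threshold stand in for the Google n-grams and the statistical test used in the paper.

```python
# Toy trigram counts standing in for Google n-gram frequencies (assumed values).
trigram_counts = {
    ("strong", "coffee", "please"): 900,
    ("powerful", "coffee", "please"): 15,
    ("strong", "argument", "against"): 400,
    ("powerful", "argument", "against"): 380,
}

def substitutable(word, candidate, context, counts, min_ratio=0.2):
    """Accept the substitution if the candidate keeps a reasonable share of the
    original word's context frequency (a crude stand-in for a statistical test)."""
    original = counts.get((word, *context), 0)
    substituted = counts.get((candidate, *context), 0)
    if original == 0:
        return False
    return substituted / original >= min_ratio

print(substitutable("strong", "powerful", ("coffee", "please"), trigram_counts))     # False
print(substitutable("strong", "powerful", ("argument", "against"), trigram_counts))  # True
```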
