Similar Literature
A total of 20 similar documents were found (search time: 31 ms).
1.
In this paper, we compile and review several experiments measuring cross-lingual information retrieval (CLIR) performance as a function of the following resources: bilingual term lists, parallel corpora, machine translation (MT), and stemmers. Our CLIR system uses a simple probabilistic language model; the studies used TREC test corpora over Chinese, Spanish and Arabic. Our findings include:
  • One can achieve acceptable CLIR performance using only a bilingual term list (70–80% of monolingual performance on Chinese and Arabic corpora).
  • However, if a bilingual term list and parallel corpora are available, CLIR performance can rival monolingual performance.
  • If no parallel corpus is available, pseudo-parallel texts produced by an MT system can partially overcome the lack of parallel text.
  • While stemming is normally useful, with a very large parallel corpus for Arabic–English, stemming hurt performance in our empirical studies with Arabic, a highly inflected language.
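The listed studies give no code; purely as a hedged illustration of the weakest resource setting above (a bilingual term list alone), the following minimal Python sketch shows dictionary-based query translation, the usual first step of term-list CLIR. The toy term list and the keep-untranslated fallback are assumptions, not the authors' system:

```python
# Minimal sketch of bilingual-term-list query translation for CLIR.
# The term list below is a toy stand-in; the paper's actual resources
# (TREC corpora, probabilistic LM retrieval) are not reproduced here.

# Hypothetical bilingual term list: source term -> possible translations.
TERM_LIST = {
    "economy": ["اقتصاد"],
    "growth": ["نمو", "تطور"],
}

def translate_query(query_terms):
    """Replace each source term with all listed translations;
    untranslatable terms are kept as-is (a common fallback)."""
    translated = []
    for term in query_terms:
        translated.extend(TERM_LIST.get(term, [term]))
    return translated

print(translate_query(["economy", "growth", "TREC"]))
# -> ['اقتصاد', 'نمو', 'تطور', 'TREC']
```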

2.
Semantic representation reflects the meaning of a text as it may be understood by humans, and thus helps facilitate various automated language processing applications. Although semantic representation is very useful for many applications, only a few models have been proposed for the Arabic language. In this context, this paper proposes a graph-based semantic representation model for Arabic text. The proposed model aims to extract the semantic relations between Arabic words, employing several tools and concepts such as dependency relations, part-of-speech tags, named entities, patterns, and predefined Arabic linguistic rules. The core idea of the proposed model is to represent the meaning of an Arabic sentence as a rooted acyclic graph. The textual entailment recognition task is used to evaluate the model's ability to enhance other Arabic NLP applications. The experiments were conducted on a benchmark Arabic textual entailment dataset, ArbTED. The results show that the proposed graph-based model enhances the performance of textual entailment recognition in comparison to baseline models, achieving on average 8.6%, 30.2%, 5.3% and 16.2% improvements in accuracy, recall, precision and F-score, respectively.
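As a rough sketch of the core idea (representing a sentence as a rooted acyclic graph over semantic relations), here is a minimal example using networkx; the dependency triples are hypothetical pre-parsed input, standing in for the paper's Arabic-specific parsing and rules:

```python
import networkx as nx

# Hypothetical pre-parsed dependency triples (head, relation, dependent);
# the paper derives these from Arabic dependency parsing, POS tags,
# named entities and hand-written linguistic rules.
triples = [
    ("wrote", "nsubj", "author"),
    ("wrote", "obj", "book"),
    ("book", "amod", "new"),
]

graph = nx.DiGraph()
for head, relation, dependent in triples:
    graph.add_edge(head, dependent, rel=relation)

# A rooted acyclic graph as in the paper: one root, no cycles.
assert nx.is_directed_acyclic_graph(graph)
roots = [n for n in graph.nodes if graph.in_degree(n) == 0]
print("root:", roots, "| edges:", list(graph.edges(data=True)))
```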

3.
陶卫冬 (Tao Weidong), 《科教文汇》, 2020(14): 170–171, 190
Reading and drawing on English parallel texts not only helps translators in applied translation break free from the constraints of the source-language culture and produce more idiomatic translations, but also helps students in college English courses assess and improve the quality of their language output. However, retrieving ideal English parallel texts on the Web is no easy task. Addressing the current lack of information literacy in college students' English learning, this paper examines search methods, information filtering, and information processing, aiming to help college students improve their ability to retrieve English parallel texts online.

4.
Arabic is a widely spoken language, but few mining tools have been developed to process Arabic text. This paper examines the crime domain in the Arabic language (unstructured text) using text mining techniques. The development and application of a Crime Profiling System (CPS) is presented. The system is able to extract meaningful information, in this case the type of crime, location and nationality, from Arabic-language crime news reports. The system has two unique attributes: firstly, information extraction that depends on local grammar, and secondly, dictionaries that can be automatically generated. It is shown that the CPS improves the quality of the data through reduction, where only meaningful information is retained. Moreover, the Self Organising Map (SOM) approach is adopted to cluster the crime reports by crime type. This clustering step is improved because only refined data containing meaningful keywords, extracted through the information extraction process, are fed into it; i.e., the data are cleansed by removing noise. The proposed system is validated through experiments using a corpus collated from different sources that was not used during system development. Precision, recall and F-measure are used to evaluate the performance of the proposed information extraction approach, and comparisons are conducted with other systems. To evaluate the clustering performance, three parameters are used: data size, loading time and quantization error.
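As a rough illustration of the clustering stage only, here is a minimal, self-contained SOM sketch in NumPy; the grid size, learning schedule, and random feature vectors are all illustrative stand-ins for the CPS's cleansed keyword vectors:

```python
import numpy as np

# Tiny self-organizing map (SOM) sketch; the real CPS feeds cleansed
# keyword vectors extracted from crime reports, not random data.
rng = np.random.default_rng(0)

def train_som(data, grid=(4, 4), epochs=100, lr=0.5, radius=1.5):
    n, dim = data.shape
    weights = rng.random((grid[0], grid[1], dim))
    coords = np.stack(np.meshgrid(range(grid[0]), range(grid[1]),
                                  indexing="ij"), axis=-1)
    for epoch in range(epochs):
        decay = 1 - epoch / epochs          # shrink learning rate and radius
        for x in data[rng.permutation(n)]:
            dist = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(dist.argmin(), dist.shape)  # best unit
            d2 = ((coords - np.array(bmu)) ** 2).sum(-1)       # grid distance
            h = np.exp(-d2 / (2 * (radius * decay + 1e-9) ** 2))
            weights += (lr * decay) * h[..., None] * (x - weights)
    return weights

data = rng.random((50, 8))                  # stand-in report feature vectors
weights = train_som(data)
cells = [np.unravel_index(np.linalg.norm(weights - x, axis=-1).argmin(),
                          weights.shape[:2]) for x in data]
print(cells[:5])                            # grid cell assigned to each report
```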

5.
This paper describes and evaluates various stemming and indexing strategies for the Czech language. Based on a Czech test collection, we designed and evaluated two stemming approaches, a light one and a more aggressive one, and compared them with a no-stemming scheme as well as a language-independent approach (n-grams). To evaluate the suggested solutions we used various IR models, including Okapi, Divergence from Randomness (DFR), a statistical language model (LM), and the classical tf-idf vector-space approach. We found that the Divergence from Randomness paradigm tends to provide better retrieval effectiveness than the Okapi, LM or tf-idf models, although the performance differences were statistically significant only against the last two IR approaches. Ignoring stemming generally reduces MAP by more than 40%, and these differences are always significant. Finally, while our more aggressive stemmer tends to show the best performance, its performance differences from the light stemmer are not statistically significant.
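For readers unfamiliar with the light/aggressive distinction, a light stemmer strips only a small set of frequent inflectional suffixes. The toy sketch below shows the mechanism only; the suffix list is illustrative and is not the authors' Czech rule set:

```python
# Minimal sketch of a "light" suffix-stripping stemmer in the spirit of
# the paper; the suffix inventory is illustrative, not the authors' rules.
SUFFIXES = ["ování", "ích", "ám", "ou", "ům", "y", "e", "í", "a", "u"]

def light_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping a minimum stem length."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

print(light_stem("ženám"))  # -> 'žen'
```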

6.
Sentiment analysis (SA) is a continuing field of research that lies at the intersection of many fields such as data mining, natural language processing and machine learning. It is concerned with the automatic extraction of opinions conveyed in a given text. Due to its vast applications, many studies have been conducted in the area of SA, especially on English texts, while other languages such as Arabic have received less attention. This survey presents a comprehensive overview of the work done so far on Arabic SA (ASA). It groups published papers by the SA-related problems they address and identifies gaps in the current literature, laying a foundation for future studies in this field.

7.
A large body of research has proposed verification techniques for rumors spreading in social media, relying mainly on subjective evidence, e.g., propagation networks or user interactions. Alternatively, in this work we introduce the task of authority finding in social media: for given rumors spreading on Twitter, we aim to find authorities who can help verify them by providing exclusive or convincing evidence that supports or denies those rumors. We release the first test collection for Authority FINding in Arabic Twitter (AuFIN). The collection comprises 150 rumors (expressed in tweets) associated with a total of 1,044 authority accounts and a user collection of 395,231 Twitter accounts (members of 1,192,284 unique Twitter lists). Moreover, we propose a hybrid model that employs pre-trained language models and combines lexical, semantic, and network signals to find authorities. Our experiments show that the textual representation of users is insufficient; incorporating Twitter network features improved the recall of authorities by 34%. Moreover, semantic ranking is inferior to lexical and network-based ranking in terms of precision, but superior in terms of recall; combining the semantic and network-based rankings therefore achieved the best overall performance, with precision of 0.413 and 0.213 at depths 1 and 5, respectively. We show that rumor expansion exploiting knowledge bases improves the recall of authorities by up to 15%. Furthermore, we find that state-of-the-art models for topic expert finding perform poorly at finding authorities. Finally, drawing upon our experiments, we discuss failure factors and make recommendations for future research directions in addressing this task.

8.
Automated summaries help tackle the ever-growing volume of information floating around. There are two broad categories: extracts and abstracts. In the former we retain the more important sentences more or less in their original structure, while the latter requires fusing multiple sentences and/or paraphrasing, a more challenging task than extract summarization. In this paper, we present a novel generic abstractive summarizer for single Arabic documents. The system starts by segmenting the input text topic-wise; each textual segment is then extractively summarized, and finally a rule-based sentence reduction technique is applied. The RST-based extractive summarizer is an enhanced version of the system in Azmi and Al-Thanyyan (2012). By controlling the size of the extract summary of each segment we can cap the size of the final abstractive summary. Both summarizers, the enhanced extractive one and the abstractive one, were evaluated. We tested the enhanced extractive summarizer on the same dataset as the aforementioned paper, using recall, precision and Rouge; the results show noticeable improvement in performance, especially in precision for shorter summaries. The abstractive summarizer was tested on a set of 150 documents, generating summaries of sizes 50%, 40%, 30% and 20% of the original's word count. The results were assessed by two human experts who graded them out of a maximum score of 5. The average score ranged between 4.53 and 1.92 for summaries at different granularities, with shorter summaries receiving the lower scores. The experimental results are encouraging and demonstrate the effectiveness of our approach.

9.
Stemming and lemmatization are important steps in English text processing. This paper uses three clustering algorithms to conduct a fairly comprehensive experiment on two stemming algorithms and one lemmatization algorithm. The results show that both stemming and lemmatization can improve the effectiveness and efficiency of English text clustering, but their impact on the clustering results is not significant. Compared with the Snowball Stemmer and the Stanford Lemmatizer, the Porter Stemmer performs better and more stably in terms of entropy and purity.
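To make the comparison concrete, the snippet below runs NLTK's Porter and Snowball stemmers alongside a lemmatizer; NLTK's WordNet lemmatizer is substituted here purely for illustration, since the paper used the Stanford Lemmatizer:

```python
# Requires: pip install nltk, plus nltk.download("wordnet") for the lemmatizer.
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

words = ["studies", "running", "better"]
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()  # stand-in for the Stanford Lemmatizer

for w in words:
    # Stemmers chop suffixes; the lemmatizer maps to a dictionary form.
    print(w, porter.stem(w), snowball.stem(w), lemmatizer.lemmatize(w, pos="v"))
```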

10.
This paper reports on the underlying IR problems encountered when dealing with the complex morphology and compound constructions found in the Hungarian language. It describes evaluations of two general stemming strategies for this language, and also demonstrates that a light stemming approach can be quite effective. Based on searches done on the CLEF test collection, we find that a more aggressive suffix-stripping approach may produce better MAP; compared to an IR scheme without stemming, or one based on only a light stemmer, the differences are statistically significant. Comparing probabilistic, vector-space and language models, we find that the Okapi model yields the best retrieval effectiveness, with a MAP about 35% better than the classical tf-idf approach, particularly for very short requests. Finally, we demonstrate that applying an automatic decompounding procedure to both queries and documents significantly improves IR performance (+10%) compared to word-based indexing strategies.

11.
This work addresses the information retrieval problem of auto-indexing Arabic documents. Auto-indexing a text document refers to automatically extracting words that are suitable for building an index for the document. In this paper, we propose an auto-indexing method for Arabic text documents based mainly on morphological analysis and on a technique for assigning weights to words. The morphological analysis uses a number of grammatical rules to extract stem words that become candidate index words. The weight assignment technique computes weights for these words relative to the containing document; a word's weight is based on how spread out its occurrences are within the document, and not only on its rate of occurrence. The candidate index words are then sorted in descending order of weight so that information retrievers can select the more important index words. We empirically verify the usefulness of our method using several examples, obtaining an average recall of 46% and an average precision of 64%.
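The abstract does not give the weighting formula; as one hedged illustration of a spread-sensitive weight, the sketch below multiplies a term's occurrence rate by the fraction of document segments it appears in. The exact formula and segment count are assumptions, not the paper's method:

```python
# Sketch of a spread-sensitive term weight: frequency alone is not enough,
# so terms occurring across many parts of the document are rewarded.
# This exact formula is an illustrative assumption, not the paper's.
def spread_weight(term, doc_tokens, n_segments=10):
    seg_len = max(1, len(doc_tokens) // n_segments)
    segments = [doc_tokens[i:i + seg_len]
                for i in range(0, len(doc_tokens), seg_len)]
    tf = doc_tokens.count(term) / len(doc_tokens)            # rate of occurrence
    spread = sum(term in seg for seg in segments) / len(segments)
    return tf * spread

doc = "the index words spread across the whole document index words".split()
print(spread_weight("index", doc, n_segments=3))
```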

12.
A challenge for sentence categorization and novelty mining is to detect not only when text is relevant to the user's information need, but also when it contains something new which the user has not seen before. This involves two tasks: identifying relevant sentences (categorization), and identifying new information within those relevant sentences (novelty mining). Many previous studies of relevant sentence retrieval and novelty mining have been conducted on the English language, but few papers have addressed multilingual sentence categorization and novelty mining. This is an important issue in global business environments, where mining knowledge from text in a single language is not sufficient. In this paper, we perform the first task by categorizing Malay and Chinese sentences and comparing their performance with that of English. Thereafter, we conduct novelty mining to identify the sentences carrying new information. Experimental results on TREC 2004 Novelty Track data show similar categorization performance on Malay and English sentences, both of which greatly outperform Chinese. In the second task, we achieve similar novelty mining results for all three languages, which indicates that our algorithm is suitable for novelty mining of multilingual sentences. In addition, benchmarking our results against novelty mining without categorization shows that categorization is necessary for successful novelty mining.
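As a minimal sketch of the novelty mining stage, the following code flags a relevant sentence as novel when its maximum tf-idf cosine similarity to previously seen sentences falls below a threshold; the threshold, features, and scikit-learn pipeline are illustrative choices, not the paper's algorithm:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def novel_sentences(relevant, threshold=0.7):
    """Return the relevant sentences judged to carry new information."""
    vec = TfidfVectorizer().fit(relevant)
    seen, novel = [], []
    for sent in relevant:
        is_novel = True
        if seen:
            sims = cosine_similarity(vec.transform([sent]),
                                     vec.transform(seen))
            is_novel = sims.max() < threshold  # too similar => nothing new
        seen.append(sent)
        if is_novel:
            novel.append(sent)
    return novel

print(novel_sentences([
    "The flood displaced thousands of residents.",
    "Thousands of residents were displaced by the flood.",
    "Relief agencies opened three new shelters.",
]))
```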

13.
In this paper, we present the first work on unsupervised dialectal Neural Machine Translation (NMT), where the source dialect is not represented in the parallel training corpus. Two systems are proposed for this problem. The first one is the Dialectal to Standard Language Translation (D2SLT) system, which is based on the standard attentional sequence-to-sequence model while introducing two novel ideas leveraging similarities among dialects: using common words as anchor points when learning word embeddings and a decoder scoring mechanism that depends on cosine similarity and language models. The second system is based on the celebrated Google NMT (GNMT) system. We first evaluate these systems in a supervised setting (where the training and testing are done using our parallel corpus of Jordanian dialect and Modern Standard Arabic (MSA)) before going into the unsupervised setting (where we train each system once on a Saudi-MSA parallel corpus and once on an Egyptian-MSA parallel corpus and test them on the Jordanian-MSA parallel corpus). The highest BLEU score obtained in the unsupervised setting is 32.14 (by D2SLT trained on Saudi-MSA data), which is remarkably high compared with the highest BLEU score obtained in the supervised setting, which is 48.25.

14.
Two probabilistic approaches to cross-lingual retrieval are in wide use today: those based on probabilistic models of relevance, as exemplified by INQUERY, and those based on language modeling. INQUERY, as a query-net model, allows the easy incorporation of query operators, including a synonym operator, which has proven extremely useful in cross-language information retrieval (CLIR) in an approach often called structured query translation. In contrast, language models incorporate translation probabilities into a unified framework. We compare the two approaches on Arabic and Spanish data sets, using two kinds of bilingual dictionaries: one derived from a conventional dictionary, and one derived from a parallel corpus. We find that structured query processing gives slightly better results when queries are not expanded; when queries are expanded, language modeling gives better results, but only when using a probabilistic dictionary derived from a parallel corpus. We pursue two additional issues inherent in this comparison: query expansion, and the role of translation probabilities. We compare conventional expansion techniques (pseudo-relevance feedback) with relevance modeling, a newer IR approach that fits into the formal framework of language modeling. We find that relevance modeling and pseudo-relevance feedback achieve comparable levels of retrieval effectiveness, and that good translation probabilities confer a small but significant advantage.
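To illustrate the synonym-operator idea behind structured query translation, the sketch below pools term counts over all dictionary translations of a source term, treating them as a single unit; the toy dictionary and scoring are stand-ins, not INQUERY itself:

```python
from collections import Counter

# Sketch of structured query translation semantics: all dictionary
# translations of a source term are treated as one synonym set, so their
# term statistics are pooled rather than scored independently.
DICT = {"war": ["guerra", "conflicto"], "peace": ["paz"]}  # toy dictionary

def syn_tf(source_term, doc_tokens):
    """Pooled term frequency of a source term's whole synonym set."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in DICT.get(source_term, []))

doc = "la guerra y el conflicto terminaron con la paz".split()
print(syn_tf("war", doc), syn_tf("peace", doc))  # pooled counts: 2 1
```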

15.
吴冰 (Wu Bing), 《大众科技》, 2013(5): 185–187, 189
Grammar is the organizing principle of a language, giving it a structural system, and it plays a pivotal role in foreign language teaching. Although grammar is emphasized and valued, many problems remain in teaching and learning it: grammar explanations are overly detailed, concepts are vague, and students are reluctant to memorize the rules or find them hard to remember. To address this, simple mathematical symbols and knowledge, such as geometric figures, line segments, Arabic numerals, and sketches, can be used appropriately to assist English grammar teaching, so that grammar points can be presented simply, intuitively, and vividly. This simplifies grammar teaching, stimulates students' initiative and enthusiasm for learning English, and reinforces grammar memorization.

16.
This paper presents an algorithm for generating stemmers from text stemmer specification files. A small study shows that the generated stemmers are computationally efficient, often running faster than stemmers custom written to implement particular stemming algorithms. The stemmer specification files are easily written and modified by non-programmers, making it much easier to create a stemmer, or tune a stemmer's performance, than would be the case with a custom stemmer program. Stemmer generation is thus also human-resource efficient.
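As a hedged sketch of the generation idea, the following code compiles a stemmer function from a tiny plain-text rule file; the spec syntax is invented for illustration and differs from the paper's actual format:

```python
# Sketch of generating a stemmer from a plain-text specification: each
# line maps a suffix to a replacement ("-" meaning delete). The format
# is invented for illustration; the paper's actual spec syntax differs.
SPEC = """\
ies -> y
ing -> -
s -> -
"""

def compile_stemmer(spec_text):
    """Parse the spec once and return a ready-to-use stemming function."""
    rules = []
    for line in spec_text.splitlines():
        suffix, _, repl = line.partition("->")
        repl = repl.strip()
        rules.append((suffix.strip(), "" if repl == "-" else repl))
    rules.sort(key=lambda r: len(r[0]), reverse=True)  # longest suffix first

    def stem(word):
        for suffix, repl in rules:
            if word.endswith(suffix):
                return word[: -len(suffix)] + repl
        return word
    return stem

stem = compile_stemmer(SPEC)
print([stem(w) for w in ["parties", "running", "cats"]])
# -> ['party', 'runn', 'cat']
```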

17.
We study the selection of transfer languages for different natural language processing tasks, specifically sentiment analysis, named entity recognition and dependency parsing. To select an optimal transfer language, we propose utilizing linguistic similarity metrics to measure the distance between languages and basing the choice of transfer language on this information instead of relying on intuition. We demonstrate that linguistic similarity correlates with cross-lingual transfer performance for all of the proposed tasks. We also show a statistically significant difference when choosing the optimal language as the transfer source instead of English. This allows us to select a more suitable transfer language for leveraging knowledge from high-resource languages to improve the performance of language applications lacking data. For the study, we used datasets from eight languages belonging to three language families.
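A minimal sketch of distance-based transfer-language selection follows: pick the candidate whose typological feature vector is closest to the target's. The binary feature vectors below are made up for illustration; real work would use URIEL/lang2vec-style features:

```python
import math

# Hypothetical binary typological feature vectors; values are made up.
FEATURES = {
    "eng": [1, 0, 1, 0, 1],
    "deu": [1, 0, 1, 1, 1],
    "nld": [1, 0, 1, 1, 0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def best_transfer(target, candidates):
    """Choose the candidate most similar to the target language."""
    return max(candidates, key=lambda c: cosine(FEATURES[target], FEATURES[c]))

print(best_transfer("nld", ["eng", "deu"]))  # 'deu': closest to Dutch here
```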

18.
Authorship analysis of electronic texts assists digital forensics and anti-terror investigation. Author identification can be seen as a single-label, multi-class text categorization problem. Very often there are extremely few training texts, at least for some of the candidate authors, or there is significant variation in text length among the available training texts of the candidate authors. Moreover, in this task there is usually no similarity between the distribution of training and test texts over the classes; that is, a basic assumption of inductive learning does not apply. In this paper, we present methods to handle imbalanced multi-class textual datasets. The main idea is to segment the training texts into text samples according to the size of the class, thus producing a fairer classification model: minority classes can be segmented into many short samples and majority classes into fewer, longer samples. We explore text sampling methods for constructing a training set with a desirable distribution over the classes; essentially, text sampling provides new synthetic data that artificially increase the training size of a class. Based on two text corpora in two languages, namely newswire stories in English and newspaper reportage in Arabic, we present a series of authorship identification experiments on various multi-class imbalanced cases that reveal the properties of the presented methods.
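As a small illustration of the segmentation idea, the sketch below splits each author's training text into chunks sized so that every author yields a comparable number of samples; the target sample count is an illustrative parameter, not the paper's setting:

```python
# Sketch of class-size-aware text sampling: each author's material is cut
# into roughly target_samples chunks, so a minority author yields many
# short samples and a majority author fewer, longer ones per unit of text.
def resample_author(text, target_samples=10):
    words = text.split()
    size = max(1, len(words) // target_samples)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

minority = "short text from a rarely seen author " * 3
majority = "long and plentiful training material from a prolific author " * 50
print(len(resample_author(minority)), len(resample_author(majority)))
# comparable sample counts, but very different sample lengths
```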

19.
One major approach to information finding on the WWW is to navigate through Web directories, browsing them until the goal pages are found. However, such directories are generally constructed manually and may suffer from narrow coverage and inconsistency. Besides, most existing directories provide only monolingual hierarchies that organize Web pages in terms a user may not be familiar with. In this work, we propose an approach that automatically arranges multilingual Web pages into a multilingual Web directory, to break the language barriers in Web navigation. In this approach, a self-organizing map is trained on each set of monolingual Web pages to obtain two feature maps, which reveal the relationships among Web pages and among thematic keywords, respectively, for that language. We then apply a hierarchy generation process to these maps to obtain the monolingual hierarchy for the Web pages. A hierarchy alignment method is then applied to these monolingual hierarchies to discover the associations between nodes in different hierarchies. Finally, a multilingual Web directory is constructed according to these associations. We applied the proposed approach to a set of Web pages and obtained interesting results that demonstrate the feasibility of our method for multilingual Web navigation.

20.
Cross-lingual semantic interoperability has drawn significant attention in recent digital library and World Wide Web research, as information in languages other than English has grown exponentially. Cross-lingual information retrieval (CLIR) across different European languages, such as English, Spanish, and French, has been widely explored; however, CLIR between European and Oriental languages is still at an initial stage. To cross the language boundary, the corpus-based approach is promising for overcoming the limitations of knowledge-based and controlled-vocabulary approaches, but collecting parallel corpora between a European language and an Oriental language is not an easy task. Length-based and text-based approaches are the two major approaches to aligning parallel documents. In this paper, we investigate several techniques using these approaches and compare their performance in aligning English and Chinese titles of parallel documents available on the Web.
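To illustrate the length-based cue, the sketch below scores candidate Chinese titles for an English title by how close their character-length ratio is to an assumed expected ratio; this is a toy heuristic under stated assumptions, not a full Gale-Church implementation:

```python
# Sketch of the length-based cue for aligning parallel titles: genuinely
# parallel English/Chinese pairs tend to have a stable length ratio, so
# candidates whose ratio deviates most are penalized. The expected ratio
# is an assumed constant chosen for illustration only.
EXPECTED_RATIO = 2.5   # assumed English chars per Chinese char

def length_score(en_title, zh_title):
    ratio = len(en_title) / max(1, len(zh_title))
    return -abs(ratio - EXPECTED_RATIO)   # closer to expected => higher score

en = "Cross-lingual information retrieval"
candidates = ["跨语言信息检索", "天气预报"]
print(max(candidates, key=lambda zh: length_score(en, zh)))
```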

