Similar Documents
Found 9 similar documents.
1.
A Fast Dictionary Mechanism for Chinese Word Segmentation   Cited by 3 (0 self-citations, 3 by others)
A study of the various segmentation mechanisms currently used in Chinese word segmentation shows that the key to fast segmentation lies in recognizing single- and double-character words. Based on this observation, a fast Chinese word segmentation mechanism is proposed: a double-character-word / long-word hash mechanism, which improves Chinese word segmentation by raising the lookup efficiency for single- and double-character words. Experiments show that the mechanism improves the efficiency of Chinese text segmentation.
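A minimal sketch of how such a double-character-word / long-word hash dictionary could be organized (the class names and the long-word fallback strategy are illustrative assumptions, not the paper's actual data structure):

```python
# Hypothetical sketch of a "double-character-word / long-word hash" dictionary:
# one- and two-character words resolve with a single hash probe; words of three
# or more characters are kept in a short per-prefix list. Illustrative only.

class SegmentDictionary:
    def __init__(self, words):
        self.short = set()          # 1- and 2-character words
        self.long = {}              # 2-char prefix -> longer words, longest first
        for w in words:
            if len(w) <= 2:
                self.short.add(w)
            else:
                self.long.setdefault(w[:2], []).append(w)
        for group in self.long.values():
            group.sort(key=len, reverse=True)

    def match(self, text, i):
        """Return the longest dictionary word starting at text[i], else the single character."""
        prefix = text[i:i + 2]
        for w in self.long.get(prefix, []):        # rare case: words of length >= 3
            if text.startswith(w, i):
                return w
        if prefix in self.short:                   # common case: one hash probe
            return prefix
        return text[i]

def segment(text, dictionary):
    tokens, i = [], 0
    while i < len(text):
        w = dictionary.match(text, i)
        tokens.append(w)
        i += len(w)
    return tokens

d = SegmentDictionary(["中文", "分词", "词典", "机制", "分词词典"])
print(segment("中文分词词典机制", d))   # ['中文', '分词词典', '机制']
```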

2.
Chunking is a task which divides a sentence into non-recursive structures. The primary aim is to specify chunk boundaries and classes. Although chunking generally refers to simple chunks, the concept can be customized. A simple chunk is a small structure, such as a noun phrase, while a constituent chunk is a structure that functions as a single unit in a sentence, such as a subject. For an agglutinative language with a rich morphology, constituent chunking is a significantly harder problem than simple chunking. Most Turkish studies on this issue use the IOB tagging schema to mark the boundaries. In this study, we propose a new, simpler tagging schema, namely OE, for constituent chunking in Turkish. "E" represents the rightmost token of a chunk, while "O" stands for all other items. As a counterpart to OE, we also used a schema called OB, where "B" represents the leftmost token of a chunk. We aimed to identify both chunk boundaries and chunk classes using the conditional random fields (CRF) method. The initial motivation was to exploit the fact that Turkish phrases are head-final. In this context, we assumed that marking the end of a chunk (OE) would be more advantageous than marking its beginning (OB). Supporting this assumption, the test results reveal that OB has the worst performance and that OE is a significantly more successful schema in many cases. Especially in long sentences, this contrast is more pronounced. Indeed, using OE simply amounts to marking the head of the phrase (chunk). Since the head and the distinctive label "E" are aligned, CRF finds the chunk class more easily by using the information contained in the head. OE also produced more successful results than the schemas available in the literature. In addition to comparing tagging schemas, we performed four analyses. An examination of window size, a CRF parameter, showed that a value of 3 is adequate. A comparison of evaluation measures for chunking revealed that F-score is more balanced than token accuracy and sentence accuracy. The feature analysis showed that syntactic features improve chunking performance significantly under all conditions; when these features are withdrawn, however, a pronounced difference between OB and OE emerges. Finally, the flexibility analysis shows that OE is more successful across different data.
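For illustration, a small sketch of how chunk-bracketed data could be converted into the OE and OB schemas described above (hypothetical helper code; the Turkish example sentence and class labels are assumptions):

```python
# Convert chunk-bracketed tokens into the OE and OB tagging schemas.
# "E"/"B" carry the chunk class, e.g. E-SUBJECT, so boundaries and
# classes are predicted together, as in the study.

def to_oe(chunks):
    """Mark only the rightmost token of each chunk (head-final: the head)."""
    tags = []
    for tokens, label in chunks:
        tags += ["O"] * (len(tokens) - 1) + [f"E-{label}"]
    return tags

def to_ob(chunks):
    """Mark only the leftmost token of each chunk."""
    tags = []
    for tokens, label in chunks:
        tags += [f"B-{label}"] + ["O"] * (len(tokens) - 1)
    return tags

# "küçük çocuk topu attı" -- "the little child threw the ball" (illustrative)
sentence = [(["küçük", "çocuk"], "SUBJECT"), (["topu"], "OBJECT"), (["attı"], "VERB")]
print(to_oe(sentence))  # ['O', 'E-SUBJECT', 'E-OBJECT', 'E-VERB']
print(to_ob(sentence))  # ['B-SUBJECT', 'O', 'B-OBJECT', 'B-VERB']
```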

3.
With the development of computers in recent years, complex advanced processing can now be performed at high speed. Moreover, natural language processing (NLP) systems draw on a large amount of linguistic knowledge to improve their performance. The need for co-occurrence word information in NLP systems therefore continues to grow, and various studies using such information have been conducted. In NLP, a dictionary is indispensable, because the capability of the entire system is governed by the dictionary's size and quality. This paper describes the importance of co-occurrence word information in NLP systems. A technique for classifying co-occurring words (receiving words) by co-occurrence frequency is described, and the classified groups are expressed hierarchically. The paper also proposes a technique for automatically constructing a complete thesaurus. Experimental operation of the system verifies the effectiveness of the proposed technique.
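As a rough sketch of the frequency-based grouping idea (the tier thresholds and the toy corpus are illustrative assumptions, not the paper's method):

```python
# Collect co-occurrence frequencies from sentences, then group the
# co-occurring ("receiving") words of a headword into frequency tiers,
# which can be read as a shallow hierarchy of collocational strength.
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_counts(sentences):
    counts = defaultdict(Counter)
    for tokens in sentences:
        for a, b in combinations(set(tokens), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def tiered_groups(counter, thresholds=(5, 2)):
    """Bucket receiving words by frequency: tier 0 = strongest collocates."""
    tiers = [[] for _ in range(len(thresholds) + 1)]
    for word, freq in counter.most_common():
        for i, t in enumerate(thresholds):
            if freq >= t:
                tiers[i].append(word)
                break
        else:
            tiers[-1].append(word)
    return tiers

corpus = [["drink", "coffee"], ["drink", "tea"], ["drink", "coffee"],
          ["drink", "coffee", "cup"], ["drink", "water"]] * 2
print(tiered_groups(cooccurrence_counts(corpus)["drink"]))
# [['coffee'], ['tea', 'cup', 'water'], []]
```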

4.
In this paper, we introduce a novel knowledge-based word-sense disambiguation (WSD) system. In particular, the main goal of our research is to find an effective way to filter out unnecessary information by using word similarity. For this, we adopt two methods in our WSD system. First, we propose a novel encoding method for word vector representation by considering the graphical semantic relationships from the lexical knowledge bases, and the word vector representation is utilized to determine the word similarity in our WSD system. Second, we present an effective method for extracting the contextual words from a text for analyzing an ambiguous word based on word similarity. The results demonstrate that the suggested methods significantly enhance the baseline WSD performance in all corpora. In particular, the performance on nouns is similar to that of the state-of-the-art knowledge-based WSD models, and the performance on verbs surpasses that of the existing knowledge-based WSD models.
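A hedged sketch of the second idea, filtering context words by similarity to the ambiguous target; the toy 3-dimensional vectors stand in for the knowledge-base-derived representations, whose details the abstract does not give:

```python
# Keep only the context words most similar to the ambiguous target word,
# so a downstream sense scorer sees less noise. Vectors are toy values.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_context(target, context, vectors, top_k=2):
    """Rank context words by similarity to the target; keep the top_k."""
    scored = [(cosine(vectors[target], vectors[w]), w)
              for w in context if w in vectors]
    return [w for _, w in sorted(scored, reverse=True)[:top_k]]

vectors = {
    "bank":    (0.9, 0.1, 0.3),
    "money":   (0.8, 0.2, 0.1),
    "deposit": (0.7, 0.1, 0.2),
    "river":   (0.1, 0.9, 0.1),
    "walked":  (0.2, 0.3, 0.9),
}
print(filter_context("bank", ["money", "river", "walked", "deposit"], vectors))
# ['deposit', 'money'] -- the distracting context words are filtered out
```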

5.
Discriminative sentence compression with conditional random fields   Cited by 2 (0 self-citations, 2 by others)
The paper focuses on a particular approach to automatic sentence compression that makes use of a discriminative sequence classifier known as Conditional Random Fields (CRF). We devise several features for CRF that allow it to incorporate information on nonlinear relations among words. Along with that, we address the issue of data paucity by collecting data from RSS feeds available on the Internet and turning them into training data for use with CRF, drawing on techniques from biology and information retrieval. We also discuss a recursive application of CRF on the syntactic structure of a sentence as a way of improving the readability of the compression it generates. Experiments found that our approach works reasonably well compared to the state-of-the-art system [Knight, K., & Marcu, D. (2002). Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence 139, 91–107].
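To make the sequence-labeling formulation concrete, here is a toy sketch in which each token receives a keep/drop label and the compression is the kept subsequence; the decision rule is a hypothetical stand-in for the trained CRF, not the paper's model:

```python
# Sentence compression cast as sequence labeling: label each token
# "keep" or "drop", then emit the kept tokens.

def featurize(tokens, i):
    """Per-token features of the kind a CRF could consume."""
    return {
        "word": tokens[i].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

def toy_label(feats):
    # Stand-in decision rule; a real CRF scores label sequences jointly.
    drop_words = {"really", "just", "basically", "(", ")", ","}
    return "drop" if feats["word"] in drop_words else "keep"

def compress(sentence):
    tokens = sentence.split()
    labels = [toy_label(featurize(tokens, i)) for i in range(len(tokens))]
    return " ".join(t for t, l in zip(tokens, labels) if l == "keep")

print(compress("the committee really just approved the basically final budget"))
# -> "the committee approved the final budget"
```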

6.
In Mongolian, two different alphabets are used, Cyrillic and Mongolian. In this paper, we focus solely on the Mongolian language using the Cyrillic alphabet, in which a content word can be inflected when concatenated with one or more suffixes. Identifying the original form of content words is crucial for natural language processing and information retrieval. We propose a lemmatization method for Mongolian. The advantage of our lemmatization method is that it does not rely on noun dictionaries, enabling us to lemmatize out-of-dictionary words. We also apply our method to indexing for information retrieval. We use newspaper articles and technical abstracts in experiments that show the effectiveness of our method. Our research is the first significant exploration of the effectiveness of lemmatization for information retrieval in Mongolian.
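A dictionary-free suffix-stripping sketch in the spirit of the approach (the suffix list, length limits, and example words are illustrative placeholders, not the paper's actual Mongolian suffix inventory or rules):

```python
# Strip known suffixes from the end of a word, longest first, possibly
# repeatedly, without consulting a noun dictionary -- so out-of-dictionary
# words can still be lemmatized.

SUFFIXES = sorted(["ууд", "үүд", "ын", "ийн", "тай", "тэй", "д", "т"],
                  key=len, reverse=True)

def lemmatize(word, min_stem=3, max_strips=2):
    """Peel at most max_strips suffixes, never shortening below min_stem chars."""
    for _ in range(max_strips):
        for suf in SUFFIXES:                      # longest suffix first
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                word = word[: -len(suf)]
                break
        else:
            break                                 # no suffix matched; stop
    return word

print(lemmatize("номууд"))     # 'ном'  ("books" -> "book", illustrative)
print(lemmatize("багшийн"))    # 'багш' (genitive stripped, illustrative)
```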

7.
The paper presents two approaches to interactively refining user search formulations and their evaluation in the new High Accuracy Retrieval from Documents (HARD) track of TREC-12. The first method consists of asking the user to select a number of sentences that represent documents. The second method consists of showing to the user a list of noun phrases extracted from the initial document set. Both methods then expand the query based on the user feedback. The TREC results show that one of the methods is an effective means of interactive query expansion and yields significant performance improvements. The paper presents a comparison of the methods and detailed analysis of the evaluation results.
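A rough sketch of the final step shared by both methods: terms drawn from the user-selected sentences or noun phrases are collected and appended to the query. The frequency weighting and stopword list here are assumptions for illustration:

```python
# Expand a query with the most frequent content terms from user feedback.
from collections import Counter

STOPWORDS = {"the", "a", "of", "in", "and", "to", "is", "for", "by"}

def expand_query(query, selected_texts, top_n=5):
    terms = Counter()
    for text in selected_texts:
        for tok in text.lower().split():
            if tok.isalpha() and tok not in STOPWORDS:
                terms[tok] += 1
    original = set(query.lower().split())
    new_terms = [t for t, _ in terms.most_common() if t not in original][:top_n]
    return query + " " + " ".join(new_terms)

feedback = ["the retrieval model ranks documents by relevance",
            "relevance feedback improves retrieval effectiveness"]
print(expand_query("information retrieval", feedback))
# -> "information retrieval relevance model ranks documents feedback"
```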

8.
Named Entity Recognition (NER) aims to automatically extract specific entities from unstructured text. Compared with English NER, Chinese NER is more challenging in recognizing entity boundaries because there are no explicit delimiters between Chinese characters. However, most previous research focused on character-level semantic information of the Chinese language and ignored the importance of its phonetic characteristics. To address these issues, we integrated phonetic features of Chinese characters with lexicon information to help disambiguate entity boundary recognition, fully exploiting the potential of Chinese as a pictophonetic language. In addition, a novel multi-tagging-scheme learning method is proposed, based on the multi-task learning paradigm, to alleviate the data sparsity and error propagation problems of previous tagging schemes by separately annotating the segmentation information of entities and their corresponding entity types. Extensive experiments performed on four Chinese NER benchmark datasets (OntoNotes4.0, MSRA, Resume, and Weibo) show that our proposed method consistently outperforms the existing state-of-the-art baseline models. Ablation experiments further demonstrate that introducing the phonetic feature and the multi-tagging scheme has a significant positive effect on the Chinese NER task.
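A toy sketch of the two ingredients: attaching a phonetic (pinyin) feature to each character, and splitting a joint tag sequence into separate segmentation and entity-type sequences as in the multi-tagging scheme. The pinyin map and tags are hand-coded placeholders; a real system would use a pinyin library and predict both sequences with a shared encoder:

```python
# (1) Phonetic feature: pair each character with its pinyin.
# (2) Multi-tagging scheme: 'B-LOC' splits into segmentation 'B' plus type 'LOC'.

PINYIN = {"北": "bei", "京": "jing", "在": "zai", "等": "deng", "你": "ni"}

def char_features(sentence):
    return [{"char": c, "pinyin": PINYIN.get(c, "<unk>")} for c in sentence]

def split_tags(joint_tags):
    """Separate a joint BIES-type sequence into two task-specific sequences."""
    seg, typ = [], []
    for t in joint_tags:
        if t == "O":
            seg.append("O")
            typ.append("O")
        else:
            s, label = t.split("-")
            seg.append(s)
            typ.append(label)
    return seg, typ

sentence = "北京在等你"
joint = ["B-LOC", "E-LOC", "O", "O", "O"]
print(char_features(sentence))
print(split_tags(joint))  # (['B', 'E', 'O', 'O', 'O'], ['LOC', 'LOC', 'O', 'O', 'O'])
```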

9.
This article introduces a type of uncertainty that resides in textual information and requires epistemic interpretation on the information seeker’s part. Epistemic modality, as defined in linguistics and natural language processing, is a writer’s estimation of the validity of propositional content in texts. It is an evaluation of chances that a certain hypothetical state of affairs is true, e.g., definitely true or possibly true. This research shifts attention from the uncertainty–certainty dichotomy to a gradient epistemic continuum of absolute, high, moderate, low certainty, and uncertainty. An analysis of a New York Times dataset showed that epistemically modalized statements are pervasive in news discourse and they occur at a significantly higher rate in editorials than in news reports. Four independent annotators were able to recognize a gradation on the continuum but individual perceptions of the boundaries between levels were highly subjective. Stricter annotation instructions and longer coder training improved intercoder agreement results. This paper offers an interdisciplinary bridge between research in linguistics, natural language processing, and information seeking with potential benefits to design and implementation of information systems for situations where large amounts of textual information are screened manually on a regular basis, for instance, by professional intelligence or business analysts.
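As a toy illustration only, the gradient continuum could be operationalized as a lexicon of epistemic markers mapped to certainty levels; the study itself relied on trained human annotators, and the marker lists below are assumptions:

```python
# Place a statement on the certainty continuum via naive substring matching
# against a small lexicon of epistemic markers. Levels checked in order.

LEVELS = [
    ("uncertainty",        ["might", "may", "possibly", "perhaps"]),
    ("low certainty",      ["could", "unlikely", "doubtful"]),
    ("moderate certainty", ["probably", "likely", "appears"]),
    ("high certainty",     ["clearly", "almost certainly", "must"]),
]

def certainty_level(sentence):
    s = sentence.lower()
    for level, markers in LEVELS:
        if any(m in s for m in markers):   # naive; real coding used human judgment
            return level
    return "absolute certainty"            # unmodalized statement

for s in ["The bill will pass.",
          "The bill will probably pass.",
          "The bill might pass."]:
    print(f"{certainty_level(s):20s} <- {s}")
```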
