Similar Documents
1.
Compact graphic display of phrases from the original text is among the abstracting assistance features being prototyped in the TEXNET text network management system. Compaction is achieved by embedding subphrases and by enabling the user to make rapid word-by-word selections. The phrases displayed would not necessarily be those selected for automatic indexing.

2.
A new method is described to extract significant phrases from the title and the abstract of scientific or technical documents. The method is based on a text structure analysis and uses a relatively small dictionary. The dictionary has been constructed from knowledge about concepts in the field of science or technology and some lexical knowledge, since significant phrases and their component items may carry different meanings in different fields. A text analysis approach has been applied to select significant phrases as substantial and semantic carriers of the contents of the abstract. Experiments on five sets of documents have shown that significant phrases are effectively extracted in all cases, and that both the number of phrases per document and the processing time are satisfactory. The information representation of the document, which partly uses the method, is discussed in relation to the construction of a document information retrieval system.

3.
Social emotion refers to the emotion a textual document evokes in the reader. In contrast to the emotion cause extraction task, which analyzes the causes of the author's sentiments based on the expressions in text, identifying the causes of the social emotion evoked in the reader has not been explored previously. Social emotion mining and its cause analysis is not only an important research topic in Web-based social media analytics and text mining but also has applications in multiple domains. Because social emotion cause identification analyzes the causes of reader emotions that are not explicitly or implicitly expressed in the text, it is a challenging task fundamentally different from previous research, and it requires a deeper understanding of the cognitive process underlying the inference of social emotion and its causes. In this paper, we propose the new task of social emotion cause identification (SECI). Inspired by the cognitive structure of emotions (OCC) theory, we present a Cognitive Emotion model Enhanced Sequential (CogEES) method for SECI. Specifically, based on the implications of the OCC model, our method first establishes the correspondence between words/phrases in text and the emotional dimensions identified in OCC, building emotional dimension lexicons with 1,676 distinct words/phrases. It then uses the lexicon information and discourse coherence for semantic segmentation of the document and for enhancing clause representation learning. Finally, it combines text segmentation and clause representation in a sequential model for cause clause prediction. We construct the SECI dataset for this new task and conduct experiments to evaluate CogEES. Our method outperforms the baselines, achieving over 10% F1 improvement on average with better interpretability of the prediction results.

4.
张文萍  黎春兰 《现代情报》2013,33(2):21-23,124
Building on an analysis of existing text representation schemes, this paper proposes a text representation method structured hierarchically by paragraph, sentence, and word: the text space representation model. On top of this model, it develops an algorithm for computing text similarity with the paragraph as the basic unit, in order to detect similar texts. Finally, a test collection is constructed and detection experiments are run on it; the results show that the method finds similar texts effectively.
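To make the paragraph-as-unit idea concrete, here is a minimal Python sketch (not the authors' code): paragraphs are compared by cosine similarity over word counts, and a document-level score averages each paragraph's best match. The tokenization, similarity measure, and scoring scheme are all assumptions for illustration.

```python
# A minimal sketch of paragraph-level similar-text detection.
# The tokenization, cosine measure, and scoring are assumptions,
# not details taken from the paper.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def paragraph_vectors(text: str):
    # Paragraphs are the basic comparison unit, per the abstract.
    return [Counter(p.split()) for p in text.split("\n\n") if p.strip()]

def similar_text_score(doc_a: str, doc_b: str) -> float:
    """Average, over paragraphs of doc_a, of the best cosine match in doc_b."""
    pa, pb = paragraph_vectors(doc_a), paragraph_vectors(doc_b)
    if not pa or not pb:
        return 0.0
    return sum(max(cosine(x, y) for y in pb) for x in pa) / len(pa)

if __name__ == "__main__":
    a = "the cat sat on the mat\n\na quick brown fox"
    b = "the cat sat on a mat\n\nsomething unrelated entirely"
    print(similar_text_score(a, b))  # a high score flags likely similar text
```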

5.
Abstractive summarization aims to generate a concise summary covering salient content from single or multiple text documents. Many recent abstractive summarization methods are built on the transformer model to capture long-range dependencies in the input text and achieve parallelization. In the transformer encoder, calculating attention weights is a crucial step for encoding input documents. Input documents usually contain some key phrases conveying salient information, and it is important to encode these phrases completely. However, existing transformer-based summarization works did not consider key phrases in the input when determining attention weights. Consequently, some of the tokens within key phrases only receive small attention weights, which is not conducive to encoding the semantic information of input documents. In this paper, we introduce some prior knowledge of key phrases into the transformer-based summarization model and guide the model to encode key phrases. For the contextual representation of each token in the key phrase, we assume the tokens within the same key phrase make larger contributions compared with other tokens in the input sequence. Based on this assumption, we propose the Key Phrase Aware Transformer (KPAT), a model with a highlighting mechanism in the encoder that assigns greater attention weights to tokens within key phrases. Specifically, we first extract key phrases from the input document and score the phrases’ importance. Then we build the block diagonal highlighting matrix to indicate these phrases’ importance scores and positions. To combine self-attention weights with key phrases’ importance scores, we design two structures of highlighting attention for each head and the multi-head highlighting attention. Experimental results on two datasets (Multi-News and PubMed) from different summarization tasks and domains show that our KPAT model significantly outperforms advanced summarization baselines. We conduct more experiments to analyze the impact of each part of our model on the summarization performance and verify the effectiveness of our proposed highlighting mechanism.
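A rough sketch of the highlighting idea as described: a block-diagonal matrix carries each key phrase's importance score and is added to the self-attention logits, so tokens inside the same phrase attend to each other more strongly. The additive combination below is an assumption; the paper defines its own two highlighting-attention structures.

```python
# A hedged sketch of KPAT-style highlighting: boost attention logits
# for token pairs inside the same key phrase via a block-diagonal matrix.
# The additive combination is my assumption, not the paper's exact design.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def highlight_matrix(seq_len, phrase_spans, scores):
    """Block-diagonal matrix: H[i, j] = phrase importance if positions i
    and j fall inside the same key phrase, else 0."""
    H = np.zeros((seq_len, seq_len))
    for (start, end), s in zip(phrase_spans, scores):
        H[start:end, start:end] = s
    return H

def highlighted_attention(Q, K, V, H):
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + H   # assumption: additive boost
    return softmax(logits, axis=-1) @ V

rng = np.random.default_rng(0)
L, d = 8, 16
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
H = highlight_matrix(L, phrase_spans=[(1, 3), (5, 8)], scores=[1.0, 0.5])
print(highlighted_attention(Q, K, V, H).shape)  # (8, 16)
```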

6.
In this paper, we propose a common phrase index as an efficient index structure to support phrase queries in a very large text database. Our structure is an extension of previous index structures for phrases and achieves better query efficiency at modest extra storage cost. Further improvement in efficiency can be attained by implementing our index according to our observation of the dynamic nature of the common word set. In experimental evaluation, a common phrase index using 255 common words improves query time by about 11% overall and 62% for large queries (queries of long phrases) over an auxiliary nextword index, at only about 19% extra storage cost. Compared with an inverted index, the improvements are about 72% and 87% for the overall and large queries respectively. We also propose a common phrase index with a dynamic update feature; our experiments show that further time-efficiency gains can be achieved.
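For readers unfamiliar with the baseline being extended, the toy sketch below implements a plain nextword index (the auxiliary structure the common phrase index improves on): each word maps its successors to positions, and a phrase query intersects the pair postings. The dict-of-dicts layout is purely illustrative, not the paper's implementation.

```python
# A toy nextword index: index[word][next_word] -> positions where the
# word pair starts. Real indexes store postings compactly; this layout
# is an illustrative assumption.
from collections import defaultdict

def build_nextword_index(tokens):
    index = defaultdict(lambda: defaultdict(list))
    for i in range(len(tokens) - 1):
        index[tokens[i]][tokens[i + 1]].append(i)
    return index

def phrase_positions(index, phrase):
    """Positions of the whole phrase, by intersecting pair postings."""
    words = phrase.split()
    hits = set(index[words[0]][words[1]])
    for k in range(1, len(words) - 1):
        # shift each later pair's positions back to the phrase start
        hits &= {p - k for p in index[words[k]][words[k + 1]]}
    return sorted(hits)

tokens = "to be or not to be that is the question".split()
idx = build_nextword_index(tokens)
print(phrase_positions(idx, "to be"))      # [0, 4]
print(phrase_positions(idx, "or not to"))  # [2]
```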

7.
The fundamental idea of the work reported here is to extract index phrases from texts with the help of a single-word concept dictionary and a thesaurus containing relations among concepts. The work is based on the fact that, within every phrase, the single words the phrase is composed of are related in a certain well-defined manner, the type of relation holding between concepts depending only on the concepts themselves. Relations can therefore be stored in a semantic network. The algorithm described extracts single-word concepts from texts and combines them into phrases using the semantic relations between these concepts, which are stored in the network. The results obtained show that phrase extraction from texts by this semantic method is possible and offers many advantages over other (purely syntactic or statistical) methods concerning the preciseness and completeness of the meaning representation of the text. But the results also show that some syntactic and morphological “filtering” should be included for reasons of effectiveness.

8.
A Zipfian model of an automatic bibliographic system is developed using parameters describing the contents of its database and its inverted file. The underlying structure of the Zipf distribution is derived, with particular emphasis on its application to word frequencies, especially with regard to the inverted files of an automatic bibliographic system. Andrew Booth developed a form of Zipf's law which estimates the number of words of a particular frequency for a given author and text. His formulation has been adopted as the basis of a model of term dispersion in an inverted file system. The model is also distinctive in its consideration of the proliferation of spelling errors in free text, and in its inclusion of all searchable elements from the system's inverted file. The model is applied to the National Library of Medicine's MEDLINE. It carries implications for determining database storage requirements, search response time, and search exhaustiveness.
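Booth's formulation, as it is commonly cited, says that the number of word types occurring exactly n times, V_n, relates to the number of words occurring once, V_1, by V_n = 2·V_1 / (n(n+1)). The sketch below uses this form to estimate the composition of an inverted file; it illustrates the modeling idea, not the paper's exact MEDLINE model.

```python
# Booth's form of Zipf's law (as commonly cited): the number of word
# types occurring exactly n times, V_n, relates to the number of
# hapaxes V_1 by V_n = 2 * V_1 / (n * (n + 1)). Using it to size an
# inverted file is a sketch of the idea, not the authors' exact model.
def booth_expected_types(v1: int, n: int) -> float:
    """Expected number of word types with frequency exactly n."""
    return 2.0 * v1 / (n * (n + 1))

v1 = 100_000  # suppose the inverted file holds 100k hapax legomena
for n in range(1, 6):
    print(n, round(booth_expected_types(v1, n)))
# n=1 -> 100000, n=2 -> 33333, n=3 -> 16667, ... a rapidly thinning tail
```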

9.
This work aims to extract possible causal relations between noun phrases. Some causal relations are manifested by lexical patterns such as causal verbs and their subcategorization. We use lexical patterns as a filter to find causality candidates, recasting causality extraction as a binary classification problem. To solve the problem, we introduce probabilities that a word pair or a concept pair could be part of a causal noun phrase pair, as well as the probability that a cue phrase marks a causality pattern. These probabilities are learned from a raw corpus in an unsupervised manner. With this probabilistic model, we increase both precision and recall. Our causality extraction achieves an F-score of 77.37%, an improvement of 21.14 percentage points over the baseline model. Long-distance causal relations are extracted with binary-tree-styled cue phrases. We propose an incremental cue phrase learning method based on a cue phrase confidence score measured after each causal classifier learning step. Recall improves by a further 15.37 percentage points after cue phrase learning.
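A hedged sketch of how such learned probabilities might be combined to classify a candidate noun phrase pair: word-pair and cue-phrase probabilities are merged in log-odds space and squashed back to a probability. The independence assumption, the 0.5 priors, and the toy numbers are mine, not the paper's model.

```python
# A hedged sketch of scoring a candidate (cause NP, effect NP) pair by
# combining word-pair and cue-phrase probabilities in log-odds space.
# The naive-independence combination is my assumption; the paper learns
# its own probabilistic model unsupervised from a raw corpus.
import math

def causal_score(np1_words, np2_words, cue, p_word_pair, p_cue):
    # p_word_pair[(w1, w2)]: learned probability that w1 causes w2
    # p_cue[cue]: learned probability that this cue phrase marks causality
    pc = p_cue.get(cue, 0.5)
    log_odds = math.log(pc / (1 - pc + 1e-9))
    for w1 in np1_words:
        for w2 in np2_words:
            p = p_word_pair.get((w1, w2), 0.5)  # 0.5 = uninformative prior
            log_odds += math.log(p / (1 - p + 1e-9))
    return 1 / (1 + math.exp(-log_odds))  # squash back to a probability

p_pair = {("smoking", "cancer"): 0.9}   # invented toy probabilities
p_cue = {"lead to": 0.8}
print(causal_score(["smoking"], ["lung", "cancer"], "lead to", p_pair, p_cue))
```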

10.
Word sense disambiguation (WSD) is meant to assign the most appropriate sense to a polysemous word according to its context. We present a method for automatic WSD using only two resources: a raw text corpus and a machine-readable dictionary (MRD). The system learns a similarity matrix between word pairs from the unlabeled corpus and uses vector representations of sense definitions from the MRD, derived from the similarity matrix. To disambiguate all occurrences of polysemous words in a sentence, the system separately constructs an acyclic weighted digraph (AWD) for every occurrence of a polysemous word, structured by considering the senses of the context words that occur with the target word in the sentence. After building the AWD for each polysemous word, the optimal path of the AWD is found with the Viterbi algorithm, and the target word is assigned the sense on that path. In experiments, our system shows 76.4% accuracy on semantically ambiguous Korean words.
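The Viterbi step can be illustrated with a small sketch: each word contributes a column of sense vectors, edges between consecutive columns are weighted by sense-vector similarity, and the best path assigns one sense per word. The cosine edge weight and random vectors below are assumptions standing in for the MRD-derived sense definitions.

```python
# A minimal Viterbi pass over sense choices for consecutive polysemous
# words, in the spirit of the acyclic weighted digraph described above.
# The cosine edge weight and toy data are assumptions for illustration.
import numpy as np

def viterbi_senses(sense_vectors):
    """sense_vectors: list over words; each entry is an array of shape
    (num_senses, dim) holding that word's sense-definition vectors.
    Returns one sense index per word on the best-scoring path."""
    def sim(a, b):  # cosine similarity as the edge weight (assumption)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = [np.zeros(len(sense_vectors[0]))]  # uniform start
    back = []
    for t in range(1, len(sense_vectors)):
        prev, cur = sense_vectors[t - 1], sense_vectors[t]
        s = np.empty(len(cur)); b = np.empty(len(cur), dtype=int)
        for j, vj in enumerate(cur):
            cand = [scores[-1][i] + sim(vi, vj) for i, vi in enumerate(prev)]
            b[j] = int(np.argmax(cand)); s[j] = cand[b[j]]
        scores.append(s); back.append(b)
    path = [int(np.argmax(scores[-1]))]   # best final sense
    for b in reversed(back):              # follow back-pointers
        path.append(int(b[path[-1]]))
    return path[::-1]

rng = np.random.default_rng(1)
words = [rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(2, 4))]
print(viterbi_senses(words))  # one sense index per word, e.g. [1, 0, 1]
```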

11.
This work addresses the information retrieval problem of auto-indexing Arabic documents. Auto-indexing a text document refers to automatically extracting words that are suitable for building an index for the document. In this paper, we propose an auto-indexing method for Arabic text documents. The method is based mainly on morphological analysis and on a technique for assigning weights to words. The morphological analysis uses a number of grammatical rules to extract stem words that become candidate index words. The weight assignment technique computes weights for these words relative to the containing document; the weight is based on how widely the word is spread across the document, not only on its rate of occurrence. The candidate index words are then sorted in descending order of weight so that information retrievers can select the more important ones. We empirically verify the usefulness of our method using several examples, obtaining an average recall of 46% and an average precision of 64%.
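One plausible reading of a "spread-aware" weight is sketched below: a stem's frequency is scaled by the fraction of document regions its occurrences cover. The exact formula (document tenths, multiplicative combination) is my assumption, not the paper's weighting scheme.

```python
# A hedged sketch of a spread-aware term weight: frequency scaled by how
# evenly a candidate index word's occurrences cover the document. The
# formula is an assumption for illustration, not the paper's.
def spread_weight(positions, doc_len, bins=10):
    """positions: token offsets of a candidate index word in the document."""
    if not positions or doc_len == 0:
        return 0.0
    covered = {min(p * bins // doc_len, bins - 1) for p in positions}
    spread = len(covered) / bins    # fraction of document regions hit
    return len(positions) * spread  # frequency scaled by dispersion

# a word occurring 4 times clustered at the start vs. spread throughout
print(spread_weight([1, 2, 3, 4], doc_len=1000))         # 4 * 0.1 = 0.4
print(spread_weight([10, 300, 600, 900], doc_len=1000))  # 4 * 0.4 = 1.6
```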

12.
Conglomerates as a general framework for informetric research
We introduce conglomerates as a general framework for informetric (and other) research. A conglomerate consists of two collections, a finite source collection and a pool, and two mappings, a source-item map and a magnitude map. The ratio of the sum of the magnitudes of all item-sets to the number of elements in the source collection is called the conglomerate ratio. It is a kind of average, generalizing the notion of an impact factor. The source-item relation of a conglomerate leads to a list of sources ranked according to the magnitude of their corresponding item-sets. This list, called a Zipf list, is the basic ingredient for all considerations related to power laws and Lotkaian or Zipfian informetrics. Examples where this framework applies include impact factors (including web impact factors), Bradford–Lotka type bibliographies, first-citation studies, word use, diffusion factors, elections, and even bestseller lists.
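In symbols, with source collection S, source-item map I, and magnitude map m, the conglomerate ratio is (Σ over s in S of m(I(s))) / |S|. The small sketch below instantiates it with invented citation data, where it reduces to an impact-factor-style average.

```python
# The conglomerate ratio as defined above: the sum of the magnitudes of
# all item-sets divided by the number of sources. With sources = papers,
# items = citations, and magnitude = set size, this reduces to an
# impact-factor-like average. The data below is invented for illustration.
def conglomerate_ratio(sources, items_of, magnitude):
    """sources: finite source collection; items_of: source -> item set;
    magnitude: item set -> number."""
    return sum(magnitude(items_of(s)) for s in sources) / len(sources)

citations = {"paper_a": {"c1", "c2", "c3"}, "paper_b": {"c4"}, "paper_c": set()}
ratio = conglomerate_ratio(
    sources=list(citations),
    items_of=citations.get,
    magnitude=len,  # magnitude map: size of the item-set
)
print(ratio)  # (3 + 1 + 0) / 3 = 1.33... -- an impact-factor-style average
```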

13.
Oriented toward core dependency parsing, this work combines rule-based and statistical methods to identify, given a specified predicate, the prepositional phrases governed by that predicate. First, collocation templates are extracted from the regularities of prepositions and the right boundaries of prepositional phrases; collocation relations are then automatically extracted from a training corpus and used, under a collocation strategy, to recognize prepositional phrases. The remaining prepositional phrases are recognized with a technique combining a part-of-speech-based boundary selection model and rule-based methods.
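A toy sketch of the collocation-template step: each preposition is paired with learned right-boundary words, and the span from the preposition to the nearest boundary is marked as a candidate prepositional phrase. The template format and the example are assumptions for illustration.

```python
# A hedged sketch of collocation-template PP recognition: pair each
# preposition with right-boundary words learned from training data, then
# mark the span from the preposition to the nearest boundary as a
# candidate PP. Template format and toy data are my assumptions.
def find_pp_spans(tokens, prepositions, right_boundaries):
    spans = []
    for i, tok in enumerate(tokens):
        if tok in prepositions:
            for j in range(i + 1, len(tokens)):
                if (tok, tokens[j]) in right_boundaries:
                    spans.append((i, j))  # PP = tokens[i..j] inclusive
                    break
    return spans

# Chinese example: "在 桌子 上 放 一本 书" -- "put a book on the table"
tokens = "在 桌子 上 放 一本 书".split()
preps = {"在"}
boundaries = {("在", "上")}  # learned collocation: 在 ... 上
print(find_pp_spans(tokens, preps, boundaries))  # [(0, 2)] -> 在 桌子 上
```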

14.
The use of natural language information can improve decision-making. Darwinian considerations suggest that language may have developed because it leads to improved decision-making and survival, justifying the study of language's contribution to decision-making. Studying information-based decision-making in the context of evolution yields a view of information use that allows us both to describe the phenomenon and to explain why information is used as it is. Improving information retrieval performance with phrases and part-of-speech (POS) information is one example of decision-making performance that improves when this linguistic information is used. By studying a set of phrases used in a text retrieval system, we show the relative effectiveness of multi-term phrases as opposed to individual terms, as well as the relative worth of POS-tagged terms or phrases as opposed to untagged ones. An explanation is suggested for why POS tags contribute less to higher-order grammatical constructs. We propose a measure of those needs for POS disambiguation that can be addressed by tagging; some example terms are analyzed with this measure, and specific degrees of ambiguity are proposed.

15.
We consider a challenging clustering task: clustering multi-word terms without document co-occurrence information in order to form coherent groups of topics. For this task, we developed a methodology that takes as input multi-word terms and lexico-syntactic relations between them. Our clustering algorithm, named CPCL, is implemented in the TermWatch system. We compared CPCL with existing hierarchical and partitioning algorithms (k-means, k-medoids). This out-of-context clustering task led us to adapt the multi-word term representation for statistical methods and to refine an existing cluster evaluation metric, the editing distance, in order to evaluate the methods. Evaluation was carried out on a list of multi-word terms from the genomic field that comes with a hand-built taxonomy. Results showed that while k-means and k-medoids obtained good scores on the editing distance, they were very sensitive to term length. CPCL, on the other hand, obtained a better cluster homogeneity score and was less sensitive to term length. CPCL also showed good adaptability for handling very large and sparse matrices.

16.
In this paper, we propose a new learning method for extracting bilingual word pairs from parallel corpora in various languages. In cross-language information retrieval, the system must deal with various languages, so automatic extraction of bilingual word pairs from parallel corpora in various languages is important. However, previous work based on statistical methods is insufficient because of the sparse data problem. Our learning method automatically acquires rules that are effective in solving the sparse data problem, using only parallel corpora and no prior bilingual resource (e.g., a bilingual dictionary or a machine translation system). We call this learning method Inductive Chain Learning (ICL). Moreover, a system using ICL can extract bilingual word pairs even from bilingual sentence pairs whose source-language grammatical structure differs from that of the target language, because the acquired rules carry the information needed to cope with different word orders in local parts of bilingual sentence pairs. Evaluation experiments demonstrated that the recall of systems based on several statistical approaches improved through the use of ICL.

17.
This study attempted to use semantic relations expressed in text, in particular cause-effect relations, to improve information retrieval effectiveness. It investigated whether the information obtained by matching cause-effect relations expressed in documents with those expressed in users' queries can improve document retrieval results, compared with keyword matching that does not consider relations. An automatic method for identifying and extracting cause-effect information in Wall Street Journal text was developed. Causal relation matching was found to yield a small but significant improvement in retrieval results when the weights used for combining the scores from different types of matching were customized for each query. Causal relation matching did not perform better than word proximity matching (i.e., matching pairs of causally related words in the query with pairs of words that co-occur within document sentences), but the best results were obtained when causal relation matching was combined with word proximity matching. The most effective kind of causal relation matching was one in which one member of the causal relation (either the cause or the effect) was represented as a wildcard that could match any word.
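The best-performing variant described, wildcard causal matching, is easy to sketch: a query relation matches a document relation when its non-wildcard side matches. The data structures below are illustrative assumptions, not the study's implementation.

```python
# A hedged sketch of wildcard causal matching: one side of the query's
# (cause, effect) relation is a wildcard that matches any word. The
# relation representation here is an assumption for illustration.
WILDCARD = "*"

def causal_match(query_rel, doc_rels):
    qc, qe = query_rel
    hits = []
    for dc, de in doc_rels:
        cause_ok = qc == WILDCARD or qc == dc
        effect_ok = qe == WILDCARD or qe == de
        if cause_ok and effect_ok:
            hits.append((dc, de))
    return hits

doc_relations = [("smoking", "cancer"), ("rain", "flooding")]
print(causal_match(("*", "cancer"), doc_relations))  # anything causing cancer
print(causal_match(("rain", "*"), doc_relations))    # any effect of rain
```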

18.
Latent Semantic Indexing (LSI) uses the singular value decomposition to reduce noisy dimensions and improve the performance of text retrieval systems. Preliminary results have shown modest improvements in retrieval accuracy and recall, but mainly on small collections. In this paper we investigate text retrieval on a larger document collection (TREC) and focus on the distribution of word norms (magnitudes). Our results indicate the inadequacy of word representations in LSI space on large collections. We emphasize the query expansion interpretation of LSI and propose an LSI term normalization that achieves better performance on larger collections.
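A minimal numpy sketch of the proposed fix: compute standard LSI term coordinates from a truncated SVD, then normalize each word vector to unit norm so that low-norm words are not drowned out. The toy matrix and rank are assumptions; the paper's normalization may differ in detail.

```python
# A hedged sketch: plain LSI via truncated SVD, then normalizing term
# (word) vectors to unit norm, in the spirit of the proposed fix.
# Matrix contents and rank k are toy assumptions.
import numpy as np

A = np.array([[2., 0., 1., 0.],   # term-document matrix (terms x docs)
              [1., 1., 0., 0.],
              [0., 3., 0., 1.],
              [0., 0., 1., 1.]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vecs = U[:, :k] * s[:k]               # standard LSI term coordinates

norms = np.linalg.norm(term_vecs, axis=1, keepdims=True)
term_vecs_normalized = term_vecs / norms   # proposed: unit-norm word vectors

print(np.linalg.norm(term_vecs, axis=1))             # uneven word norms
print(np.linalg.norm(term_vecs_normalized, axis=1))  # all 1.0 after the fix
```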

19.
Named entity recognition aims to detect pre-determined entity types in unstructured text. There are few studies on this task for low-resource languages such as Turkish. We provide a comprehensive study of Turkish named entity recognition by comparing the performance of existing state-of-the-art models on datasets from varying domains, to understand their generalization capability and to analyze why such models fail or succeed at this task. Our experimental results, supported by statistical tests, show that the highest weighted F1 scores are obtained by Transformer-based language models, ranging from 80.8% on tweets to 96.1% on news articles. We find that Transformer-based language models are more robust than traditional models to entity types with small sample sizes and to longer named entities, yet all models perform poorly on longer named entities in social media. Moreover, when we shuffle 80% of the words in a sentence to imitate Turkish's flexible word order, we observe a larger performance drop in well-written texts (12%) than in noisy text (7%).
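The word-order perturbation is simple to reproduce in spirit: pick a random 80% of a sentence's tokens and shuffle them in place. The bookkeeping below is an assumption about the procedure, not the authors' exact code.

```python
# A hedged sketch of the perturbation described above: shuffle a random
# 80% of the tokens in a sentence to imitate Turkish's flexible word
# order. The sampling details are my assumptions.
import random

def shuffle_80_percent(tokens, seed=0):
    rng = random.Random(seed)
    n = max(1, int(0.8 * len(tokens)))
    idx = rng.sample(range(len(tokens)), n)  # positions to disturb
    picked = [tokens[i] for i in idx]
    rng.shuffle(picked)
    out = list(tokens)
    for i, tok in zip(idx, picked):
        out[i] = tok
    return out

sent = "Ali yarın Ankara'ya trenle gidecek".split()
print(shuffle_80_percent(sent))  # same words, perturbed order
```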
