Similar Records
20 similar records retrieved (search time: 62 ms)
1.
The main purpose of topic detection and tracking (TDT) is to detect, group, and organize newspaper articles reporting on the same event. Since an event is an occurrence reported at a specific time and place, together with its unavoidable consequences, TDT can benefit from explicit use of time and place information. In this work, we focused on place information, using time information as in previous research. News articles were analyzed for the characteristics of their place information, and a new topic tracking method was proposed to incorporate the results of this analysis. Experiments show that appropriate use of place information automatically extracted from news articles indeed helps event tracking, i.e., identifying news articles that report on the same events.
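The place-aware tracking idea can be sketched minimally as follows. This is an illustrative vector-space tracker in which place terms receive extra weight; the boost factor `PLACE_WEIGHT` and the similarity threshold are assumptions for illustration, not the paper's actual method or parameters:

```python
import math
from collections import Counter

PLACE_WEIGHT = 2.0  # assumed boost factor for place terms (illustrative)

def vectorize(tokens, place_terms):
    # Build a term-frequency vector, weighting place terms more heavily.
    vec = Counter()
    for t in tokens:
        vec[t] += PLACE_WEIGHT if t in place_terms else 1.0
    return vec

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def tracks_event(story_tokens, event_tokens, place_terms, threshold=0.2):
    # A story is on-topic if its place-weighted vector is close enough
    # to the event's vector.
    return cosine(vectorize(story_tokens, place_terms),
                  vectorize(event_tokens, place_terms)) >= threshold
```

With a shared place term such as a city name, an on-topic story scores well above an unrelated one, which is the effect the paper attributes to place information.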

2.
The sheer volume of online media information makes it difficult for people to grasp a topic comprehensively in limited time, so important information is easily missed. Topic detection and tracking technology arose to meet this need: it quickly and accurately extracts content of interest from vast collections of information. In recent years, topic detection and tracking has become a popular research direction in natural language processing. It can organize large amounts of information effectively, mine useful information from it, and present all the details of an event or phenomenon, and their interrelations, in a concise and effective way. This paper surveys the research background, key concepts, evaluation methods, and related techniques of topic tracking, and summarizes the current state of the art.

3.
In this paper, we present a topic discovery system aimed at revealing the implicit knowledge present in news streams. This knowledge is expressed as a hierarchy of topics and subtopics, where each topic contains the set of documents related to it and a summary extracted from those documents. Summaries so built are useful for browsing and selecting topics of interest from the generated hierarchies. Our proposal consists of a new incremental hierarchical clustering algorithm, which combines partitional and agglomerative approaches, taking the main benefits of each. Finally, a new summarization method based on Testor Theory is proposed to build the topic summaries. Experimental results on the TDT2 collection demonstrate its usefulness and effectiveness not only as a topic detection system, but also as a classification and summarization tool.

4.
Warning: This paper contains abusive samples that may cause discomfort to readers.

Abusive language on social media reinforces prejudice against an individual or a specific group of people, which greatly hampers freedom of expression. With the rise of large-scale pre-trained language models, classification based on pre-trained language models has gradually become a paradigm for automatic abusive language detection. However, the effect of stereotypes inherent in language models on the detection of abusive language remains unknown, although it may further reinforce biases against minorities. To this end, in this paper, we use multiple metrics to measure the presence of bias in language models and analyze the impact of these inherent biases on automatic abusive language detection. On the basis of this quantitative analysis, we propose two different debiasing strategies, token debiasing and sentence debiasing, which are jointly applied to reduce the bias of language models in abusive language detection without degrading classification performance. Specifically, for the token debiasing strategy, we reduce the discrimination of the language model against protected attribute terms of a certain group by random probability estimation. For the sentence debiasing strategy, we replace protected attribute terms and augment the original text by counterfactual augmentation to obtain debiased samples, and use consistency regularization between the original data and the augmented samples to eliminate bias at the sentence level of the language model. The experimental results confirm that our method can not only reduce the bias of the language model in the abusive language detection task, but also effectively improve the performance of abusive language detection.
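The counterfactual-augmentation step of the sentence debiasing strategy can be sketched as a simple term-swapping pass. The attribute-term pairs below are illustrative placeholders, not the paper's actual protected-attribute lexicon, and a real system would apply a consistency loss over each (original, counterfactual) pair:

```python
# Illustrative protected-attribute term pairs (not the paper's lexicon).
COUNTERFACTUAL_PAIRS = {"he": "she", "she": "he", "man": "woman", "woman": "man"}

def counterfactual_augment(tokens):
    """Return the counterfactual version of a token list, swapping
    each protected attribute term for its counterpart."""
    return [COUNTERFACTUAL_PAIRS.get(t, t) for t in tokens]

def augment_batch(sentences):
    # Pair each original with its counterfactual; a consistency
    # regularizer would then penalize prediction differences
    # between the two members of each pair.
    return [(s, counterfactual_augment(s)) for s in sentences]
```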

5.
周莉  王珏 《科技广场》2008,(3):75-76
From the perspective of database design, this paper conducts an in-depth study of constraints on XML data and proposes XML functional dependencies based on path expressions and a tree model. For technologies that use XML to exchange data among heterogeneous databases, this work not only provides theoretical guidance but also has practical significance and value in real applications.

6.
Centroid-based summarization of multiple documents
We present a multi-document summarizer, MEAD, which generates summaries using cluster centroids produced by a topic detection and tracking system. We describe two new techniques: a centroid-based summarizer and an evaluation scheme based on sentence utility and subsumption. We have applied this evaluation to both single- and multiple-document summaries. Finally, we describe two user studies that test our models of multi-document summarization.
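The core centroid idea can be sketched minimally: treat the centroid as the average term-frequency vector of a cluster and score each sentence by the centroid values of its words. This is a simplified assumption-laden sketch, not MEAD's full feature set (MEAD also uses, e.g., positional and overlap features):

```python
from collections import Counter

def centroid(cluster_sentences):
    # Centroid = average term frequency across the cluster's sentences.
    counts = Counter()
    for sent in cluster_sentences:
        counts.update(sent)
    n = len(cluster_sentences)
    return {term: freq / n for term, freq in counts.items()}

def centroid_score(sentence, cent):
    # A sentence scores highly if it contains many central terms.
    return sum(cent.get(term, 0.0) for term in sentence)

def summarize(cluster_sentences, k=1):
    # Pick the k sentences closest to the cluster centroid.
    cent = centroid(cluster_sentences)
    ranked = sorted(cluster_sentences,
                    key=lambda s: centroid_score(s, cent),
                    reverse=True)
    return ranked[:k]
```

Sentences sharing the cluster's dominant terms rank first, which is the intuition behind centroid-based extraction.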

7.
Within the context of Information Extraction (IE), relation extraction is oriented towards identifying a variety of relation phrases and their arguments in arbitrary sentences. In this paper, we present a clause-based framework for information extraction in textual documents. Our framework focuses on two important challenges in information extraction: 1) Open Information Extraction (OIE), and 2) Relation Extraction (RE). In the plethora of research that focuses on the use of syntactic and dependency parsing for the purpose of detecting relations, there has been increasing evidence of incoherent and uninformative extractions. The extracted relations may even be erroneous at times and fail to provide a meaningful interpretation. In our work, we use the English clause structure and clause types in an effort to generate propositions that can be deemed extractable relations. Moreover, we propose refinements to the grammatical structure of syntactic and dependency parsing that help reduce the number of incoherent and uninformative extractions from clauses. In our experiments in both the open information extraction and relation extraction domains, we carefully evaluate our system on various benchmark datasets and compare its performance against existing state-of-the-art information extraction systems. Our work shows improved performance compared to the state-of-the-art techniques.

8.
Probabilistic topic models are unsupervised generative models which model document content as a two-step generation process: documents are observed as mixtures of latent concepts or topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested into transferring the probabilistic topic modeling concept from monolingual to multilingual settings. Novel topic models have been designed to work with parallel and comparable texts. We define multilingual probabilistic topic modeling (MuPTM) and present the first full overview of the current research, methodology, advantages and limitations in MuPTM. As a representative example, we choose a natural extension of the omnipresent LDA model to multilingual settings called bilingual LDA (BiLDA). We provide a thorough overview of this representative multilingual model from its high-level modeling assumptions down to its mathematical foundations. We demonstrate how to use the data representation by means of output sets of (i) per-topic word distributions and (ii) per-document topic distributions coming from a multilingual probabilistic topic model in various real-life cross-lingual tasks involving different languages, without any external language-pair-dependent translation resource: (1) cross-lingual event-centered news clustering, (2) cross-lingual document classification, (3) cross-lingual semantic similarity, and (4) cross-lingual information retrieval. We also briefly review several other applications present in the relevant literature, and introduce and illustrate two related modeling concepts: topic smoothing and topic pruning. In summary, this article encompasses the current research in multilingual probabilistic topic modeling. By presenting a series of potential applications, we reveal the importance of the language-independent and language-pair-independent data representations obtained by means of MuPTM. We provide clear directions for future research in the field by giving a systematic overview of how to link and transfer aspect knowledge across corpora written in different languages via the shared space of latent cross-lingual topics, that is, how to effectively employ the learned per-topic word distributions and per-document topic distributions of any multilingual probabilistic topic model in various cross-lingual applications.

9.
We deal with the task of authorship attribution, i.e., identifying the author of an unknown document, proposing the use of Part-of-Speech (POS) tags as features for language modeling. The experimentation is carried out on corpora untypical for the task, i.e., documents edited by non-professional writers, such as movie reviews or tweets. The former corpus is homogeneous with respect to topic, making the task more challenging. The latter corpus puts language models into a framework of a continuously and fast-evolving language, unique and noisy writing styles, and the limited length of social media messages. While we find that language models based on POS tags are competitive on only one of the corpora (movie reviews), they generally provide efficiency benefits and robustness against data sparsity. Furthermore, we experiment with model fusion, where language models based on different modalities are combined. By linearly combining three language models, based on characters, words, and POS trigrams, respectively, we achieve the best generalization accuracy of 96% on movie reviews, while the combination of language models based on characters and POS trigrams provides 54% accuracy on the Twitter corpus. In fusion, POS language models prove to be essential and effective components.
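The linear fusion step can be sketched as a weighted interpolation of per-model probabilities. The weights and probability values below are illustrative assumptions; the abstract does not report the interpolation weights actually used:

```python
def interpolate(probs, weights):
    """Linearly combine per-modality model probabilities for one author.
    probs and weights are parallel lists, e.g. [char, word, POS]."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(w * p for w, p in zip(weights, probs))

def attribute(author_model_probs, weights):
    # author_model_probs maps each candidate author to the probabilities
    # assigned to the test text by that author's per-modality models.
    # The predicted author is the one with the highest fused score.
    return max(author_model_probs,
               key=lambda a: interpolate(author_model_probs[a], weights))
```

For example, an author whose POS-trigram model fits the text well can win the fused decision even if the character model slightly favors another candidate, which is the complementarity the paper exploits.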

10.
Language modeling (LM), providing a principled mechanism to associate quantitative scores to sequences of words or tokens, has long been an interesting yet challenging problem in the field of speech and language processing. The n-gram model is still the predominant method, while a number of disparate LM methods, exploring either lexical co-occurrence or topic cues, have been developed to complement the n-gram model with some success. In this paper, we explore a novel language modeling framework built on top of the notion of relevance for speech recognition, where the relationship between a search history and the word being predicted is discovered through different granularities of semantic context for relevance modeling. Empirical experiments on a large vocabulary continuous speech recognition (LVCSR) task seem to demonstrate that the various language models deduced from our framework are very comparable to existing language models both in terms of perplexity and recognition error rate reductions.

11.
Semantic representation reflects the meaning of a text as it may be understood by humans, and thus contributes to facilitating various automated language processing applications. Although semantic representation is very useful for several applications, few models have been proposed for the Arabic language. In that context, this paper proposes a graph-based semantic representation model for Arabic text. The proposed model aims to extract the semantic relations between Arabic words. Several tools and concepts have been employed, such as dependency relations, part-of-speech tags, named entities, patterns, and predefined linguistic rules of the Arabic language. The core idea of the proposed model is to represent the meaning of Arabic sentences as a rooted acyclic graph. The textual entailment recognition challenge is considered in order to evaluate the ability of the proposed model to enhance other Arabic NLP applications. The experiments have been conducted using a benchmark Arabic textual entailment dataset, namely ArbTED. The results show that the proposed graph-based model is able to enhance the performance of the textual entailment recognition task in comparison to other baseline models. On average, the proposed model achieved improvements of 8.6%, 30.2%, 5.3%, and 16.2% in terms of accuracy, recall, precision, and F-score, respectively.

12.
Hate speech is an increasingly important societal issue in the era of digital communication. Hateful expressions often make use of figurative language and, although they represent, in some sense, the dark side of language, they are also often prime examples of creative use of language. While hate speech is a global phenomenon, current studies on automatic hate speech detection are typically framed in a monolingual setting. In this work, we explore hate speech detection in low-resource languages by transferring knowledge from a resource-rich language, English, in a zero-shot learning fashion. We experiment with traditional and recent neural architectures, and propose two joint-learning models, using different multilingual language representations to transfer knowledge between pairs of languages. We also evaluate the impact of additional knowledge in our experiments by incorporating information from a multilingual lexicon of abusive words. The results show that our joint-learning models achieve the best performance on most languages. However, a simple approach that uses machine translation and a pre-trained English language model achieves robust performance. In contrast, Multilingual BERT fails to obtain good performance in cross-lingual hate speech detection. We also experimentally found that the external knowledge from a multilingual abusive lexicon is able to improve the models' performance, specifically in detecting the positive class. The results of our experimental evaluation highlight a number of challenges and issues in this particular task. One of the main challenges concerns current benchmarks for hate speech detection, in particular how bias related to the topical focus of the datasets influences classification performance. The insufficient ability of current multilingual language models to transfer knowledge between languages in the specific task of hate speech detection also remains an open problem. However, our experimental evaluation and our qualitative analysis show how the explicit integration of linguistic knowledge from a structured abusive language lexicon helps to alleviate this issue.

13.
14.
Conceptual metaphor detection is a well-researched topic in Natural Language Processing. At the same time, conceptual metaphor use analysis produces unique insight into individual psychological processes and characteristics, as demonstrated by research in cognitive psychology. Despite the fact that state-of-the-art language models allow for highly effective automatic detection of conceptual metaphor in benchmark datasets, these models have never been applied to psychological tasks. The benchmark datasets differ substantially from experimental texts recorded or produced in a psychological setting, in their domain, genre, and the scope of metaphoric expressions covered. We present the first experiment to apply NLP metaphor detection methods to a psychological task, specifically, analyzing individual differences. For that, we annotate MetPersonality, a dataset of Russian texts written in a psychological experiment setting, with conceptual metaphor. With a widely used conceptual metaphor annotation procedure, we obtain low annotation quality, which arises from dataset characteristics uncommon in typical automatic metaphor detection tasks. We suggest a novel conceptual metaphor annotation procedure to mitigate these issues, increasing the inter-annotator agreement to a moderately high level. We leverage the annotated dataset and existing metaphor datasets in Russian to select, train, and evaluate state-of-the-art metaphor detection models, obtaining acceptable results on the metaphor detection task. In turn, the most effective model is used to detect conceptual metaphor automatically in RusPersonality, a larger dataset containing meta-information on the psychological traits of the participant authors. Finally, we analyze correlations of automatically detected metaphor use with psychological traits encoded in the Freiburg Personality Inventory (FPI). Our pioneering work on automatically detected metaphor use and individual differences demonstrates the possibility of unprecedented large-scale research on the relation between metaphor use and personality traits, dispositions, and cognitive and emotional processing.

15.
This paper presents a probabilistic information retrieval framework in which the retrieval problem is formally treated as a statistical decision problem. In this framework, queries and documents are modeled using statistical language models, user preferences are modeled through loss functions, and retrieval is cast as a risk minimization problem. We discuss how this framework can unify existing retrieval models and accommodate systematic development of new retrieval models. As an example of using the framework to model non-traditional retrieval problems, we derive retrieval models for subtopic retrieval, which is concerned with retrieving documents to cover many different subtopics of a general query topic. These new models differ from traditional retrieval models in that they relax the traditional assumption of independent relevance of documents.
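A standard special case of this language-modeling framework is query-likelihood retrieval with a smoothed document model. The sketch below uses Dirichlet smoothing as an illustrative choice; the smoothing parameter value is an assumption, not taken from the paper:

```python
import math
from collections import Counter

def smoothed_lm(doc_tokens, collection_tokens, mu=2000.0):
    """Dirichlet-smoothed document language model: the document's term
    counts are interpolated with the collection-wide distribution."""
    dc, cc = Counter(doc_tokens), Counter(collection_tokens)
    n, cn = len(doc_tokens), len(collection_tokens)
    def p(w):
        # Words unseen in the document still get collection-based mass.
        return (dc[w] + mu * cc[w] / cn) / (n + mu)
    return p

def query_likelihood(query_tokens, doc_tokens, collection_tokens):
    # Score a document by the log-probability its smoothed model
    # assigns to the query (assumes query terms occur in the collection).
    p = smoothed_lm(doc_tokens, collection_tokens)
    return sum(math.log(p(w)) for w in query_tokens)
```

Ranking by this score is equivalent, under a particular loss function, to the risk-minimization view the paper develops.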

16.
As a foundational task in topic detection and tracking and in public opinion monitoring, identifying and processing burst terms plays an important role in burst detection. This paper surveys the current state of research and the existing results in this area, compares and analyzes them, summarizes the key problems that remain to be solved with a focused discussion, and points out directions for future research.

17.
Recently, the high popularity of social networks has accelerated the development of item recommendation. Integrating the influence diffusion of social networks into recommendation systems is a challenging task, since the topic distribution over users and items is latent and user topic interest may change over time. In this paper, we propose a dynamic generative model for item recommendation which captures potential influence logs based on community-level topic influence diffusion to infer the latent topic distribution over users and items. Our model enables tracking the time-varying distributions of topic interest and topic popularity over communities in social networks. A collapsed Gibbs sampling algorithm is proposed to train the model, and an improved diversification algorithm is proposed to obtain a diversified item recommendation list. Extensive experiments are conducted to evaluate the effectiveness and efficiency of our method. The results validate our approach and show the superiority of our method compared with state-of-the-art diversified recommendation methods.

18.
Emerging topic detection is a vital research area for researchers and scholars interested in searching for and tracking new research trends and topics. Current text mining and data mining methods used for this purpose focus only on the frequency with which subjects are mentioned, and ignore the novelty of the subject, which is equally critical but beyond the scope of a frequency study. This work tackles this inadequacy by proposing a new set of indices for emerging topic detection: the novelty index (NI) and the published volume index (PVI). This set of indices is built on time, volume, and frequency, and provides a more precise basis for prediction. The indices are then used to determine the detection point (DP) of new emerging topics; following the detection point, the intersection of the indices decides the worth of a new topic. The algorithms presented in this paper can be used to determine the novelty and life span of an emerging topic in a specific field. The entire comprehensive collection of the ACM Digital Library is examined in the experiments. The application of the NI and PVI gives a promising indication of emerging topics in conferences and journals.
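The two-index scheme can be sketched as follows. The formulas below are illustrative stand-ins, not the paper's actual definitions of NI and PVI: here novelty simply decays with topic age, volume share is the topic's fraction of publications in a period, and the detection point is the first period where both indices clear their (assumed) thresholds:

```python
def novelty_index(first_seen_year, year):
    # Illustrative NI: novelty decays with topic age
    # (not the paper's exact formula).
    return 1.0 / (year - first_seen_year + 1)

def published_volume_index(topic_count, total_count):
    # Illustrative PVI: the topic's share of total publication
    # volume in a given period.
    return topic_count / total_count if total_count else 0.0

def detection_point(yearly_counts, yearly_totals, first_seen_year,
                    ni_threshold=0.5, pvi_threshold=0.05):
    # First year in which both indices exceed their thresholds;
    # the thresholds are assumptions for illustration.
    for year in sorted(yearly_counts):
        ni = novelty_index(first_seen_year, year)
        pvi = published_volume_index(yearly_counts[year], yearly_totals[year])
        if ni >= ni_threshold and pvi >= pvi_threshold:
            return year
    return None
```

A topic that is still new when its publication share crosses the volume threshold yields a detection point; an old topic or a low-volume one does not.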

19.
20.
Associative reference retrieval of related concepts is an important research topic in concept-based retrieval. This paper proposes a topic-based semantic association model for reference retrieval: by combining knowledge from the Semantic Web and ontologies with language processing techniques such as information extraction, it extracts the topic concepts of documents on a given topic and the associations among those concepts to form a semantic association model of the topic, which is then used to assist the reference retrieval process.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号