Similar Documents
20 similar documents found.
1.
[Purpose/Significance] Most existing research on entities in news documents is document- or entity-centric (e.g., text classification, entity linking); little work examines how important an entity is within the text itself. This study investigates importance-based entity ranking for news documents. [Method/Process] Given a document, we estimate the importance of each entity relative to that document and rank the entities accordingly. Experiments are conducted on the Sogou full-web news dataset, and the rankings are evaluated with two metrics: NDCG and the inverse-pair ratio. [Result/Conclusion] The results show that methods based on entity frequency, TF*IDF, information entropy, and TextRank, as well as an ensemble method, all perform well, while a clustering-coefficient-based method performs only moderately. The TF*IDF-based method achieves the best NDCG at 95.86%, and the ensemble method achieves the best inverse-pair ratio at 84.46%.
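The TF*IDF scoring and NDCG evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's exact formulation: the function names, the smoothed IDF term, and the toy data are assumptions of this sketch.

```python
import math
from collections import Counter

def tfidf_scores(doc_entities, corpus_docs):
    """Score each entity in one document by TF*IDF over a small corpus."""
    tf = Counter(doc_entities)
    n = len(corpus_docs)
    scores = {}
    for entity, freq in tf.items():
        df = sum(1 for doc in corpus_docs if entity in doc)
        # Smoothed IDF; the exact weighting used in the paper may differ.
        scores[entity] = (freq / len(doc_entities)) * math.log((n + 1) / (df + 1))
    return scores

def ndcg(ranked, relevance):
    """NDCG of a ranked entity list under graded relevance labels."""
    dcg = sum(relevance.get(e, 0) / math.log2(i + 2) for i, e in enumerate(ranked))
    ideal = sorted(relevance.values(), reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

A perfect ranking yields an NDCG of 1.0; any inversion of relevant entities lowers the score, which is what makes the measure suitable for comparing entity-ranking methods.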

2.
《Knowledge Acquisition》1992,4(1):127-161
This paper reports on an investigation into a formal language for specifying KADS models of expertise. After arguing the need for and the use of such formal representations, we discuss each of the layers of a KADS model of expertise in the subsequent sections, and define the formal constructions that we use to represent the KADS entities at every layer: order-sorted logic at the domain layer, meta-logic at the inference layer and dynamic logic at the task layer. All these constructions together make up (ML)2, the language that we use to represent models of expertise. We illustrate the use of (ML)2 in a small example model. We conclude by describing our experience to date with constructing such formal models in (ML)2, and by discussing some open problems that remain for future work.

3.
The current study has two objectives. First, we explore the characteristics of biological entities, such as drugs, and their side effects using an author–entity pair bipartite network. Second, we use the constructed network to examine whether there are outstanding features of relations between drugs and side effects. We extracted drug and side effect names from 169,766 PubMed abstracts published between 2010 and 2014 and constructed author–entity pair bipartite networks after ambiguous author names were processed. We propose a new ranking algorithm that takes into consideration the characteristics of bipartite networks to identify top-ranked drug and side effect pairs. To investigate the relationship between a particular drug and a side effect, we compared the drug and side effect pairs obtained from the network containing both drugs and side effects with those observed in SIDER, a human expert-curated database. The results of this study indicate that our approach was able to identify a wide range of patterns of drug–side effect relations from the perspective of authors’ research interests. Further, our approach also identified the unique characteristics of the relations of biomedical entities obtained using an author–entity pair bipartite network.
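One simple way to surface strongly associated entity pairs from an author–entity bipartite network is to count, for each pair, how many authors mention both entities (a shared-neighbour projection of the bipartite graph). This is a hedged sketch of the general idea only; the paper's ranking algorithm accounts for further bipartite network characteristics, and the drug/side-effect names below are invented examples.

```python
from itertools import combinations
from collections import Counter

def rank_entity_pairs(author_entities):
    """Rank entity pairs by the number of authors whose papers mention both.

    `author_entities` maps an author ID to the entities appearing in that
    author's papers; the result is a list of ((e1, e2), count) tuples,
    most strongly associated pairs first."""
    pair_counts = Counter()
    for entities in author_entities.values():
        for pair in combinations(sorted(set(entities)), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common()
```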

4.
This article describes work done at Indiana University to “batch load” data from MARC bibliographic and authority records into the Work-based and FRBR-like Variations system. A series of experiments to iteratively refine our batch loading algorithm is described, along with details of how the algorithm identifies Works, creates relationships between entities, and maps a large amount of data from MARC into Variations records. The article closes with a discussion of the potential impact of this work on Variations project workflow and community FRBRization activities.

5.
Examining how the fine-grained knowledge entities embedded in domain-specific texts are actually used is important for evaluating and selecting such entities. Fine-grained knowledge entities in academic texts typically have multiple types and multiple kinds of relations; mining the homogeneous and heterogeneous relations among them helps reveal how entities are actually used in a given field. Most existing work focuses on extracting and evaluating single knowledge entities from academic texts and pays little attention to relations between entities, which limits the knowledge-discovery capability of entity extraction. Taking the natural language processing (NLP) field as an example, this article mines association data among fine-grained knowledge entities from the full text of academic papers and visualizes the information embedded in those associations. Specifically, we take the Chinese papers published at the China National Conference on Computational Linguistics (CCL) between 2009 and 2018 as the corpus, manually annotate the knowledge entities used in the papers, and, reflecting the characteristics of NLP, divide them into four types: metric entities, tool entities, resource entities, and method entities. We then combine the Apriori association-rule mining algorithm with complex-network analysis software to build a knowledge-entity association network, revealing the entities commonly used in the field and the correlations in how they are used.
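The association-mining step can be illustrated with a single Apriori-style pass over 2-itemsets. This is a simplified sketch under stated assumptions: the full Apriori algorithm also generates longer itemsets level by level, and the entity names below are invented examples, not the paper's annotated data.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    """Find frequent entity pairs (2-itemsets) by relative support.

    Each transaction is the set of knowledge entities used in one paper;
    the result maps each frequent pair to its support (fraction of papers)."""
    pair_counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            pair_counts[pair] += 1
    n = len(transactions)
    return {p: c / n for p, c in pair_counts.items() if c / n >= min_support}
```

Pairs that clear the support threshold become edges of the knowledge-entity association network; edge weight can be the support value.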

6.
In the field of scientometrics, impact indicators and ranking algorithms are frequently evaluated using unlabelled test data comprising relevant entities (e.g., papers, authors, or institutions) that are considered important. The rationale is that the higher some algorithm ranks these entities, the better its performance. To compute a performance score for an algorithm, an evaluation measure is required to translate the rank distribution of the relevant entities into a single-value performance score. Until recently, it was simply assumed that taking the average rank (of the relevant entities) is an appropriate evaluation measure when comparing ranking algorithms or fine-tuning algorithm parameters.

With this paper we propose a framework for evaluating the evaluation measures themselves. Using this framework the following questions can now be answered: (1) which evaluation measure should be chosen for an experiment, and (2) given an evaluation measure and corresponding performance scores for the algorithms under investigation, how significant are the observed performance differences?

Using two publication databases and four test data sets we demonstrate the functionality of the framework and analyse the stability and discriminative power of the most common information retrieval evaluation measures. We find that there is no clear winner and that the performance of the evaluation measures is highly dependent on the underlying data. Our results show that the average rank is indeed an adequate and stable measure. However, we also show that relatively large performance differences are required to confidently determine if one ranking algorithm is significantly superior to another. Lastly, we list alternative measures that also yield stable results and highlight measures that should not be used in this context.
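The average-rank measure discussed above, together with a common alternative, can be stated in a few lines. This is a minimal sketch for intuition; the function names and the toy ranking are assumptions of this illustration, not taken from the paper.

```python
def average_rank(ranking, relevant):
    """Average (1-based) rank of the relevant entities; lower is better."""
    position = {e: i + 1 for i, e in enumerate(ranking)}
    return sum(position[e] for e in relevant) / len(relevant)

def mean_reciprocal_rank(ranking, relevant):
    """Mean reciprocal rank over the relevant entities; higher is better."""
    position = {e: i + 1 for i, e in enumerate(ranking)}
    return sum(1 / position[e] for e in relevant) / len(relevant)
```

Note that the two measures weight rank differences very differently: moving a relevant entity from rank 100 to rank 50 halves its contribution to the average rank but barely changes its reciprocal rank, which is one reason the choice of evaluation measure matters.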

7.
In this article we present Supervised Semantic Indexing, which defines a class of nonlinear (quadratic) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained from a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as cross-language retrieval or online advertising placement. Dealing with models over all pairs of word features is computationally challenging. We propose several improvements to our basic model for addressing this issue, including low-rank (but diagonal-preserving) representations, correlated feature hashing and sparsification. We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.
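The computational problem with pairwise word features (a vocabulary of size V yields V² pair features) is commonly tamed with the hashing trick: every word pair is hashed into a fixed-size vector. The sketch below illustrates that general idea only; it is not the paper's correlated feature hashing scheme, and the dimension and token lists are invented.

```python
def hashed_pair_features(query_tokens, doc_tokens, dim=16):
    """Map all query-word/document-word pairs into a fixed-size vector.

    Collisions are accepted in exchange for a bounded feature space,
    which is what makes pairwise-word models tractable at scale."""
    vec = [0.0] * dim
    for q in query_tokens:
        for d in doc_tokens:
            vec[hash((q, d)) % dim] += 1.0
    return vec
```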

8.
We present a novel algorithm to rank smaller academic entities such as university departments or research groups within a research discipline. The Weighted Top Candidate (WTC) algorithm is a generalisation of an expert identification method. The axiomatic characterisation of WTC shows why it is especially suitable for scientometric purposes. The key axiom is stability – the selected institutions support each other's membership. The WTC algorithm, upon receiving an institution citation matrix, produces a list of institutions that can be deemed experts of the field. With a parameter we can adjust how exclusive our list should be. By completely relaxing the parameter, we obtain the largest stable set – academic entities that can qualify as experts under the mildest conditions. With a strict setup, we obtain a short list of the absolute elite. We demonstrate the algorithm on a citation database compiled from game theoretic literature published between 2008 and 2017. By plotting the size of the stable sets with respect to exclusiveness, we can obtain an overview of the competitiveness of the field. The diagram hints at how difficult it is for an institution to improve its position.

9.
The deployment of Web 2.0 technologies has led to rapid growth of various opinions and reviews on the web, such as reviews on products and opinions about people. Such content can be very useful to help people find interesting entities like products, businesses and people based on their individual preferences or tradeoffs. Most existing work on leveraging opinionated content has focused on integrating and summarizing opinions on entities to help users better digest all the opinions. In this paper, we propose a different way of leveraging opinionated content, by directly ranking entities based on a user’s preferences. Our idea is to represent each entity with the text of all the reviews of that entity. Given a user’s keyword query that expresses the desired features of an entity, we can then rank all the candidate entities based on how well opinions on these entities match the user’s preferences. We study several methods for solving this problem, including both standard text retrieval models and some extensions of these models. Experiment results on ranking entities based on opinions in two different domains (hotels and cars) show that the proposed extensions are effective and lead to improvement of ranking accuracy over the standard text retrieval models for this task.
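The core idea, representing each entity by the pooled text of its reviews and ranking entities against a preference query, can be sketched with a plain term-frequency score. This is a deliberately minimal stand-in for the standard retrieval models the paper evaluates (it omits IDF weighting and length normalisation), and the hotel data is invented.

```python
from collections import Counter

def rank_by_reviews(entity_reviews, query):
    """Rank entities by how often the query's preference terms
    appear in each entity's pooled review text."""
    terms = query.lower().split()

    def score(reviews):
        counts = Counter(" ".join(reviews).lower().split())
        return sum(counts[t] for t in terms)

    return sorted(entity_reviews, key=lambda e: score(entity_reviews[e]),
                  reverse=True)
```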

10.
Text-mining-based analysis of regional economic relations
Most existing studies of economic relations use empirical or purely econometric methods. Targeting the unstructured nature of text, this paper applies information extraction and text mining to discover the regional economic relations that interest users, a problem of considerable practical value. Building on an entity-relation-based text-mining mechanism, we analyze the regional economic relations among 31 provinces, municipalities, and autonomous regions. Text mining is applied to economic relations in two ways. The first is attribute-based mining, which uses information extraction to obtain the attributes of each entity and clustering to analyze relations among economic entities. The second is citation-based mining, which first builds a classification dictionary of economic-entity relations, proposes a relation-labeling algorithm, uses information extraction to obtain the references between entities, and then constructs a directed relation graph from which the relations between regional economies are mined. The study shows that text mining can be used both to analyze and assess the economic development of each region and to uncover the intrinsic relations between specific regional economies.

11.
陈曦  陈华钧  张文 《情报工程》2017,3(1):026-034
Representation learning for knowledge graphs (KGs) aims to embed the entities and relations of a KG as dense, low-dimensional real-valued vectors so that entities, relations, and the complex semantic associations between them can be computed efficiently in the low-dimensional vector space; it plays an important role in KG construction, reasoning, fusion, mining, and applications. Existing KG representation methods consider only the direct facts in the graph and ignore hidden semantic information, even though that information strongly affects the embedded representations of relations and entities. This paper proposes a rule-enhanced representation learning method for knowledge graphs. The method first extracts, via KG rule mining, a set of Horn logic rules that capture the graph's semantic information, and then injects the corresponding hidden semantics into the KG representation learning model through rule-based materialized inference. Experimental results show that the rule-enhanced approach significantly improves the effectiveness and performance of existing KG representation learning models on link prediction and triple prediction.
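The rule-based materialization step can be illustrated with simple forward chaining over Horn rules of the form body_relation(x, y) → head_relation(x, y). This is a hedged sketch of the general technique only: the paper's mined rules and its injection into the embedding model are more involved, and the triples and rules below are invented.

```python
def materialize(triples, rules):
    """Forward-chain Horn rules (body_rel -> head_rel) until no new
    triples can be inferred, returning the enlarged fact set."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for body_rel, head_rel in rules:
            inferred = {(h, head_rel, t) for (h, r, t) in facts if r == body_rel}
            if not inferred <= facts:
                facts |= inferred
                changed = True
    return facts
```

The inferred triples can then be added to the training set of an embedding model, which is one straightforward way to expose hidden semantics to representation learning.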

12.
Enterprise search is important, and the search quality has a direct impact on the productivity of an enterprise. Enterprise data contain both structured and unstructured information. Since these two types of information are complementary and the structured information such as relational databases is designed based on ER (entity-relationship) models, there is a rich body of information about entities in enterprise data. As a result, many information needs of enterprise search center around entities. For example, a user may formulate a query describing a problem that she encounters with an entity, e.g., the web browser, and want to retrieve relevant documents to solve the problem. Intuitively, information related to the entities mentioned in the query, such as related entities and their relations, would be useful to reformulate the query and improve the retrieval performance. However, most existing studies on query expansion are term-centric. In this paper, we propose a novel entity-centric query expansion framework for enterprise search. Specifically, given a query containing entities, we first utilize both unstructured and structured information to find entities that are related to the ones in the query. We then discuss how to adapt existing feedback methods to use the related entities and their relations to improve search quality. Experimental results over two real-world enterprise collections show that the proposed entity-centric query expansion strategies are more effective and robust in improving search performance than the state-of-the-art pseudo feedback methods for long natural-language-like queries with entities. Moreover, results over a TREC ad hoc retrieval collection show that the proposed methods can also work well for short keyword queries in the general search domain.

13.
Never before in history has mankind produced and had access to so much data, information, knowledge, and expertise as today. To organize, access, and manage these valuable assets effectively, we use taxonomies, classification hierarchies, ontologies, controlled vocabularies, and other approaches. We create directory structures for our files. We use organizational hierarchies to structure our work environment. However, the design and continuous update of these organizational schemas, with potentially thousands of class nodes organizing millions of entities, is challenging for any human being.

The taxonomy visualization and validation (TV) tool introduced in this paper supports the semi-automatic validation and optimization of organizational schemas such as file directories, classification hierarchies, taxonomies, or other structures imposed on a data set for organization, access, and naming. By showing the “goodness of fit” for a schema and the potentially millions of entities it organizes, the TV tool eases the identification and reclassification of misclassified information entities, the identification of classes that grow too large, the evaluation of the size and homogeneity of existing classes, the examination of the “well-formedness” of an organizational schema, and more. As a demonstration, the TV tool is applied to display and examine the United States Patent and Trademark Office patent classification, which organizes more than three million patents into about 160,000 distinct patent classes. The paper concludes with a discussion and an outlook to future work.

14.
This article surveys the state of research on named entity recognition and translation, proposes an information-extraction-based method for named entity recognition and translation together with a series of integrated optimizations of that method, and implements cross-language information retrieval (CLIR) experiments based on the recognized and translated named entities. The experimental results show the importance of named entity recognition and translation in CLIR, and demonstrate that applying the proposed translation-weighting method and the web-mining method for out-of-vocabulary named entities significantly improves CLIR performance.

15.
A machine learning approach to sentiment analysis in multilingual Web texts
Sentiment analysis, also called opinion mining, is a form of information extraction from text of growing research and commercial interest. In this paper we present our machine learning experiments with regard to sentiment analysis in blog, review and forum texts found on the World Wide Web and written in English, Dutch and French. We train from a set of example sentences or statements that are manually annotated as positive, negative or neutral with regard to a certain entity. We are interested in the feelings that people express with regard to certain consumption products. We learn and evaluate several classification models that can be configured in a cascaded pipeline. We have to deal with several problems, namely the noisy character of the input texts, the attribution of the sentiment to a particular entity and the small size of the training set. We succeed in identifying positive, negative and neutral feelings towards the entity under consideration with ca. 83% accuracy for English texts, based on unigram features augmented with linguistic features. The accuracy results for the Dutch and French texts are ca. 70% and 68% respectively, owing to the larger variety of linguistic expressions that more often diverge from the standard language, thus demanding more training patterns. In addition, our experiments give us insights into the portability of the learned models across domains and languages. A substantial part of the article investigates the role of active learning techniques in reducing the number of examples to be manually annotated.

16.

This paper serves as a basic reference tool, especially for catalogers of early cartographic resources, providing an extensive, annotated list of bibliographic references that are useful for cataloging-related cartobibliographical detective work. The entries are categorized as biographical sources, library bibliographies and catalogs, regional cartobibliographies by place of publication or geographic area of map coverage, thematic cartobibliographies, indexes to monographic and serial publications, comparative studies and introductory texts accompanying facsimile publications, basic bibliographies on the history of cartography, and cartographic periodicals. It provides examples of these various types of sources and comments on their helpful features. While the selections focus on English-language resources and maps of North America, the paper suggests leads for finding additional sources pertaining to particular unmet needs that may arise.

17.
In collaborative content generation (CCG), such as publishing scientific articles, a group of contributors collaboratively generates artifacts made available through a venue. The main concern in such systems is quality. A considerable range of research addresses quality metrics only partially when dealing with the quality of artifacts, contributors, and venues. Such approaches have several drawbacks. One of the most notable is that they are not comprehensive in terms of the metrics used to evaluate all entities, including artifacts, contributors, and venues; they are also vulnerable to potential attacks.

In this paper, we propose a novel iterative definition in which the qualities of artifacts, contributors, and venues are defined interconnectedly. In our framework, the quality of an artifact is defined based on the quality of its contributors, venue, references, and citations. The quality of a contributor is defined based on the quality of their artifacts, collaborators, and venues. The quality of a venue is defined based on the quality of both its artifacts and its contributors. We propose a data model, formulations, and an algorithm for the proposed approach. We also compare the robustness of our approach against malicious manipulation with that of two well-known related approaches. The comparison results show the superiority of our method over the related approaches.
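Such interconnected definitions are typically computed as a fixed point: start from uniform scores and alternate between updating artifact quality from contributor quality and vice versa until the scores stabilise. The sketch below illustrates this general pattern for artifacts and contributors only (it omits venues, references, and citations, and all names and data are assumptions of this illustration, not the paper's formulation).

```python
def iterate_quality(artifact_authors, rounds=20):
    """Alternate between artifact quality (mean contributor quality)
    and contributor quality (sum of their artifacts' quality),
    normalising each round so the iteration converges."""
    authors = {a for contribs in artifact_authors.values() for a in contribs}
    author_q = {a: 1.0 for a in authors}
    artifact_q = {}
    for _ in range(rounds):
        artifact_q = {art: sum(author_q[a] for a in contribs) / len(contribs)
                      for art, contribs in artifact_authors.items()}
        author_q = {a: sum(q for art, q in artifact_q.items()
                           if a in artifact_authors[art])
                    for a in authors}
        norm = max(author_q.values()) or 1.0
        author_q = {a: q / norm for a, q in author_q.items()}
    return artifact_q, author_q
```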

18.
In the summer of 2013, the United Nations and NBC began a season-long collaborative campaign involving the primetime television series Revolution (2012–2014), a show about the global loss of electricity, to promote the former’s energy resource campaigns. The two entities collaboratively produced various texts and events encouraging audiences to learn more about United Nations energy initiatives and how people throughout the world lack consistent access to electricity. This essay offers a close, rhetorical reading of the collaboration’s paratexts, examining stated responses from actors, creators, interviewers, and panel participants within this content. In particular, I argue that contact between the paratexts and the “formative” text (that of the show’s narrative) can encourage viewers to think about electricity from the perspective of their own material practices, dependencies, and fears over losing the technological world. I examine how these invested viewers interpreted the United Nations’ efforts through such commitments. Naming a fictive world, and its feared loss, as metonymic of energy politics illustrates how meaning, emotion, and texts circulate, while also implicating the use of celebrity platforms for sociopolitical issues such as energy access.

19.
A review of knowledge discovery research in China's library and information science field during the Eleventh Five-Year Plan period
Drawing on the large body of recent papers on knowledge discovery, this article reviews the research from the perspectives of concept clarification, knowledge discovery methodology, text mining and text-trend mining, knowledge discovery from non-related literature, and extensions of data mining research, and summarizes the achievements of knowledge discovery research in China's library and information field during the Eleventh Five-Year Plan period. It highlights progress on content analysis, association theory, domain-driven approaches, visualization, and text-mining models for knowledge discovery, and concludes with an analysis and outlook of the field's future research hotspots and directions.

20.
Prediction of the future performance of academic journals is a task that can benefit a variety of stakeholders including editorial staff, publishers, indexing services, researchers, university administrators and granting agencies. Using historical data on journal performance, this can be framed as a machine learning regression problem. In this work, we study two such regression tasks: (1) prediction of the number of citations a journal will receive during the next calendar year, and (2) prediction of the Elsevier CiteScore a journal will be assigned for the next calendar year. To address these tasks, we first create a dataset of historical bibliometric data for journals indexed in Scopus. We propose the use of neural network models trained on our dataset to predict the future performance of journals. To this end, we perform feature selection and model configuration for a Multi-Layer Perceptron and a Long Short-Term Memory network. Through experimental comparisons to heuristic prediction baselines and classical machine learning models, we demonstrate the superior performance of our proposed models for the prediction of future citation counts and CiteScore values.
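Two heuristic baselines of the kind such regression models are compared against can be written in a few lines: carry forward last year's value, or extrapolate a least-squares linear trend one year ahead. This is a sketch under stated assumptions; the paper's actual baselines, features, and neural models are not specified here, and the citation history below is invented.

```python
def naive_forecast(history):
    """Persistence baseline: next year's value equals this year's."""
    return history[-1]

def linear_trend_forecast(history):
    """Fit a least-squares line to the yearly series and
    extrapolate it one step ahead."""
    n = len(history)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(history) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean + slope * (n - x_mean)
```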
