期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A domain-independent approach to finding related entities

Olga Vechtomova Stephen E. Robertson 《Information processing & management》2012

We propose an approach to the retrieval of entities that have a specific relationship with the entity given in a query. Our research goal is to investigate whether related entity finding problem can be addressed by combining a measure of relatedness of candidate answer entities to the query, and likelihood that the candidate answer entity belongs to the target entity category specified in the query. An initial list of candidate entities, extracted from top ranked documents retrieved for the query, is refined using a number of statistical and linguistic methods. The proposed method extracts the category of the target entity from the query, identifies instances of this category as seed entities, and computes similarity between candidate and seed entities. The evaluation was conducted on the Related Entity Finding task of the Entity Track of TREC 2010, as well as the QA list questions from TREC 2005 and 2006. Evaluation results demonstrate that the proposed methods are effective in finding related entities. 相似文献

2.

Effects of answer weight boosting in strategy-driven question answering

Hyo-Jung Oh Sung Hyon Myaeng 《Information processing & management》2012,48(1):83-93

With the advances in natural language processing (NLP) techniques and the need to deliver more fine-grained information or answers than a set of documents, various QA techniques have been developed corresponding to different question and answer types. A comprehensive QA system must be able to incorporate individual QA techniques as they are developed and integrate their functionality to maximize the system’s overall capability in handling increasingly diverse types of questions. To this end, a new QA method was developed to learn strategies for determining module invocation sequences and boosting answer weights for different types of questions. In this article, we examine the roles and effects of the answer verification and weight boosting method, which is the main core of the automatically generated strategy-driven QA framework, in comparison with a strategy-less, straightforward answer-merging approach and a strategy-driven but with manually constructed strategies. 相似文献

3.

An exploration of ranking models and feedback method for related entity finding

Xitong Liu Wei Zheng Hui Fang 《Information processing & management》2013

Most existing search engines focus on document retrieval. However, information needs are certainly not limited to finding relevant documents. Instead, a user may want to find relevant entities such as persons and organizations. In this paper, we study the problem of related entity finding. Our goal is to rank entities based on their relevance to a structured query, which specifies an input entity, the type of related entities and the relation between the input and related entities. We first discuss a general probabilistic framework, derive six possible retrieval models to rank the related entities, and then compare these models both analytically and empirically. To further improve performance, we study the problem of feedback in the context of related entity finding. Specifically, we propose a mixture model based feedback method that can utilize the pseudo feedback entities to estimate an enriched model for the relation between the input and related entities. Experimental results over two standard TREC collections show that the derived relation generation model combined with a relation feedback method performs better than other models. 相似文献

4.

Linguistic kernels for answer re-ranking in question answering systems

Alessandro Moschitti Silvia Quarteroni 《Information processing & management》2011

Answer selection is the most complex phase of a question answering (QA) system. To solve this task, typical approaches use unsupervised methods such as computing the similarity between query and answer, optionally exploiting advanced syntactic, semantic or logic representations. 相似文献

5.

Compositional question answering: A divide and conquer approach

Hyo-Jung Oh Ki-Youn Sung Myung-Gil Jang Sung Hyon Myaeng 《Information processing & management》2011

This paper describes how questions can be characterized for question answering (QA) along different facets and focuses on questions that cannot be answered directly but can be divided into simpler ones so that they can be answered directly using existing QA capabilities. Since individual answers are composed to generate the final answer, we call this process as compositional QA. The goal of the proposed QA method is to answer a composite question by dividing it into atomic ones, instead of developing an entirely new method tailored for the new question type. A question is analyzed automatically to determine its class, and its sub-questions are sent to the relevant QA modules. Answers returned from the individual QA modules are composed based on the predetermined plan corresponding to the question type. The experimental results based on 615 questions show that the compositional QA approach outperforms the simple routing method by about 17%. Considering 115 composite questions only, the F-score was almost tripled from the baseline. 相似文献

6.

An in-depth study of similarity predicate committee

Jia Zhu Gabriel Pui Cheong Fung Zeyang Lei Min Yang Ying Shen 《Information processing & management》2019,56(3):381-393

In the last decades, many similarity measures are proposed, such as Jaccard coefficient, cosine similarity, BM25, language model, etc. Despite the effectiveness of the existing similarity measures, we observe that none of them can consistently outperform the others in most typical situations. Choosing which similarity predicate to use is usually treated as an empirical question by evaluating a particular task with a number of different similarity predicates, which is not computationally efficient and the obtained results are not portable. In this paper, we propose a novel approach to combine different similarity predicates together to form a committee so that we do not need to worry about choosing which of them to use. Empirically, we can obtain a better result than any individual similarity predicate, which is quite meaningful in practice. Specifically, our method models the problem of committee generation as a 0–1 integer programming problem based on the confidence of similarity predicates and the reliability of attributes. We demonstrate the effectiveness of our model by applying it on three datasets with controlled errors. Experimental results demonstrate that our similarity predicate committee is more robust and superior over existing individual similarity predicates. 相似文献

7.

Named entity recognition in Turkish: A comparative study with detailed error analysis

《Information processing & management》2022,59(6):103065

Named entity recognition aims to detect pre-determined entity types in unstructured text. There is a limited number of studies on this task for low-resource languages such as Turkish. We provide a comprehensive study for Turkish named entity recognition by comparing the performances of existing state-of-the-art models on the datasets with varying domains to understand their generalization capability and further analyze why such models fail or succeed in this task. Our experimental results, supported by statistical tests, show that the highest weighted F1 scores are obtained by Transformer-based language models, varying from 80.8% in tweets to 96.1% in news articles. We find that Transformer-based language models are more robust to entity types with a small sample size and longer named entities compared to traditional models, yet all models have poor performance for longer named entities in social media. Moreover, when we shuffle 80% of words in a sentence to imitate flexible word order in Turkish, we observe more performance deterioration, 12% in well-written texts, compared to 7% in noisy text. 相似文献

8.

基于本体的文本信息检索研究 总被引：5，自引：0，他引：5

杨建林《情报理论与实践》2006,29(5):598-601

本文对如何构建基于本体的文本信息检索系统进行了探讨．并认为，利用反映概念之间关系的领域本体指导主题标引，利用反映实体之间关系的领域本体指导实体关系标引，并以本体的形式表示文档替代物和查询表达式，可以进一步提高文本信息检索系统的性能。相似文献

9.

XML data exchange with target constraints

Zijing Tan Liyong Zhang Wei Wang Baile Shi 《Information processing & management》2013

Data exchange is the problem of taking data structured under a source schema and creating an instance of a target schema, by following a mapping between the two schemas. There is a rich literature on problems related to data exchange, e.g., the design of a schema mapping language, the consistency of schema mappings, operations on mappings, and query answering over mappings. Data exchange is extensively studied on relational model, and is also recently discussed for XML data. This article investigates the construction of target instance for XML data exchange, which has received far less attention. We first present a rich language for the definition of schema mappings, which allow one to use various forms of document navigation and specify conditions on data values. Given a schema mapping, we then provide an algorithm to construct a canonical target instance. The schema mapping alone is not adequate for expressing target semantics, and hence, the canonical instance is in general not optimal. We recognize that target constraints play a crucial role in the generation of good solutions. In light of this, we employ a general XML constraint model to define target constraints. Structural constraints and keys are used to identify a certain entity, as rules for data merging. Moreover, we develop techniques to enforce non-key constraints on the canonical target instance, by providing a chase method to reason about data. Experimental results show that our algorithms scale well, and are effective in producing target instances of good quality. 相似文献

10.

Boosting question answering over knowledge graph with reward integration and policy evaluation under weak supervision

《Information processing & management》2023,60(2):103242

Among existing knowledge graph based question answering (KGQA) methods, relation supervision methods require labeled intermediate relations for stepwise reasoning. To avoid this enormous cost of labeling on large-scale knowledge graphs, weak supervision methods, which use only the answer entity to evaluate rewards as supervision, have been introduced. However, lacking intermediate supervision raises the issue of sparse rewards, which may result in two types of incorrect reasoning path: (1) incorrectly reasoned relations, even when the final answer entity may be correct; (2) correctly reasoned relations in a wrong order, which leads to an incorrect answer entity. To address these issues, this paper considers the multi-hop KGQA task as a Markov decision process, and proposes a model based on Reward Integration and Policy Evaluation (RIPE). In this model, an integrated reward function is designed to evaluate the reasoning process by leveraging both terminal and instant rewards. The intermediate supervision for each single reasoning hop is constructed with regard to both the fitness of the taken action and the evaluation of the unreasoned information remained in the updated question embeddings. In addition, to lead the agent to the answer entity along the correct reasoning path, an evaluation network is designed to evaluate the taken action in each hop. Extensive ablation studies and comparative experiments are conducted on four KGQA benchmark datasets. The results demonstrate that the proposed model outperforms the state-of-the-art approaches in terms of answering accuracy. 相似文献

11.

Document replication strategies for geographically distributed web search engines

Enver Kayaaslan B. Barla Cambazoglu Cevdet Aykanat 《Information processing & management》2013

Large-scale web search engines are composed of multiple data centers that are geographically distant to each other. Typically, a user query is processed in a data center that is geographically close to the origin of the query, over a replica of the entire web index. Compared to a centralized, single-center search engine, this architecture offers lower query response times as the network latencies between the users and data centers are reduced. However, it does not scale well with increasing index sizes and query traffic volumes because queries are evaluated on the entire web index, which has to be replicated and maintained in all data centers. As a remedy to this scalability problem, we propose a document replication framework in which documents are selectively replicated on data centers based on regional user interests. Within this framework, we propose three different document replication strategies, each optimizing a different objective: reducing the potential search quality loss, the average query response time, or the total query workload of the search system. For all three strategies, we consider two alternative types of capacity constraints on index sizes of data centers. Moreover, we investigate the performance impact of query forwarding and result caching. We evaluate our strategies via detailed simulations, using a large query log and a document collection obtained from the Yahoo! web search engine. 相似文献

12.

中外文搜索引擎自然语言问答能力的比较与评价研究

下载免费PDF全文

赵一鸣刘炫彤《情报科学》2020,38(1):67-74

【目的/意义】对Google、Bing、百度和搜狗四个中外文搜索引擎的自然语言问答能力进行评价,以揭示搜索引擎正在向兼具搜索和自动问答功能的系统演进的趋势,对不同搜索引擎在不同类型问题上的自然语言回答能力进行比较。【方法/过程】从文本检索会议和自然语言处理与中文计算会议的问答系统评测项目抽取了三类问题（人物类、时间类、地点类）,并进行搜索,以搜索引擎是否返回准确答案或包含正确答案的精选摘要为标准进行人工评分,使用单因素方差分析和多重比较检验的方法进行比较分析。【结果/结论】主流的中外文搜索引擎均已具备一定的自然语言问答能力,但仍存在较大的提升空间。Google总体表现最好,但对于人物类问题的回答能力弱于搜狗。中外文搜索引擎在时间类问题上的表现均好于人物类和地点类问题。相似文献

13.

远程监督实体关系抽取研究

下载免费PDF全文

柯佳《情报科学》2021,39(10):165-169

【目的/意义】实体关系抽取是构建领域本体、知识图谱、开发问答系统的基础工作。远程监督方法将大规模非结构化文本与已有的知识库实体对齐,自动标注训练样本,解决了有监督机器学习方法人工标注训练语料耗时费力的问题,但也带来了数据噪声。【方法/过程】本文详细梳理了近些年远程监督结合深度学习技术,降低训练样本噪声,提升实体关系抽取性能的方法。【结果/结论】卷积神经网络能更好的捕获句子局部、关键特征、长短时记忆网络能更好的处理句子实体对远距离依赖关系,模型自动抽取句子词法、句法特征,注意力机制给予句子关键上下文、单词更大的权重,在神经网络模型中融入先验知识能丰富句子实体对的语义信息,显著提升关系抽取性能。【创新/局限】下一步的研究应考虑实体对重叠关系、实体对长尾语义关系的处理方法,更加全面的解决实体对关系噪声问题。相似文献

14.

一种基于领域的语义搜索引擎模型SSEM

李江华时鹏《情报杂志》2012,31(4):112-116

Internet已成为全球最丰富的数据源,数据类型繁杂且动态变化,如何从中快速准确地检索出用户所需要的信息是一个亟待解决的问题.传统的搜索引擎基于语法的方式进行搜索,缺乏语义信息,难以准确地表达用户的查询需求和被检索对象的文档语义,致使查准率和查全率较低且搜索范围有限.本文对现有的语义检索方法进行了研究,分析了其中存在的问题,在此基础上提出了一种基于领域的语义搜索引擎模型,结合语义Web技术,使用领域本体元数据模型对用户的查询进行语义化规范,依据领域本体模式抽取文档中的知识并RDF化,准确地表达了用户的查询语义和作为被查询对象的文档语义,可以大大提高检索的准确性和检索效率,详细地给出了模型的体系结构、基本功能和工作原理. 相似文献

15.

Learning to select the correct answer in multi-stream question answering

Alberto Téllez-Valero Manuel Montes-y-Gómez Luis Villaseñor-Pineda Anselmo Peñas Padilla 《Information processing & management》2011

Question answering (QA) is the task of automatically answering a question posed in natural language. Currently, there exists several QA approaches, and, according to recent evaluation results, most of them are complementary. That is, different systems are relevant for different kinds of questions. Somehow, this fact indicates that a pertinent combination of various systems should allow to improve the individual results. This paper focuses on this problem, namely, the selection of the correct answer from a given set of responses corresponding to different QA systems. In particular, it proposes a supervised multi-stream approach that decides about the correctness of answers based on a set of features that describe: (i) the compatibility between question and answer types, (ii) the redundancy of answers across streams, as well as (iii) the overlap and non-overlap information between the question–answer pair and the support text. Experimental results are encouraging; evaluated over a set of 190 questions in Spanish and using answers from 17 different QA systems, our multi-stream QA approach could reach an estimated QA performance of 0.74, significantly outperforming the estimated performance from the best individual system (0.53) as well as the result from best traditional multi-stream QA approach (0.60). 相似文献

16.

Object identification and retrieval from efficient image matching. Snap2Tell with the STOIC dataset

Jean-Pierre Chevallet Joo-Hwee Lim Mun-Kew Leong 《Information processing & management》2007

Traditional content based image retrieval attempts to retrieve images using syntactic features for a query image. Annotated image banks and Google allow the use of text to retrieve images. In this paper, we studied the task of using the content of an image to retrieve information in general. We describe the significance of object identification in an information retrieval paradigm that uses image set as intermediate means in indexing and matching. We also describe a unique Singapore Tourist Object Identification Collection with associated queries and relevance judgments for evaluating the new task and the need for efficient image matching using simple image features. We present comprehensive experimental evaluation on the effects of feature dimensions, context, spatial weightings, coverage of image indexes, and query devices on task performance. Lastly we describe the current system developed to support mobile image-based tourist information retrieval. 相似文献

17.

Querying XML documents using Prolog engines: When is this a good idea?

《Information processing & management》2019,56(5):1753-1770

XML has become a universal standard for information exchange over the Web due to features such as simple syntax and extensibility. Processing queries over these documents has been the focus of several research groups. In fact, there is broad literature in efficient XML query processing which explore indexes, fragmentation techniques, etc. However, for answering complex queries, existing approaches mainly analyze information that is explicitly defined in the XML document. A few work investigate the use of Prolog to increase the query possibilities, allowing inference over the data content. This can cause a significant increase in the query possibilities and expressive power, allowing access to non-obvious information. However, this requires translating the XML documents into Prolog facts. But for regular queries (which do not require inference), is this a good alternative? What kind of queries could benefit from the Prolog translation? Can we always use Prolog engines to execute XML queries in an efficient way? There are many questions involved in adopting an alternative approach to run XML queries. In this work, we investigate this matter by translating XML queries into Prolog queries and comparing the query processing times using Prolog and native XML engines. Our work contributes by providing a set of heuristics that helps users to decide when to use Prolog engines to process a given XML query. In summary, our results show that queries that search elements by a key value or by its position (simple search) are more efficient when run in Prolog than in native XML engines. Also, queries over large datasets, or that searches for substrings perform better when run by native XML engines. 相似文献

18.

基于景观生态和马尔可夫过程的西安地区土地利用变化分析 总被引：16，自引：4，他引：16

解修平周杰张海龙葛俊逸郎海鸥张宏宾《资源科学》2006,28(6):175-181

本文利用三期TM影像，通过遥感与GIS技术，结合景观生态学理论和马尔可夫过程，深入分析了15年来西安地区土地利用变化的数量和空间特征以及由此所引起的一系列生态环境效应，主要表现在：城镇建设用地、果园与未利用地面积显著增加，农用地和水体面积呈减少趋势，城镇建设用地的增加是以侵占大量农田为代价的。景观破碎度变大，形状趋于复杂，优势度降低，多样性和均匀性增大，景观类型有向多样性或均衡化方向发展的趋势。景观的匀质化发展降低了景观抗干扰的能力，同时导致景观稳定性降低。为了弥补传统上由各类土地总面积的变化来解释变迁的缺点，研究利用马尔可夫过程对研究区的土地利用变化进行理论趋势值分析，分析表明研究区内土地利用变化以各类型转化为城镇建设用地的趋势最强，在一定程度上显示了研究区内正处于城市化的快速阶段。 相似文献

19.

Propagating sentiment signals for estimating reputation polarity

《Information processing & management》2019,56(6):102079

The emergence of social media and the huge amount of opinions that are posted everyday have influenced online reputation management. Reputation experts need to filter and control what is posted online and, more importantly, determine if an online post is going to have positive or negative implications towards the entity of interest. This task is challenging, considering that there are posts that have implications on an entity's reputation but do not express any sentiment. In this paper, we propose two approaches for propagating sentiment signals to estimate reputation polarity of tweets. The first approach is based on sentiment lexicons augmentation, whereas the second is based on direct propagation of sentiment signals to tweets that discuss the same topic. In addition, we present a polar fact filter that is able to differentiate between reputation-bearing and reputation-neutral tweets. Our experiments indicate that weakly supervised annotation of reputation polarity is feasible and that sentiment signals can be propagated to effectively estimate the reputation polarity of tweets. Finally, we show that learning PMI values from the training data is the most effective approach for reputation polarity analysis. 相似文献

20.

Estimating Gaussian mixture models in the local neighbourhood of embedded word vectors for query performance prediction

《Information processing & management》2019,56(3):1026-1045

The study of query performance prediction (QPP) in information retrieval (IR) aims to predict retrieval effectiveness. The specificity of the underlying information need of a query often determines how effectively can a search engine retrieve relevant documents at top ranks. The presence of ambiguous terms makes a query less specific to the sought information need, which in turn may degrade IR effectiveness. In this paper, we propose a novel word embedding based pre-retrieval feature which measures the ambiguity of each query term by estimating how many ‘senses’ each word is associated with. Assuming each sense roughly corresponds to a Gaussian mixture component, our proposed generative model first estimates a Gaussian mixture model (GMM) from the word vectors that are most similar to the given query terms. We then use the posterior probabilities of generating the query terms themselves from this estimated GMM in order to quantify the ambiguity of the query. Previous studies have shown that post-retrieval QPP approaches often outperform pre-retrieval ones because they use additional information from the top ranked documents. To achieve the best of both worlds, we formalize a linear combination of our proposed GMM based pre-retrieval predictor with NQC, a state-of-the-art post-retrieval QPP. Our experiments on the TREC benchmark news and web collections demonstrate that our proposed hybrid QPP approach (in linear combination with NQC) significantly outperforms a range of other existing pre-retrieval approaches in combination with NQC used as baselines. 相似文献