首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Measuring effectiveness of information retrieval (IR) systems is essential for research and development and for monitoring search quality in dynamic environments. In this study, we employ new methods for automatic ranking of retrieval systems. In these methods, we merge the retrieval results of multiple systems using various data fusion algorithms, use the top-ranked documents in the merged result as the “(pseudo) relevant documents,” and employ these documents to evaluate and rank the systems. Experiments using Text REtrieval Conference (TREC) data provide statistically significant strong correlations with human-based assessments of the same systems. We hypothesize that the selection of systems that would return documents different from the majority could eliminate the ordinary systems from data fusion and provide better discrimination among the documents and systems. This could improve the effectiveness of automatic ranking. Based on this intuition, we introduce a new method for the selection of systems to be used for data fusion. For this purpose, we use the bias concept that measures the deviation of a system from the norm or majority and employ the systems with higher bias in the data fusion process. This approach provides even higher correlations with the human-based results. We demonstrate that our approach outperforms the previously proposed automatic ranking methods.  相似文献   

2.
This paper presents a robust and comprehensive graph-based rank aggregation approach, used to combine results of isolated ranker models in retrieval tasks. The method follows an unsupervised scheme, which is independent of how the isolated ranks are formulated. Our approach is able to combine arbitrary models, defined in terms of different ranking criteria, such as those based on textual, image or hybrid content representations.We reformulate the ad-hoc retrieval problem as a document retrieval based on fusion graphs, which we propose as a new unified representation model capable of merging multiple ranks and expressing inter-relationships of retrieval results automatically. By doing so, we claim that the retrieval system can benefit from learning the manifold structure of datasets, thus leading to more effective results. Another contribution is that our graph-based aggregation formulation, unlike existing approaches, allows for encapsulating contextual information encoded from multiple ranks, which can be directly used for ranking, without further computations and post-processing steps over the graphs. Based on the graphs, a novel similarity retrieval score is formulated using an efficient computation of minimum common subgraphs. Finally, another benefit over existing approaches is the absence of hyperparameters.A comprehensive experimental evaluation was conducted considering diverse well-known public datasets, composed of textual, image, and multimodal documents. Performed experiments demonstrate that our method reaches top performance, yielding better effectiveness scores than state-of-the-art baseline methods and promoting large gains over the rankers being fused, thus demonstrating the successful capability of the proposal in representing queries based on a unified graph-based model of rank fusions.  相似文献   

3.
Semi-supervised document retrieval   总被引:2,自引:0,他引:2  
This paper proposes a new machine learning method for constructing ranking models in document retrieval. The method, which is referred to as SSRank, aims to use the advantages of both the traditional Information Retrieval (IR) methods and the supervised learning methods for IR proposed recently. The advantages include the use of limited amount of labeled data and rich model representation. To do so, the method adopts a semi-supervised learning framework in ranking model construction. Specifically, given a small number of labeled documents with respect to some queries, the method effectively labels the unlabeled documents for the queries. It then uses all the labeled data to train a machine learning model (in our case, Neural Network). In the data labeling, the method also makes use of a traditional IR model (in our case, BM25). A stopping criterion based on machine learning theory is given for the data labeling process. Experimental results on three benchmark datasets and one web search dataset indicate that SSRank consistently and almost always significantly outperforms the baseline methods (unsupervised and supervised learning methods), given the same amount of labeled data. This is because SSRank can effectively leverage the use of unlabeled data in learning.  相似文献   

4.
This paper studies how to learn accurate ranking functions from noisy training data for information retrieval. Most previous work on learning to rank assumes that the relevance labels in the training data are reliable. In reality, however, the labels usually contain noise due to the difficulties of relevance judgments and several other reasons. To tackle the problem, in this paper we propose a novel approach to learning to rank, based on a probabilistic graphical model. Considering that the observed label might be noisy, we introduce a new variable to indicate the true label of each instance. We then use a graphical model to capture the joint distribution of the true labels and observed labels given features of documents. The graphical model distinguishes the true labels from observed labels, and is specially designed for ranking in information retrieval. Therefore, it helps to learn a more accurate model from noisy training data. Experiments on a real dataset for web search show that the proposed approach can significantly outperform previous approaches.  相似文献   

5.
Question answering (QA) aims at finding exact answers to a user’s question from a large collection of documents. Most QA systems combine information retrieval with extraction techniques to identify a set of likely candidates and then utilize some ranking strategy to generate the final answers. This ranking process can be challenging, as it entails identifying the relevant answers amongst many irrelevant ones. This is more challenging in multi-strategy QA, in which multiple answering agents are used to extract answer candidates. As answer candidates come from different agents with different score distributions, how to merge answer candidates plays an important role in answer ranking. In this paper, we propose a unified probabilistic framework which combines multiple evidence to address challenges in answer ranking and answer merging. The hypotheses of the paper are that: (1) the framework effectively combines multiple evidence for identifying answer relevance and their correlation in answer ranking, (2) the framework supports answer merging on answer candidates returned by multiple extraction techniques, (3) the framework can support list questions as well as factoid questions, (4) the framework can be easily applied to a different QA system, and (5) the framework significantly improves performance of a QA system. An extensive set of experiments was done to support our hypotheses and demonstrate the effectiveness of the framework. All of the work substantially extends the preliminary research in Ko et al. (2007a). A probabilistic framework for answer selection in question answering. In: Proceedings of NAACL/HLT.  相似文献   

6.
Diversification of web search results aims to promote documents with diverse content (i.e., covering different aspects of a query) to the top-ranked positions, to satisfy more users, enhance fairness and reduce bias. In this work, we focus on the explicit diversification methods, which assume that the query aspects are known at the diversification time, and leverage supervised learning methods to improve their performance in three different frameworks with different features and goals. First, in the LTRDiv framework, we focus on applying typical learning to rank (LTR) algorithms to obtain a ranking where each top-ranked document covers as many aspects as possible. We argue that such rankings optimize various diversification metrics (under certain assumptions), and hence, are likely to achieve diversity in practice. Second, in the AspectRanker framework, we apply LTR for ranking the aspects of a query with the goal of more accurately setting the aspect importance values for diversification. As features, we exploit several pre- and post-retrieval query performance predictors (QPPs) to estimate how well a given aspect is covered among the candidate documents. Finally, in the LmDiv framework, we cast the diversification problem into an alternative fusion task, namely, the supervised merging of rankings per query aspect. We again use QPPs computed over the candidate set for each aspect, and optimize an objective function that is tailored for the diversification goal. We conduct thorough comparative experiments using both the basic systems (based on the well-known BM25 matching function) and the best-performing systems (with more sophisticated retrieval methods) from previous TREC campaigns. Our findings reveal that the proposed frameworks, especially AspectRanker and LmDiv, outperform both non-diversified rankings and two strong diversification baselines (i.e., xQuAD and its variant) in terms of various effectiveness metrics.  相似文献   

7.
With the development of 3D technology and the increase in 3D models, 2D image-based 3D model retrieval tasks have drawn increased attention from scholars. Previous works align cross-domain features via adversarial domain alignment and semantic alignment. However, the extracted features of previous methods are disturbed by the residual domain-specific features, and the lack of labels for 3D models makes the semantic alignment challenging. Therefore, we propose disentangled feature learning associated with enhanced semantic alignment to address these problems. On one hand, the disentangled feature learning enables decoupling the twisted raw features into the isolated domain-invariant and domain-specific features, and the domain-specific features will be dropped while performing adversarial domain alignment and semantic alignment to acquire domain-invariant features. On the other hand, we mine the semantic consistency by compacting each 3D model sample and its nearest neighbors to further enhance semantic alignment for unlabeled 3D model domain. We give comprehensive experiments on two public datasets, and the results demonstrate the superiority of the proposed method. Especially on MI3DOR-2 dataset, our method outperforms the current state-of-the-art methods with gains of 2.88% for the strictest retrieval metric NN.  相似文献   

8.
This paper is concerned with the quality of training data in learning to rank for information retrieval. While many data selection techniques have been proposed to improve the quality of training data for classification, the study on the same issue for ranking appears to be insufficient. As pointed out in this paper, it is inappropriate to extend technologies for classification to ranking, and the development of novel technologies is sorely needed. In this paper, we study the development of such technologies. To begin with, we propose the concept of “pairwise preference consistency” (PPC) to describe the quality of a training data collection from the ranking point of view. PPC takes into consideration the ordinal relationship between documents as well as the hierarchical structure on queries and documents, which are both unique properties of ranking. Then we select a subset of the original training documents, by maximizing the PPC of the selected subset. We further propose an efficient solution to the maximization problem. Empirical results on the LETOR benchmark datasets and a web search engine dataset show that with the subset of training data selected by our approach, the performance of the learned ranking model can be significantly improved.  相似文献   

9.
相关概念的关联参照检索是概念检索的重要研究内容。本文提出了一种基于主题的语义关联的参照检索模型,通过融合语义网、本体论的相关知识及信息提取等语言处理技术,提取关于特定主题的文档的主题概念及概念之间的关联构成该主题的语义关联模型,并辅助于参照检索过程。  相似文献   

10.
Multi-feature fusion has achieved gratifying performance in image retrieval. However, some existing fusion mechanisms would unfortunately make the result worse than expected due to the domain and visual diversity of images. As a result, a burning problem for applying feature fusion mechanism is how to figure out and improve the complementarity of multi-level heterogeneous features. To this end, this paper proposes an adaptive multi-feature fusion method via cross-entropy normalization for effective image retrieval. First, various low-level features (e.g., SIFT) and high-level semantic features based on deep learning are extracted. Under each level of feature representation, the initial similarity scores of the query image w.r.t. the target dataset are calculated. Second, we use an independent reference dataset to approximate the tail of the attained initial similarity score ranking curve by cross-entropy normalization. Then the area under the ranking curve is calculated as the indicator of the merit of corresponding feature (i.e., a smaller area indicates a more suitable feature.). Finally, fusion weights of each feature are assigned adaptively by the statistically elaborated areas. Extensive experiments on three public benchmark datasets have demonstrated that the proposed method can achieve superior performance compared with the existing methods, improving the metrics mAP by relatively 1.04% (for Holidays), 1.22% (for Oxf5k) and the N-S by relatively 0.04 (for UKbench), respectively.  相似文献   

11.
Breast cancer is one of the leading causes of death among women worldwide. Accurate and early detection of breast cancer can ensure long-term surviving for the patients. However, traditional classification algorithms usually aim only to maximize the classification accuracy, failing to take into consideration the misclassification costs between different categories. Furthermore, the costs associated with missing a cancer case (false negative) are clearly much higher than those of mislabeling a benign one (false positive). To overcome this drawback and further improving the classification accuracy of the breast cancer diagnosis, in this work, a novel breast cancer intelligent diagnosis approach has been proposed, which employed information gain directed simulated annealing genetic algorithm wrapper (IGSAGAW) for feature selection, in this process, we performs the ranking of features according to IG algorithm, and extracting the top m optimal feature utilized the cost sensitive support vector machine (CSSVM) learning algorithm. Our proposed feature selection approach which can not only help to reduce the complexity of SAGASW algorithm and effectively extracting the optimal feature subset to a certain extent, but it can also obtain the maximum classification accuracy and minimum misclassification cost. The efficacy of our proposed approach is tested on Wisconsin Original Breast Cancer (WBC) and Wisconsin Diagnostic Breast Cancer (WDBC) breast cancer data sets, and the results demonstrate that our proposed hybrid algorithm outperforms other comparison methods. The main objective of this study was to apply our research in real clinical diagnostic system and thereby assist clinical physicians in making correct and effective decisions in the future. Moreover our proposed method could also be applied to other illness diagnosis.  相似文献   

12.
李伟 《现代情报》2015,35(3):142-146
信息素质教育融入专业课程教学是高校图书馆开展信息素质教育里最深入、最有效、受益者最多的教育形式。文章分析了模块化信息素质教育和项目式专业课程教学方式和方法,本文认为:两者融入整合教学是可行的,将模块化信息素质教育融入项目式专业课程的课堂教学、网络教学平台和课外学习空间三个方面进行了探索,实践表明:比较灵活的可调控的模块化信息素质教育更易于融入项目式专业课程教学,有助于学生的自主学习和创新思维。  相似文献   

13.
Many machine learning technologies such as support vector machines, boosting, and neural networks have been applied to the ranking problem in information retrieval. However, since originally the methods were not developed for this task, their loss functions do not directly link to the criteria used in the evaluation of ranking. Specifically, the loss functions are defined on the level of documents or document pairs, in contrast to the fact that the evaluation criteria are defined on the level of queries. Therefore, minimizing the loss functions does not necessarily imply enhancing ranking performances. To solve this problem, we propose using query-level loss functions in learning of ranking functions. We discuss the basic properties that a query-level loss function should have and propose a query-level loss function based on the cosine similarity between a ranking list and the corresponding ground truth. We further design a coordinate descent algorithm, referred to as RankCosine, which utilizes the proposed loss function to create a generalized additive ranking model. We also discuss whether the loss functions of existing ranking algorithms can be extended to query-level. Experimental results on the datasets of TREC web track, OHSUMED, and a commercial web search engine show that with the use of the proposed query-level loss function we can significantly improve ranking accuracies. Furthermore, we found that it is difficult to extend the document-level loss functions to query-level loss functions.  相似文献   

14.
郭贵梅 《现代情报》2011,31(8):174-177
本文主要介绍了目前我国网络信息检索用户研究的3个方面,即用户的网络信息检索行为研究现状、用户因素对于网络信息检索过程以及效率的影响以及用户模型构建方面的研究。然后介绍了现有的网络信息检索用户主要的调查方法,最后提出了对于网络信息检索用户研究的展望。  相似文献   

15.
Listwise learning to rank models, which optimize the ranking of a document list, are among the most widely adopted algorithms for finding and ranking relevant documents to user information needs. In this paper, we propose ListMAP, a new listwise learning to rank model with prior distribution that encodes the informativeness of training data and assigns different weights to training instances. The main intuition behind ListMAP is that documents in the training dataset do not have the same impact on training a ranking function. ListMAP formalizes the listwise loss function as a maximum a posteriori estimation problem in which the scoring function must be estimated such that the log probability of the predicted ranked list is maximized given a prior distribution on the labeled data. We provide a model for approximating the prior distribution parameters from a set of observation data. We implement the proposed learning to rank model using neural networks. We theoretically discuss and analyze the characteristics of the introduced model and empirically illustrate its performance on a number of benchmark datasets; namely MQ2007 and MQ2008 of the Letor 4.0 benchmark, Set 1 and Set 2 of the Yahoo! learning to rank challenge data set, and Microsoft 30k and Microsoft 10K datasets. We show that the proposed models are effective across different datasets in terms of information retrieval evaluation metrics NDCG and MRR at positions 1, 3, 5, 10, and 20.  相似文献   

16.
Recent developments have shown that entity-based models that rely on information from the knowledge graph can improve document retrieval performance. However, given the non-transitive nature of relatedness between entities on the knowledge graph, the use of semantic relatedness measures can lead to topic drift. To address this issue, we propose a relevance-based model for entity selection based on pseudo-relevance feedback, which is then used to systematically expand the input query leading to improved retrieval performance. We perform our experiments on the widely used TREC Web corpora and empirically show that our proposed approach to entity selection significantly improves ad hoc document retrieval compared to strong baselines. More concretely, the contributions of this work are as follows: (1) We introduce a graphical probability model that captures dependencies between entities within the query and documents. (2) We propose an unsupervised entity selection method based on the graphical model for query entity expansion and then for ad hoc retrieval. (3) We thoroughly evaluate our method and compare it with the state-of-the-art keyword and entity based retrieval methods. We demonstrate that the proposed retrieval model shows improved performance over all the other baselines on ClueWeb09B and ClueWeb12B, two widely used Web corpora, on the [email protected], and [email protected] metrics. We also show that the proposed method is most effective on the difficult queries. In addition, We compare our proposed entity selection with a state-of-the-art entity selection technique within the context of ad hoc retrieval using a basic query expansion method and illustrate that it provides more effective retrieval for all expansion weights and different number of expansion entities.  相似文献   

17.
基于机器学习的图像检索机制的研究   总被引:1,自引:1,他引:1  
主要针对当前基于低水平特征的图像检索机制不能捕获图像语义的状况 ,讨论了使用长期学习的方法来学习用户的相关反馈 ,以此推断语义 ,构造语义空间 ,并结合短期学习方法 ,通过运用学习监督机制来推断用户信息需求 ,优化查询 ,逐步提高搜索引擎的性能  相似文献   

18.
Ranking is a central component in information retrieval systems; as such, many machine learning methods for building rankers have been developed in recent years. An open problem is transfer learning, i.e. how labeled training data from one domain/market can be used to build rankers for another. We propose a flexible transfer learning strategy based on sample selection. Source domain training samples are selected if the functional relationship between features and labels do not deviate much from that of the target domain. This is achieved through a novel application of recent advances from density ratio estimation. The approach is flexible, scalable, and modular. It allows many existing supervised rankers to be adapted to the transfer learning setting. Results on two datasets (Yahoo’s Learning to Rank Challenge and Microsoft’s LETOR data) show that the proposed method gives robust improvements.  相似文献   

19.
Due to the proliferation and abundance of information on the web, ranking algorithms play an important role in web search. Currently, there are some ranking algorithms based on content and connectivity such as BM25 and PageRank. Unfortunately, these algorithms have low precision and are not always satisfying for users. In this paper, we propose an adaptive method, called A3CRank, based on the content, connectivity, and click-through data triple. Our method tries to aggregate ranking algorithms such as BM25, PageRank, and TF-IDF. We have used reinforcement learning to incorporate user behavior and find a measure of user satisfaction for each ranking algorithm. Furthermore, OWA, an aggregation operator is used for merging the results of the various ranking algorithms. A3CRank adapts itself with user needs and makes use of user clicks to aggregate the results of ranking algorithms. A3CRank is designed to overcome some of the shortcomings of existing ranking algorithms by combining them together and producing an overall better ranking criterion. Experimental results indicate that A3CRank outperforms other combinational ranking algorithms such as Ranking SVM in terms of P@n and NDCG metrics. We have used 130 queries on University of California at Berkeley’s web to train and evaluate our method.  相似文献   

20.
This paper presents a novel IR-style keyword search model for semantic web data retrieval, distinguished from current retrieval methods. In this model, an answer to a keyword query is a connected subgraph that contains all the query keywords. In addition, the answer is minimal because any proper subgraph can not be an answer to the query. We provide an approximation algorithm to retrieve these answers efficiently. A special ranking strategy is also proposed so that answers can be appropriately ordered. The experimental results over real datasets show that our model outperforms existing possible solutions with respect to effectiveness and efficiency.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号