Similar Documents
20 similar documents retrieved (search time: 31 ms)
1.
This paper presents a laboratory-based evaluation study of cross-language information retrieval technologies using the partially parallel test collections NTCIR-2 (together with NTCIR-1), which provide Japanese–English parallel document collections, parallel topic sets, and their relevance judgments. These resources enable us to observe and compare monolingual retrieval processes in two languages as well as retrieval across languages. Our experiments focused on (1) the Rosetta stone question (does a partially parallel collection help cross-language information access?) and (2) two aspects of retrieval difficulty, namely “collection discrepancy” and “query discrepancy”. Japanese and English monolingual retrieval systems are combined via dictionary-based query translation modules to implement a symmetrical bilingual evaluation environment.

2.
A comparative study of two types of patent retrieval tasks, technology survey and invalidity search, using the NTCIR-3 and NTCIR-4 test collections is described, with a focus on the effectiveness of pseudo-feedback and of different retrieval models. Invalidity searches are peculiar to patent retrieval and are characterized by small numbers of relevant documents and long queries. Effectiveness behaves differently when different retrieval models and pseudo-feedback are applied. These differences are analyzed in terms of the “weak cluster hypothesis”, i.e., terminological cohesiveness across relevant documents.

3.
This paper examines the factors affecting the performance of global query expansion based on term co-occurrence data and suggests a way to maximize retrieval effectiveness. The major parameters optimized through experiments are the term similarity measure and the weighting scheme for additional terms. The evaluation of four similarity measures tested in query expansion reveals that mutual information and Yule's Y, which emphasize low-frequency terms, achieve better performance than the cosine and Jaccard coefficients, which have the reverse tendency. In the evaluation of three weighting schemes, similarity weight performs well only with short queries, whereas fixed weights of approximately 0.5 and similarity rank weights were effective with queries of any length. Furthermore, the optimal similarity rank weight achieving the best overall performance seems to be the least affected by the test collection and the number of additional terms. For retrieval efficiency, the number of additional terms need not exceed 70 in our test collections, but the optimal number may vary according to the characteristics of the similarity measure employed.
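The four association measures compared in this abstract can all be computed from a 2×2 term co-occurrence table. A minimal sketch (the function name and table layout are illustrative, not taken from the paper):

```python
import math

def cooccurrence_similarities(n11, n1_, n_1, n):
    """Similarity between terms t1, t2 from a 2x2 co-occurrence table.

    n11: documents containing both terms
    n1_: documents containing t1; n_1: documents containing t2
    n:   total number of documents
    """
    n10 = n1_ - n11          # t1 without t2
    n01 = n_1 - n11          # t2 without t1
    n00 = n - n11 - n10 - n01
    # Pointwise mutual information: favours low-frequency term pairs.
    mi = math.log((n11 * n) / (n1_ * n_1)) if n11 > 0 else float("-inf")
    # Yule's Y (coefficient of colligation), also emphasises rare terms.
    a, b = math.sqrt(n11 * n00), math.sqrt(n10 * n01)
    yule_y = (a - b) / (a + b) if (a + b) > 0 else 0.0
    # Cosine and Jaccard favour high-frequency pairs by comparison.
    cosine = n11 / math.sqrt(n1_ * n_1)
    jaccard = n11 / (n1_ + n_1 - n11)
    return {"mi": mi, "yule_y": yule_y, "cosine": cosine, "jaccard": jaccard}
```

Note how MI grows as the marginal frequencies shrink while cosine and Jaccard depend only on the overlap ratio, which is the frequency bias the abstract describes.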

4.
The quality of feedback documents is crucial to the effectiveness of query expansion (QE) in ad hoc retrieval. Recently, machine learning methods have been adopted to tackle this issue by training classifiers from feedback documents. However, the lack of proper training data has prevented these methods from selecting good feedback documents. In this paper, we propose a new method, called AdapCOT, which applies co-training in an adaptive manner to select feedback documents for boosting QE’s effectiveness. Co-training is an effective technique for classification over limited training data, which is particularly suitable for selecting feedback documents. The proposed AdapCOT method makes use of a small set of training documents, and labels the feedback documents according to their quality through an iterative process. Two exclusive sets of term-based features are selected to train the classifiers. Finally, QE is performed on the labeled positive documents. Our extensive experiments show that the proposed method improves QE’s effectiveness, and outperforms strong baselines on various standard TREC collections.

5.
This paper proposes a method to improve the retrieval performance of the vector space model (VSM), in part by utilizing user-supplied information about which documents are relevant to the query in question. In addition to the user's relevance feedback, information such as original document similarities is incorporated into the retrieval model, which is built using a sequence of linear transformations. High-dimensional, sparse vectors are then reduced by singular value decomposition (SVD) and transformed into a low-dimensional vector space, namely the space representing the latent semantic meanings of words. The method has been tested on two test collections, Medline and Cranfield. To train the model, multiple partitions are created for each collection. Compared with the latent semantic indexing (LSI) model, the improvements in average precision, averaged over all partitions, are 20.57% (Medline) and 22.23% (Cranfield) on the training data, and 0.47% (Medline) and 4.78% (Cranfield) on the test data. The proposed method makes it possible to preserve user-supplied relevance information in the system for the long term, so that it can be used later.
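The SVD-based reduction step described above can be sketched as a generic truncated-SVD projection with NumPy (the paper's additional relevance-feedback linear transformations are omitted; the function name is illustrative):

```python
import numpy as np

def lsi_project(term_doc, k):
    """Reduce a term-document matrix to a k-dimensional latent space
    via truncated SVD, as in latent semantic indexing (LSI)."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Documents in the k-dimensional latent space (one row per document).
    doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T

    def fold_in(query_vec):
        # Fold a query into the same space: q_hat = q^T U_k S_k^{-1}
        return query_vec @ U[:, :k] / s[:k]

    return doc_vecs, fold_in
```

Queries folded in this way can be compared to `doc_vecs` by cosine similarity in the reduced space.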

6.
This paper describes our novel retrieval model based on the contexts of query terms in documents (i.e., document contexts). Our model is novel because it explicitly takes document contexts into account instead of using them implicitly to find query expansion terms. The model is based on simulating a user making relevance decisions, and it is a hybrid of various existing effective models and techniques. It estimates the relevance decision preference of a document context as a log-odds and uses smoothing techniques, as found in language models, to solve the problem of zero probabilities. It combines these estimated preferences of document contexts using different types of aggregation operators that comply with different relevance decision principles (e.g., the aggregate relevance principle). Our model is evaluated using retrospective experiments (i.e., with full relevance information), because such experiments can (a) reveal the potential of our model, (b) isolate the problems of the model from those of parameter estimation, (c) provide information about the major factors affecting the retrieval effectiveness of the model, and (d) show whether the model obeys the probability ranking principle. Our model is promising, as its mean average precision is 60–80% in our experiments using different TREC ad hoc English collections and the NTCIR-5 ad hoc Chinese collection. Our experiments showed (a) that the operators consistent with the aggregate relevance principle were effective in combining the estimated preferences, and (b) that estimating probabilities using the contexts in relevant documents can produce better retrieval effectiveness than using the entire relevant documents.
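A minimal sketch of a smoothed log-odds preference for a document context follows. Additive smoothing is used here purely for simplicity; the paper applies language-model smoothing techniques, and the term counts and lengths are hypothetical inputs:

```python
import math

def context_log_odds(tf_rel, len_rel, tf_nonrel, len_nonrel, mu=0.5):
    """Log-odds that a document context prefers relevance, estimated from a
    query term's frequency in relevant vs. non-relevant context samples.
    Additive smoothing (mu) avoids zero probabilities."""
    p_rel = (tf_rel + mu) / (len_rel + 2 * mu)
    p_non = (tf_nonrel + mu) / (len_nonrel + 2 * mu)
    return math.log(p_rel / p_non)
```

Per-context scores of this form can then be combined with an aggregation operator (e.g., a sum or a max over the contexts of a document).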

7.
Through the recent NTCIR workshops, patent retrieval has posed many challenging issues to the information retrieval community. Unlike newspaper articles, patent documents are very long and well structured. These characteristics raise the need to reassess existing retrieval techniques, which have mainly been developed for structure-less, short documents such as newspaper articles. This study investigates cluster-based retrieval in the context of the invalidity search task of patent retrieval. Cluster-based retrieval assumes that clusters provide additional evidence for matching a user's information need. Thus far, cluster-based retrieval approaches have relied on automatically created clusters. Fortunately, all patents carry manually assigned cluster information: international patent classification codes. International patent classification is a standard taxonomy for classifying patents, currently comprising about 69,000 nodes organized into a five-level hierarchy. Patent documents therefore provide an excellent test bed for developing and evaluating cluster-based retrieval techniques. Experiments using the NTCIR-4 patent collection showed that the cluster-based language model can improve on the cluster-less baseline language model.
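A cluster-based language model of the kind evaluated here can be sketched as a three-way interpolation of document, cluster (e.g. IPC class), and collection models. The interpolation weights and function signature are illustrative assumptions, not the paper's exact formulation:

```python
import math
from collections import Counter

def cluster_lm_score(query_terms, doc, cluster_docs, collection, lam=0.5, beta=0.3):
    """Score log P(q|d), where the document model is interpolated with its
    cluster's model (e.g. the patent's IPC class) and the collection model."""
    d = Counter(doc)
    c = Counter()
    for cd in cluster_docs:
        c.update(cd)
    coll = Counter(collection)
    dlen, clen, colllen = sum(d.values()), sum(c.values()), sum(coll.values())
    score = 0.0
    for t in query_terms:
        p = (lam * (d[t] / dlen if dlen else 0.0)
             + beta * (c[t] / clen if clen else 0.0)
             + (1 - lam - beta) * coll[t] / colllen)
        score += math.log(p)  # assumes every query term occurs in the collection
    return score
```

The cluster component rewards documents whose class-mates also use the query vocabulary, which is the extra evidence the abstract refers to.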

8.
This paper tackles the problem of how one might select further search terms, using relevance feedback, given the search terms in the query. These search terms are extracted from a maximum spanning tree connecting all the terms in the index term vocabulary. A number of different spanning trees are generated from a variety of association measures. The retrieval effectiveness for the different spanning trees is shown to be approximately the same. Effectiveness is measured in terms of precision and recall, and the retrieval tests are done on three different test collections.
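A maximum spanning tree over term-association weights can be built with Prim's algorithm by always adding the heaviest edge crossing the tree boundary; a stdlib-only sketch (the quadratic inner search is fine for illustration):

```python
def maximum_spanning_tree(terms, weight):
    """Prim's algorithm for a *maximum* spanning tree over the term
    vocabulary; weight(t1, t2) is any term-association score."""
    in_tree = {terms[0]}
    edges = []
    while len(in_tree) < len(terms):
        # Heaviest edge from the tree to a term not yet in the tree.
        best = max(
            ((u, v, weight(u, v)) for u in in_tree for v in terms if v not in in_tree),
            key=lambda e: e[2],
        )
        edges.append(best)
        in_tree.add(best[1])
    return edges
```

Candidate expansion terms are then the tree neighbours of the original query terms.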

9.
For constrained linear parameter-varying systems subject to bounded disturbances and noise, this article investigates an off-line output feedback robust model predictive control approach. The sub-observer gains with robust positively invariant sets and the sub-controller gains with robust control invariant sets are simultaneously optimized off-line and stored in a look-up table. According to real-time estimation error bounds and estimated states, the time-varying sub-observer and sub-controller gains are searched on-line. The proposed approach theoretically guarantees nested robust positively invariant sets and robust control invariant sets while reducing the on-line computational burden.

10.
Query translation is a viable method for cross-language information retrieval (CLIR), but it suffers from translation ambiguities caused by multiple translations of individual query terms. Previous research has employed various methods for disambiguation, including the method of selecting an individual target query term from multiple candidates by comparing their statistical associations with the candidate translations of other query terms. This paper proposes a new method where we examine all combinations of target query term translations corresponding to the source query terms, instead of looking at the candidates for each query term and selecting the best one at a time. The goodness value for a combination of target query terms is computed based on the association value between each pair of the terms in the combination. We tested our method using the NTCIR-3 English–Korean CLIR test collection. The results show some improvements regardless of the association measures we used.
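The combination-based disambiguation described above can be sketched as an exhaustive search over the Cartesian product of candidate translations, scoring each combination by its summed pairwise association (exhaustive search is exponential in query length; the function and variable names are illustrative):

```python
from itertools import combinations, product

def best_translation_combination(candidates, assoc):
    """candidates: one list of translation candidates per source query term.
    assoc(a, b): association score between two target-language terms.
    Returns the combination maximising the sum of pairwise associations."""
    def goodness(combo):
        return sum(assoc(a, b) for a, b in combinations(combo, 2))
    return max(product(*candidates), key=goodness)
```

Unlike one-term-at-a-time selection, every pair in a combination contributes to its score, so mutually coherent translations win even if no single candidate dominates.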

11.
In this paper we present a new algorithm for relevance feedback (RF) in information retrieval. Unlike conventional RF algorithms which use the top ranked documents for feedback, our proposed algorithm is a kind of active feedback algorithm which actively chooses documents for the user to judge. The objectives are (a) to increase the number of judged relevant documents and (b) to increase the diversity of judged documents during the RF process. The algorithm uses document-contexts by splitting the retrieval list into sub-lists according to the query term patterns that exist in the top ranked documents. Query term patterns include a single query term, a pair of query terms that occur in a phrase and query terms that occur in proximity. The algorithm is an iterative algorithm which takes one document for feedback in each of the iterations. We experiment with the algorithm using the TREC-6, -7, -8, -2005 and GOV2 data collections and we simulate user feedback using the TREC relevance judgements. From the experimental results, we show that our proposed split-list algorithm is better than the conventional RF algorithm and that our algorithm is more reliable than a similar algorithm using maximal marginal relevance.

12.
Re-using research resources is essential for advancing knowledge and developing repeatable, empirically solid experiments in scientific fields, including interactive information retrieval (IIR). Despite recent efforts to standardize research re-use and documentation, how to quantitatively measure the reusability of IIR resources remains an open challenge. Inspired by reusability evaluations of Cranfield experiments, our work proactively explores the problem of measuring IIR test collection reusability and makes three contributions: (1) constructing a novel usefulness-oriented framework with specific analytical methods for evaluating the reusability of IIR test collections consisting of query sets, document/page sets, and sets of task-document usefulness (tuse); (2) explaining the potential impacts of varying IIR-specific factors (e.g. search tasks, sessions, user characteristics) on test collection reusability; (3) proposing actionable methods for building reusable test collections in IIR and thereby amortizing the true cost of user-oriented evaluations. The Cranfield-inspired reusability assessment framework serves as an initial step towards accurately evaluating the reusability of IIR research resources and measuring the reproducibility of IIR evaluation results. It also demonstrates an innovative approach to integrating the insights from individual heterogeneous user studies with the evaluation techniques developed in standardized ad hoc retrieval experiments, which will facilitate the maturation of the IIR field and eventually benefit both sides of research.

13.
Most existing search engines focus on document retrieval. However, information needs are certainly not limited to finding relevant documents. Instead, a user may want to find relevant entities such as persons and organizations. In this paper, we study the problem of related entity finding. Our goal is to rank entities based on their relevance to a structured query, which specifies an input entity, the type of related entities and the relation between the input and related entities. We first discuss a general probabilistic framework, derive six possible retrieval models to rank the related entities, and then compare these models both analytically and empirically. To further improve performance, we study the problem of feedback in the context of related entity finding. Specifically, we propose a mixture model based feedback method that can utilize the pseudo feedback entities to estimate an enriched model for the relation between the input and related entities. Experimental results over two standard TREC collections show that the derived relation generation model combined with a relation feedback method performs better than other models.

14.
The classical probabilistic models attempt to capture the ad hoc information retrieval problem within a rigorous probabilistic framework. It has long been recognized that the primary obstacle to the effective performance of the probabilistic models is the need to estimate a relevance model. The Dirichlet compound multinomial (DCM) distribution based on the Polya urn scheme, which can also be considered as a hierarchical Bayesian model, is a more appropriate generative model than the traditional multinomial distribution for text documents. We explore a new probabilistic model based on the DCM distribution, which enables efficient retrieval and accurate ranking. Because the DCM distribution captures the dependency of repetitive word occurrences, the new probabilistic model based on this distribution is able to model the concavity of the score function more effectively. To avoid the empirical tuning of retrieval parameters, we design several parameter estimation algorithms to automatically set model parameters. Additionally, we propose a pseudo-relevance feedback algorithm based on the mixture modeling of the Dirichlet compound multinomial distribution to further improve retrieval accuracy. Finally, our experiments show that both the baseline probabilistic retrieval algorithm based on the DCM distribution and the corresponding pseudo-relevance feedback algorithm outperform the existing language modeling systems on several TREC retrieval tasks. The main objective of this research is to develop an effective probabilistic model based on the DCM distribution. A secondary objective is to provide a thorough understanding of the probabilistic retrieval model by a theoretical understanding of various text distribution assumptions.
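The DCM distribution's "burstiness" — the dependency of repetitive word occurrences mentioned above — can be seen directly from its log-probability; a minimal sketch of the standard DCM mass function (the parameter values below are illustrative):

```python
import math

def dcm_log_prob(counts, alpha):
    """Log-probability of a word-count vector under the Dirichlet compound
    multinomial (Polya urn) distribution with parameters alpha. Unlike the
    multinomial, a word becomes more likely each time it recurs."""
    n = sum(counts)
    A = sum(alpha)
    # Multinomial coefficient: log n! - sum log x_w!
    lp = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Normalising term: log Gamma(A) - log Gamma(A + n)
    lp += math.lgamma(A) - math.lgamma(A + n)
    # Per-word terms: log Gamma(alpha_w + x_w) - log Gamma(alpha_w)
    lp += sum(math.lgamma(a + x) - math.lgamma(a) for a, x in zip(alpha, counts))
    return lp
```

With a uniform alpha of 1 over two words, the bursty count vector [2, 0] gets probability 1/3, versus 1/4 under a fair multinomial — the concavity effect the abstract exploits.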

15.
We are interested in how ideas from document clustering can be used to improve the retrieval accuracy of ranked lists in interactive systems. In particular, we are interested in ways to evaluate the effectiveness of such systems to decide how they might best be constructed. In this study, we construct and evaluate systems that present the user with ranked lists and a visualization of inter-document similarities. We first carry out a user study to evaluate the clustering/ranked list combination on instance-oriented retrieval, the task of the TREC-6 Interactive Track. We find that although users generally prefer the combination, they are not able to use it to improve effectiveness. In the second half of this study, we develop and evaluate an approach that more directly combines the ranked list with information from inter-document similarities. Using the TREC collections and relevance judgments, we show that it is possible to realize substantial improvements in effectiveness by doing so, and that although users can use the combined information effectively, the system can provide hints that substantially improve on the user's solo effort. The resulting approach shares much in common with an interactive application of incremental relevance feedback. Throughout this study, we illustrate our work using two prototype systems constructed for these evaluations. The first, AspInQuery, is a classic information retrieval system augmented with a specialized tool for recording information about instances of relevance. The other system, Lighthouse, is a Web-based application that combines a ranked list with a portrayal of inter-document similarity. Lighthouse can work with collections such as TREC, as well as the results of Web search engines.

16.
It is well known that relevance feedback is a significant method for improving the effectiveness of information retrieval systems. Improving effectiveness is important because these systems must provide access to large document collections distributed over distant sites, which makes retrieving relevant documents considerably harder. Relevance feedback can be viewed as an aid to the information retrieval task. In this paper, a relevance feedback strategy is presented. The strategy is based on back-propagating the relevance of retrieved documents using an algorithm developed in a neural approach. The paper describes a neural information retrieval model and emphasizes the results obtained with the associated relevance back-propagation algorithm in three different environments: manual ad hoc, automatic ad hoc, and a mixed ad hoc strategy (automatic plus manual).

17.
Rocchio relevance feedback and latent semantic indexing (LSI) are well-known extensions of the vector space model for information retrieval (IR). This paper analyzes the statistical relationship between these extensions. The analysis focuses on each method’s basis in least-squares optimization. Noting that LSI and Rocchio relevance feedback both alter the vector space model in a way that is in some sense least-squares optimal, we ask: what is the relationship between LSI’s and Rocchio’s notions of optimality? What does this relationship imply for IR? Using an analytical approach, we argue that Rocchio relevance feedback is optimal if we understand retrieval as a simplified classification problem. On the other hand, LSI’s motivation comes to the fore if we understand it as a biased regression technique, where projection onto a low-dimensional orthogonal subspace of the documents reduces model variance.
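For reference, the classic Rocchio update analyzed in this paper moves the query vector toward the centroid of relevant documents and away from the centroid of non-relevant ones. A standard sketch (the α, β, γ defaults below are conventional values, not taken from the paper):

```python
import numpy as np

def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback: q' = alpha*q + beta*mean(rel) - gamma*mean(nonrel)."""
    q = alpha * np.asarray(query, dtype=float)
    if len(rel_docs):
        q += beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q -= gamma * np.mean(nonrel_docs, axis=0)
    return np.clip(q, 0.0, None)  # negative term weights are usually dropped
```

Viewed as the paper suggests, this centroid difference is exactly the least-squares solution to a two-class (relevant vs. non-relevant) classification of the collection.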

18.
Pseudo-relevance feedback is the basis of a category of automatic query modification techniques. Pseudo-relevance feedback methods assume the initially retrieved set of documents to be relevant, and then use these documents to extract further relevant terms for the query or simply to re-weight the user's original query. In this paper, we propose a straightforward yet effective use of pseudo-relevance feedback for detecting more informative query terms and re-weighting them. The query-by-query analysis of our results indicates that our method is capable of identifying the most important keywords even in short queries. Our main idea is that some of the top documents may contain a context closer to the user's information need than the others. Therefore, re-examining the similarity of those top documents and weighting the set based on their context can help identify and re-weight informative query terms. Our experimental results on standard English and Persian test collections show that our method improves retrieval performance, in terms of the MAP criterion, by up to 7% over traditional query term re-weighting methods.
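A generic pseudo-relevance feedback expansion loop — the baseline this paper builds on, not its context-based weighting — can be sketched as follows (the tf-idf scoring and the `df`/`n_docs` inputs are illustrative assumptions):

```python
import math
from collections import Counter

def prf_expand(query_terms, top_docs, df, n_docs, k=10):
    """Assume the top-ranked documents are relevant; score their terms by
    tf-idf accumulated over those documents and append the top-k new terms."""
    scores = Counter()
    for doc in top_docs:
        for term, tf in Counter(doc).items():
            scores[term] += tf * math.log(n_docs / (1 + df.get(term, 0)))
    # Original query terms keep priority; top-k new terms are appended.
    expansion = [t for t, _ in scores.most_common() if t not in query_terms][:k]
    return list(query_terms) + expansion
```

The paper's refinement would additionally weight each document's contribution by how close its context is to the query before accumulating these scores.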

19.
In information retrieval, the task of query performance prediction (QPP) is concerned with determining in advance the performance of a given query within the context of a retrieval model. QPP plays an important role in ensuring proper handling of queries with varying levels of difficulty. Based on the extant literature, query specificity is an important indicator of query performance and is typically estimated using corpus-specific frequency-based specificity metrics. However, such metrics do not consider term semantics and inter-term associations. Our work distinguishes itself by proposing a host of corpus-independent specificity metrics that are based on pre-trained neural embeddings and leverage geometric relations between terms in the embedding space to capture the semantics of terms and their interdependencies. Specifically, we propose three classes of specificity metrics based on pre-trained neural embeddings: neighborhood-based, graph-based, and cluster-based metrics. Through two extensive and complementary sets of experiments, we show that the proposed specificity metrics (1) are suitable specificity indicators, based on gold standards derived from knowledge hierarchies (the Wikipedia category hierarchy and the DMOZ taxonomy), and (2) have better or competitive performance compared to state-of-the-art QPP metrics, based on both TREC ad hoc collections (namely Robust’04, Gov2, and ClueWeb’09) and the ANTIQUE question answering collection. The proposed graph-based specificity metrics, especially those capturing a larger number of inter-term associations, proved to be the most effective in both query specificity estimation and QPP. We have also publicly released two test collections (i.e. specificity gold standards) that we built from the Wikipedia and DMOZ knowledge hierarchies.
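As an illustration of the neighborhood-based class of metrics, one simple corpus-independent specificity score over pre-trained embeddings might look as follows (an assumed variant for illustration, not the paper's exact definition):

```python
import numpy as np

def neighborhood_specificity(term_vec, vocab_vecs, k=10):
    """Neighborhood-based specificity sketch: a term whose k nearest
    neighbours in the embedding space are very close to it is taken to be
    specific; a term with a loose neighbourhood is generic."""
    v = term_vec / np.linalg.norm(term_vec)
    M = vocab_vecs / np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    sims = M @ v                 # cosine similarity to every vocabulary term
    top = np.sort(sims)[-k:]     # the k nearest neighbours (the term itself
                                 # may be among them; harmless here)
    return float(top.mean())
```

Averaged over a query's terms, such a score serves as a pre-retrieval QPP signal with no collection statistics required.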

20.
In this paper, the boundary stabilization problem for a class of unstable reaction–advection–diffusion (RAD) systems described by a scalar parabolic partial differential equation (PDE) is considered. Unlike previous research, we present a new gradient-based optimization framework for designing the optimal feedback kernel that stabilizes the unstable PDE system. Our method does not require solving non-standard Riccati-type or Klein–Gordon-type PDEs. Instead, the feedback kernel is parameterized as a second-order polynomial whose coefficients are decision variables tuned via gradient-based dynamic optimization, where the gradients of the system cost functional (which penalizes both kernel and output magnitude) with respect to the decision parameters are computed by solving a so-called “costate” PDE in standard form. Special constraints are imposed on the kernel coefficients to ensure that the optimized kernel yields closed-loop stability. Finally, three numerical examples are presented to verify the effectiveness of the proposed approach.
