首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Two-stage statistical language models for text database selection   总被引:2,自引:0,他引:2  
As the number and diversity of distributed Web databases on the Internet exponentially increase, it is difficult for user to know which databases are appropriate to search. Given database language models that describe the content of each database, database selection services can provide assistance in locating databases relevant to the information needs of users. In this paper, we propose a database selection approach based on statistical language modeling. The basic idea behind the approach is that, for databases that are categorized into a topic hierarchy, individual language models are estimated at different search stages, and then the databases are ranked by the similarity to the query according to the estimated language model. Two-stage smoothed language models are presented to circumvent inaccuracy due to word sparseness. Experimental results demonstrate that such a language modeling approach is competitive with current state-of-the-art database selection approaches.  相似文献   

2.
Index maintenance strategies employed by dynamic text retrieval systems based on inverted files can be divided into two categories: merge-based and in-place update strategies. Within each category, individual update policies can be distinguished based on whether they store their on-disk posting lists in a contiguous or in a discontiguous fashion. Contiguous inverted lists, in general, lead to higher query performance, by minimizing the disk seek overhead at query time, while discontiguous inverted lists lead to higher update performance, requiring less effort during index maintenance operations. In this paper, we focus on retrieval systems with high query load, where the on-disk posting lists have to be stored in a contiguous fashion at all times. We discuss a combination of re-merge and in-place index update, called Hybrid Immediate Merge. The method performs strictly better than the re-merge baseline policy used in our experiments, as it leads to the same query performance, but substantially better update performance. The actual time savings achievable depend on the size of the text collection being indexed; a larger collection results in greater savings. In our experiments, variations of Hybrid Immediate Merge were able to reduce the total index update overhead by up to 73% compared to the re-merge baseline.
Stefan BüttcherEmail:
  相似文献   

3.
In retrieving medical free text, users are often interested in answers pertinent to certain scenarios that correspond to common tasks performed in medical practice, e.g., treatment or diagnosis of a disease. A major challenge in handling such queries is that scenario terms in the query (e.g., treatment) are often too general to match specialized terms in relevant documents (e.g., chemotherapy). In this paper, we propose a knowledge-based query expansion method that exploits the UMLS knowledge source to append the original query with additional terms that are specifically relevant to the query's scenario(s). We compared the proposed method with traditional statistical expansion that expands terms which are statistically correlated but not necessarily scenario specific. Our study on two standard testbeds shows that the knowledge-based method, by providing scenario-specific expansion, yields notable improvements over the statistical method in terms of average precision-recall. On the OHSUMED testbed, for example, the improvement is more than 5% averaging over all scenario-specific queries studied and about 10% for queries that mention certain scenarios, such as treatment of a disease and differential diagnosis of a symptom/disease.
Wesley W. ChuEmail:
  相似文献   

4.
Intelligent use of the many diverse forms of data available on the Internet requires new tools for managing and manipulating heterogeneous forms of information. This paper uses WHIRL, an extension of relational databases that can manipulate textual data using statistical similarity measures developed by the information retrieval community. We show that although WHIRL is designed for more general similarity-based reasoning tasks, it is competitive with mature systems designed explicitly for inductive classification. In particular, WHIRL is well suited for combining different sources of knowledge in the classification process. We show on a diverse set of tasks that the use of appropriate sets of unlabeled background knowledge often decreases error rates, particularly if the number of examples or the size of the strings in the training set is small. This is especially useful when labeling text is a labor-intensive job and when there is a large amount of information available about a particular problem on the World Wide Web.
Haym HirshEmail:
  相似文献   

5.
This study examines the use of an ontology as a search tool. Sixteen subjects created queries using Concept-based Information Retrieval Interface (CIRI) and a regular baseline IR interface. The simulated work task method was used to make the searching situations realistic. Subjects’ search experiences, queries and search results were examined. The numbers of search concepts and keys, as well as their overlap in the queries were investigated. The effectiveness of the CIRI and baseline queries was compared. An Ontology Index (OI) was calculated for all search tasks and the correlation between the OI and the overlap of search concepts and keys in queries was investigated. The number of search keys and concepts was higher in CIRI queries than in baseline interface queries. Also the overlap of search keys was higher among CIRI users than among baseline users. These both findings are due to CIRI’s expansion feature. There was no clear correlation between OI and overlap of search concepts and keys. The search results were evaluated with generalised precision and recall, and relevance scores based on individual relevance assessments. The baseline interface queries performed better in all comparisons, but the difference was statistically significant only in relevance scores based on individual relevance assessments.  相似文献   

6.
In distributed information retrieval systems, document overlaps occur frequently among different component databases. This paper presents an experimental investigation and evaluation of a group of result merging methods including the shadow document method and the multi-evidence method in the environment of overlapping databases. We assume, with the exception of resultant document lists (either with rankings or scores), no extra information about retrieval servers and text databases is available, which is the usual case for many applications on the Internet and the Web. The experimental results show that the shadow document method and the multi-evidence method are the two best methods when overlap is high, while Round-robin is the best for low overlap. The experiments also show that [0,1] linear normalization is a better option than linear regression normalization for result merging in a heterogeneous environment.
Sally McCleanEmail:
  相似文献   

7.
With the help of a team of expert biologist judges, the TREC Genomics track has generated four large sets of “gold standard” test collections, comprised of over a hundred unique topics, two kinds of ad hoc retrieval tasks, and their corresponding relevance judgments. Over the years of the track, increasingly complex tasks necessitated the creation of judging tools and training guidelines to accommodate teams of part-time short-term workers from a variety of specialized biological scientific backgrounds, and to address consistency and reproducibility of the assessment process. Important lessons were learned about factors that influenced the utility of the test collections including topic design, annotations provided by judges, methods used for identifying and training judges, and providing a central moderator “meta-judge”.  相似文献   

8.
The effective representation of the relationship between the documents and their contents is crucial to increase classification performance of text documents in the text classification. Term weighting is a preprocess aiming to represent text documents better in Vector Space by assigning proper weights to terms. Since the calculation of the appropriate weight values directly affects performance of the text classification, in the literature, term weighting is still one of the important sub-research areas of text classification. In this study, we propose a novel term weighting (MONO) strategy which can use the non-occurrence information of terms more effectively than existing term weighting approaches in the literature. The proposed weighting strategy also performs intra-class document scaling to supply better representations of distinguishing capabilities of terms occurring in the different quantity of documents in the same quantity of class. Based on the MONO weighting strategy, two novel supervised term weighting schemes called TF-MONO and SRTF-MONO were proposed for text classification. The proposed schemes were tested with two different classifiers such as SVM and KNN on 3 different datasets named Reuters-21578, 20-Newsgroups, and WebKB. The classification performances of the proposed schemes were compared with 5 different existing term weighting schemes in the literature named TF-IDF, TF-IDF-ICF, TF-RF, TF-IDF-ICSDF, and TF-IGM. The results obtained from 7 different schemes show that SRTF-MONO generally outperformed other schemes for all three datasets. Moreover, TF-MONO has promised both Micro-F1 and Macro-F1 results compared to other five benchmark term weighting methods especially on the Reuters-21578 and 20-Newsgroups datasets.  相似文献   

9.
10.
在大规模信息检索领域,随着高速网络技术的迅速发展,分布式并行信息检索技术由于其高效性与经济性而受到越来越多的重视。结合基础教育资源搜索引擎的设计开发,讨论分布式并行信息检索系统中涉及的数据分布、查询任务分解及节点冗余等关键技术。  相似文献   

11.
Information Retrieval Systeme haben in den letzten Jahren nur geringe Verbesserungen in der Retrieval Performance erzielt. Wir arbeiten an neuen Ans?tzen, dem sogenannten Collaborativen Information Retrieval (CIR), die das Potential haben, starke Verbesserungen zu erreichen. CIR ist die Methode, mit der durch Ausnutzen von Informationen aus früheren Anfragen die Retrieval Peformance für die aktuelle Anfrage verbessert wird. Wir haben ein eingeschr?nktes Szenario, in dem nur alte Anfragen und dazu relevante Antwortdokumente zur Verfügung stehen. Neue Ans?tze für Methoden der Query Expansion führen unter diesen Bedingungen zu Verbesserungen der Retrieval Performance . The accuracy of ad-hoc document retrieval systems has reached a stable plateau in the last few years. We are working on so-called collaborative information retrieval (CIR) systems which have the potential to overcome the current limits. We define CIR as a task, where an information retrieval (IR) system uses information gathered from previous search processes from one or several users to improve retrieval performance for the current user searching for information. We focus on a restricted setting in CIR in which only old queries and correct answer documents to these queries are available for improving a new query. For this restricted setting we propose new approaches for query expansion procedures. We show how CIR methods can improve overall IR performance.
CR Subject Classification H.3.3  相似文献   

12.
查新检索机构需要对长期积累的用户进行忠诚度和利润分析,找出关键用户,并采取CIO(首席信息官)负责制方式向其开展推送相关情报产业链的服务,为关键用户设计并推送提醒式、关怀式和增值型的信息服务内容。对关键用户资源整合,以达到某种信息服务的目的,实现"用户资源动起来,服务内容动起来,信息人员动起来"。这样不仅实现了用户在研究型图书馆内部的有效流转,还可提升查新检索机构的可持续发展能力。  相似文献   

13.
中文期刊全文数据库检索方法与技巧   总被引:5,自引:0,他引:5  
在人类迈入信息时代的今天,掌握计算机信息检索技能,已成为各类专业人员的基本功.目前,无论是普通信息用户,还是专职检索人员,均存在着检索经验不足,检索水平不高的问题.为此,文章以国内影响最大、用户最多的2个全文数据库为例,对其检索功能及特点进行分析比较,并就如何制定、优化检索策略进行了探讨.  相似文献   

14.
Most current methods for automatic text categorization are based on supervised learning techniques and, therefore, they face the problem of requiring a great number of training instances to construct an accurate classifier. In order to tackle this problem, this paper proposes a new semi-supervised method for text categorization, which considers the automatic extraction of unlabeled examples from the Web and the application of an enriched self-training approach for the construction of the classifier. This method, even though language independent, is more pertinent for scenarios where large sets of labeled resources do not exist. That, for instance, could be the case of several application domains in different non-English languages such as Spanish. The experimental evaluation of the method was carried out in three different tasks and in two different languages. The achieved results demonstrate the applicability and usefulness of the proposed method.  相似文献   

15.
16.
艾霞  彭奇志 《图书馆杂志》2005,24(12):50-51
概述了SDOS全文电子期刊数据库,介绍了其检索技术、检索方法和检索结果的保存方式。  相似文献   

17.
ABSTRACT

This article looks at two different projects, the Google Book Search project and the Text Creation Partnership, both linked to the University of Michigan Library. Though on the surface they are very different, the Text Creation Partnership, a model of cooperation between scholars, librarians, and publishers, may have some useful lessons for how to make Google Book Search more useful for academic research, particularly in full-text searching.  相似文献   

18.
Exploring criteria for successful query expansion in the genomic domain   总被引:1,自引:0,他引:1  
Query Expansion is commonly used in Information Retrieval to overcome vocabulary mismatch issues, such as synonymy between the original query terms and a relevant document. In general, query expansion experiments exhibit mixed results. Overall TREC Genomics Track results are also mixed; however, results from the top performing systems provide strong evidence supporting the need for expansion. In this paper, we examine the conditions necessary for optimal query expansion performance with respect to two system design issues: IR framework and knowledge source used for expansion. We present a query expansion framework that improves Okapi baseline passage MAP performance by 185%. Using this framework, we compare and contrast the effectiveness of a variety of biomedical knowledge sources used by TREC 2006 Genomics Track participants for expansion. Based on the outcome of these experiments, we discuss the success factors required for effective query expansion with respect to various sources of term expansion, such as corpus-based cooccurrence statistics, pseudo-relevance feedback methods, and domain-specific and domain-independent ontologies and databases. Our results show that choice of document ranking algorithm is the most important factor affecting retrieval performance on this dataset. In addition, when an appropriate ranking algorithm is used, we find that query expansion with domain-specific knowledge sources provides an equally substantive gain in performance over a baseline system.
Nicola StokesEmail: Email:
  相似文献   

19.
Abstract

Well-chosen keywords in titles are significant in enabling optimal document retrieval. Title keyword searches employing the natural language of the researcher augment controlled vocabulary searches. Authors and researchers interested in a particular topic share a vocabulary that contains keywords useful in database searching. It is important for authors to incorporate such keywords in their titles. Both author and researcher will benefit if titles facilitate electronic access. Librarians can assist in educating authors on the benefits of using distinctive and selective keywords in titles by making guidelines available.  相似文献   

20.
A probability ranking principle for interactive information retrieval   总被引:1,自引:1,他引:0  
The classical Probability Ranking Principle (PRP) forms the theoretical basis for probabilistic Information Retrieval (IR) models, which are dominating IR theory since about 20 years. However, the assumptions underlying the PRP often do not hold, and its view is too narrow for interactive information retrieval (IIR). In this article, a new theoretical framework for interactive retrieval is proposed: The basic idea is that during IIR, a user moves between situations. In each situation, the system presents to the user a list of choices, about which s/he has to decide, and the first positive decision moves the user to a new situation. Each choice is associated with a number of cost and probability parameters. Based on these parameters, an optimum ordering of the choices can the derived—the PRP for IIR. The relationship of this rule to the classical PRP is described, and issues of further research are pointed out.
Norbert FuhrEmail:
  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号