20 similar documents found (search time: 15 ms)
1.
As the number and diversity of distributed Web databases on the Internet increase exponentially, it is difficult for users
to know which databases are appropriate to search. Given database language models that describe the content of each database,
database selection services can provide assistance in locating databases relevant to the information needs of users. In this
paper, we propose a database selection approach based on statistical language modeling. The basic idea behind the approach
is that, for databases that are categorized into a topic hierarchy, individual language models are estimated at different
search stages, and then the databases are ranked by the similarity to the query according to the estimated language model.
Two-stage smoothed language models are presented to circumvent inaccuracy due to word sparseness. Experimental results demonstrate
that such a language modeling approach is competitive with current state-of-the-art database selection approaches.
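The ranking idea in this abstract can be illustrated with a minimal sketch. It assumes one common form of two-stage smoothing (a Dirichlet prior followed by Jelinek-Mercer interpolation with a background model); the abstract does not give the paper's exact estimator, and the function names, the parameters `mu` and `lam`, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def two_stage_score(query, db_terms, background, mu=1000.0, lam=0.5):
    """Query log-likelihood under a two-stage smoothed database model.

    Assumed form (not taken from the paper):
      stage 1: Dirichlet-prior smoothing of the database term counts
      stage 2: Jelinek-Mercer interpolation with the background model
    """
    counts = Counter(db_terms)
    length = len(db_terms)
    score = 0.0
    for word in query:
        p_bg = background.get(word, 1e-9)  # small floor for unseen words
        p_dirichlet = (counts[word] + mu * p_bg) / (length + mu)
        p = (1.0 - lam) * p_dirichlet + lam * p_bg
        score += math.log(p)
    return score


def rank_databases(query, databases, background, mu=1000.0, lam=0.5):
    """Return database names sorted by descending query log-likelihood."""
    scores = {
        name: two_stage_score(query, terms, background, mu, lam)
        for name, terms in databases.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: each database is represented by a bag of sampled terms.
databases = {
    "med": ["cancer", "therapy", "cancer", "drug"],
    "law": ["court", "judge", "law", "court"],
}
all_terms = [t for terms in databases.values() for t in terms]
background = {t: c / len(all_terms) for t, c in Counter(all_terms).items()}
ranking = rank_databases(["cancer", "therapy"], databases, background)
```

In a topic hierarchy, the background model at each search stage would presumably be the language model of the parent category rather than one global model; that detail is left out here.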
2.
Index maintenance strategies employed by dynamic text retrieval systems based on inverted files can be divided into two categories:
merge-based and in-place update strategies. Within each category, individual update policies can be distinguished based on
whether they store their on-disk posting lists in a contiguous or in a discontiguous fashion. Contiguous inverted lists, in
general, lead to higher query performance, by minimizing the disk seek overhead at query time, while discontiguous inverted
lists lead to higher update performance, requiring less effort during index maintenance operations. In this paper, we focus
on retrieval systems with high query load, where the on-disk posting lists have to be stored in a contiguous fashion at all
times. We discuss a combination of re-merge and in-place index update, called Hybrid Immediate Merge. The method performs strictly better than the re-merge baseline policy used in our experiments, as it leads to the same query
performance, but substantially better update performance. The actual time savings achievable depend on the size of the text
collection being indexed; a larger collection results in greater savings. In our experiments, variations of Hybrid Immediate Merge were able to reduce the total index update overhead by up to 73% compared to the re-merge baseline.
Stefan Büttcher
3.
In retrieving medical free text, users are often interested in answers pertinent to certain scenarios that correspond to common
tasks performed in medical practice, e.g., treatment or diagnosis of a disease. A major challenge in handling such queries is that scenario terms in the query (e.g., treatment) are often too general to match specialized terms in relevant documents (e.g., chemotherapy). In this paper, we propose a knowledge-based query expansion method that exploits the UMLS knowledge source to augment the
original query with additional terms that are specifically relevant to the query's scenario(s). We compared the proposed method
with traditional statistical expansion that expands terms which are statistically correlated but not necessarily scenario
specific. Our study on two standard testbeds shows that the knowledge-based method, by providing scenario-specific expansion,
yields notable improvements over the statistical method in terms of average precision-recall. On the OHSUMED testbed, for
example, the improvement is more than 5% averaging over all scenario-specific queries studied and about 10% for queries that
mention certain scenarios, such as treatment of a disease and differential diagnosis of a symptom/disease.
Wesley W. Chu
4.
Intelligent use of the many diverse forms of data available on the Internet requires new tools for managing and manipulating
heterogeneous forms of information. This paper uses WHIRL, an extension of relational databases that can manipulate textual
data using statistical similarity measures developed by the information retrieval community. We show that although WHIRL is
designed for more general similarity-based reasoning tasks, it is competitive with mature systems designed explicitly for
inductive classification. In particular, WHIRL is well suited for combining different sources of knowledge in the classification
process. We show on a diverse set of tasks that the use of appropriate sets of unlabeled background knowledge often decreases
error rates, particularly if the number of examples or the size of the strings in the training set is small. This is especially
useful when labeling text is a labor-intensive job and when there is a large amount of information available about a particular
problem on the World Wide Web.
Haym Hirsh
5.
This study examines the use of an ontology as a search tool. Sixteen subjects created queries using the Concept-based Information
Retrieval Interface (CIRI) and a regular baseline IR interface. The simulated work task method was used to make the searching
situations realistic. Subjects’ search experiences, queries and search results were examined. The numbers of search concepts
and keys, as well as their overlap in the queries were investigated. The effectiveness of the CIRI and baseline queries was
compared. An Ontology Index (OI) was calculated for all search tasks and the correlation between the OI and the overlap of
search concepts and keys in queries was investigated. The number of search keys and concepts was higher in CIRI queries than
in baseline interface queries. Also, the overlap of search keys was higher among CIRI users than among baseline users. Both
findings are due to CIRI’s expansion feature. There was no clear correlation between OI and overlap of search concepts
and keys. The search results were evaluated with generalised precision and recall, and relevance scores based on individual
relevance assessments. The baseline interface queries performed better in all comparisons, but the difference was statistically
significant only in relevance scores based on individual relevance assessments.
6.
Result merging methods in distributed information retrieval with overlapping databases (total citations: 5, self-citations: 0, citations by others: 5)
In distributed information retrieval systems, document overlaps occur frequently among different component databases. This
paper presents an experimental investigation and evaluation of a group of result merging methods including the shadow document
method and the multi-evidence method in the environment of overlapping databases. We assume that, apart from the resulting
document lists (either with rankings or scores), no extra information about retrieval servers and text databases is available,
which is the usual case for many applications on the Internet and the Web.
The experimental results show that the shadow document method and the multi-evidence method are the two best methods when
overlap is high, while Round-robin is the best for low overlap. The experiments also show that [0,1] linear normalization
is a better option than linear regression normalization for result merging in a heterogeneous environment.
Sally McClean
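Two of the building blocks named in this abstract, [0,1] linear normalization and Round-robin merging, can be sketched as follows. This is a generic illustration under stated assumptions, not the paper's implementation: the function names and toy result lists are hypothetical, duplicates across overlapping databases are simply skipped on second sight, and the shadow document and multi-evidence methods are not shown because the abstract does not define them.

```python
def normalize_01(scores):
    """Linearly map raw scores from one database onto [0, 1].

    `scores` maps document id -> raw score. If all scores are equal,
    every document is assigned 1.0 (an arbitrary choice for this sketch).
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def round_robin(result_lists):
    """Interleave ranked lists, taking one document from each list per round.

    A document returned by several overlapping databases is kept only at
    its first appearance in the merged list.
    """
    merged, seen = [], set()
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged


normalized = normalize_01({"a": 2.0, "b": 4.0, "c": 3.0})
merged = round_robin([["d1", "d2", "d3"], ["d2", "d4"]])
```

Round-robin needs only rankings, while score normalization needs the raw scores; this matches the abstract's assumption that result lists arrive either with rankings or with scores.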
7.
With the help of a team of expert biologist judges, the TREC Genomics track has generated four large sets of “gold standard”
test collections, comprised of over a hundred unique topics, two kinds of ad hoc retrieval tasks, and their corresponding
relevance judgments. Over the years of the track, increasingly complex tasks necessitated the creation of judging tools and
training guidelines to accommodate teams of part-time short-term workers from a variety of specialized biological scientific
backgrounds, and to address consistency and reproducibility of the assessment process. Important lessons were learned about
factors that influenced the utility of the test collections including topic design, annotations provided by judges, methods
used for identifying and training judges, and providing a central moderator “meta-judge”.
8.
《Journal of Informetrics》2020,14(4):101076
Effectively representing the relationship between documents and their contents is crucial to improving the performance of text classification. Term weighting is a preprocessing step that aims to represent text documents better in the vector space by assigning proper weights to terms. Because the choice of weight values directly affects classification performance, term weighting remains an important sub-area of text classification research. In this study, we propose a novel term weighting strategy (MONO) that uses the non-occurrence information of terms more effectively than existing term weighting approaches. The proposed strategy also performs intra-class document scaling to better represent the distinguishing capability of terms that occur in different numbers of documents within the same number of classes. Based on the MONO strategy, two novel supervised term weighting schemes, TF-MONO and SRTF-MONO, are proposed for text classification. The proposed schemes were tested with two classifiers, SVM and KNN, on three datasets: Reuters-21578, 20-Newsgroups, and WebKB. Their classification performance was compared with that of five existing term weighting schemes: TF-IDF, TF-IDF-ICF, TF-RF, TF-IDF-ICSDF, and TF-IGM. The results obtained from the seven schemes show that SRTF-MONO generally outperformed the other schemes on all three datasets. Moreover, TF-MONO yielded promising Micro-F1 and Macro-F1 results compared with the five benchmark term weighting methods, especially on the Reuters-21578 and 20-Newsgroups datasets.
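For orientation, the simplest of the baseline schemes named above, TF-IDF, can be sketched in a few lines. This uses the textbook form (raw term frequency times log(N / df)), which may differ from the variant used in the paper's experiments; the MONO weights themselves are not reproduced because the abstract does not give their formula. The function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = tf(term, doc) * log(N / df(term)), where N is the number of
    documents and df is the number of documents containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append(
            {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
        )
    return weights


docs = [["oil", "price", "oil"], ["price", "stock"]]
weights = tf_idf(docs)
```

A term that appears in every document ("price" here) gets weight zero. TF-IDF uses no class labels at all, whereas supervised schemes such as those compared in the abstract do.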
9.
10.
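Design and implementation of a distributed parallel information retrieval system: a case study of a basic education resource search engine (total citations: 1, self-citations: 0, citations by others: 1)
In large-scale information retrieval, with the rapid development of high-speed network technology, distributed parallel information retrieval has attracted increasing attention because of its efficiency and economy. Drawing on the design and development of a basic education resource search engine, this paper discusses key techniques in distributed parallel information retrieval systems, including data distribution, query task decomposition, and node redundancy.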
11.
Armin Hust 《Informatik - Forschung und Entwicklung》2005,19(4):224-238
Information retrieval systems have achieved only small improvements in retrieval performance in recent years.
We are working on new approaches, so-called collaborative information retrieval (CIR), that have the potential to achieve strong
improvements. CIR is the method by which retrieval performance for the current query is improved by exploiting information
from earlier queries. We consider a restricted scenario in which only old queries and their relevant answer documents
are available. Under these conditions, new approaches to query expansion methods lead
to improvements in retrieval performance.
The accuracy of ad-hoc document retrieval systems has reached a stable plateau in the last few years. We are working on so-called
collaborative information retrieval (CIR) systems which have the potential to overcome the current limits. We define CIR as
a task, where an information retrieval (IR) system uses information gathered from previous search processes from one or several
users to improve retrieval performance for the current user searching for information. We focus on a restricted setting in
CIR in which only old queries and correct answer documents to these queries are available for improving a new query. For this
restricted setting we propose new approaches for query expansion procedures. We show how CIR methods can improve overall IR
performance.
CR Subject Classification H.3.3
12.
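Novelty-search institutions need to analyze the loyalty and profitability of the users they have accumulated over time, identify key users, and, under a CIO (chief information officer) responsibility system, deliver push services covering the relevant intelligence industry chain, designing and pushing reminder-type, care-type, and value-added information service content for key users. Key-user resources are integrated to serve specific information service goals and to achieve the aim of "getting user resources moving, service content moving, and information staff moving." This not only enables the effective circulation of users within the research library but also strengthens the sustainable development capacity of novelty-search institutions.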
13.
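Retrieval methods and techniques for Chinese full-text journal databases (total citations: 5, self-citations: 0, citations by others: 5)
Now that humanity has entered the information age, mastering computer-based information retrieval skills has become a basic competency for professionals of all kinds. At present, both ordinary information users and full-time searchers suffer from insufficient retrieval experience and limited retrieval skill. This article therefore takes the two most influential and most widely used full-text databases in China as examples, analyzes and compares their retrieval functions and features, and discusses how to formulate and optimize search strategies.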
14.
Rafael Guzmán-Cabrera Manuel Montes-y-Gómez Paolo Rosso Luis Villaseñor-Pineda 《Information Retrieval》2009,12(3):400-415
Most current methods for automatic text categorization are based on supervised learning techniques and, therefore, they face the problem of requiring a great number of training instances to construct an accurate classifier. In order to tackle this problem, this paper proposes a new semi-supervised method for text categorization, which considers the automatic extraction of unlabeled examples from the Web and the application of an enriched self-training approach for the construction of the classifier. This method, even though language independent, is more pertinent for scenarios where large sets of labeled resources do not exist. That, for instance, could be the case of several application domains in different non-English languages such as Spanish. The experimental evaluation of the method was carried out in three different tasks and in two different languages. The achieved results demonstrate the applicability and usefulness of the proposed method.
15.
16.
17.
Shawn Martin 《Journal of Library Administration》2013,53(1-2):141-150
ABSTRACT This article looks at two different projects, the Google Book Search project and the Text Creation Partnership, both linked to the University of Michigan Library. Though on the surface they are very different, the Text Creation Partnership, a model of cooperation between scholars, librarians, and publishers, may have some useful lessons for how to make Google Book Search more useful for academic research, particularly in full-text searching.
18.
Query Expansion is commonly used in Information Retrieval to overcome vocabulary mismatch issues, such as synonymy between
the original query terms and a relevant document. In general, query expansion experiments exhibit mixed results. Overall TREC
Genomics Track results are also mixed; however, results from the top performing systems provide strong evidence supporting
the need for expansion. In this paper, we examine the conditions necessary for optimal query expansion performance with respect
to two system design issues: IR framework and knowledge source used for expansion. We present a query expansion framework
that improves Okapi baseline passage MAP performance by 185%. Using this framework, we compare and contrast the effectiveness
of a variety of biomedical knowledge sources used by TREC 2006 Genomics Track participants for expansion. Based on the outcome
of these experiments, we discuss the success factors required for effective query expansion with respect to various sources
of term expansion, such as corpus-based cooccurrence statistics, pseudo-relevance feedback methods, and domain-specific and
domain-independent ontologies and databases. Our results show that choice of document ranking algorithm is the most important
factor affecting retrieval performance on this dataset. In addition, when an appropriate ranking algorithm is used, we find
that query expansion with domain-specific knowledge sources provides an equally substantive gain in performance over a baseline
system.
Nicola Stokes
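Of the expansion sources listed in this abstract, pseudo-relevance feedback is the easiest to illustrate. The sketch below is a deliberately simple, generic version (add the most frequent non-query terms from the top-ranked documents); it is not the framework evaluated in the paper, and the function name, parameters, and toy documents are hypothetical.

```python
from collections import Counter


def expand_query(query, ranked_docs, top_k=2, n_terms=2):
    """Append the most frequent non-query terms from the top_k documents.

    `ranked_docs` is a list of tokenized documents, best match first,
    as returned by some initial retrieval run (not shown here).
    """
    pool = Counter()
    for doc in ranked_docs[:top_k]:
        pool.update(term for term in doc if term not in query)
    extra_terms = [term for term, _ in pool.most_common(n_terms)]
    return list(query) + extra_terms


ranked_docs = [
    ["gene", "brca1", "mutation", "brca1"],
    ["brca1", "tumor", "mutation"],
    ["unrelated"],
]
expanded = expand_query(["gene"], ranked_docs)
```

A knowledge-based variant would draw the extra terms from an ontology or database lookup instead of the feedback documents; the abstract reports that such domain-specific sources help once an appropriate ranking algorithm is in place.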
19.
《Public Services Quarterly》2013,9(2):15-22
Abstract Well-chosen keywords in titles are significant in enabling optimal document retrieval. Title keyword searches employing the natural language of the researcher augment controlled vocabulary searches. Authors and researchers interested in a particular topic share a vocabulary that contains keywords useful in database searching. It is important for authors to incorporate such keywords in their titles. Both author and researcher will benefit if titles facilitate electronic access. Librarians can assist in educating authors on the benefits of using distinctive and selective keywords in titles by making guidelines available.
20.
Norbert Fuhr 《Information Retrieval》2008,11(3):251-265
The classical Probability Ranking Principle (PRP) forms the theoretical basis for probabilistic Information Retrieval (IR)
models, which have dominated IR theory for about 20 years. However, the assumptions underlying the PRP often do not hold,
and its view is too narrow for interactive information retrieval (IIR). In this article, a new theoretical framework for interactive
retrieval is proposed: The basic idea is that during IIR, a user moves between situations. In each situation, the system presents
to the user a list of choices, about which s/he has to decide, and the first positive decision moves the user to a new situation.
Each choice is associated with a number of cost and probability parameters. Based on these parameters, an optimum ordering
of the choices can be derived: the PRP for IIR. The relationship of this rule to the classical PRP is described, and issues
of further research are pointed out.