Similar Documents
20 similar documents found (search time: 15 ms)
1.
2.
This paper presents a size reduction method for the inverted file, the most suitable indexing structure for an information retrieval system (IRS). We notice that in an inverted file the document identifiers for a given word are usually clustered. While this clustering property can be exploited to reduce the size of the inverted file, good compression as well as fast decompression must both be available. In this paper, we present a method that can facilitate the coding and decoding processes for interpolative coding using recursion elimination and loop unwinding. We call this method unique-order interpolative coding. It can calculate the lower and upper bounds of every document identifier for a binary code without using a recursive process, so the decompression time can be greatly reduced. Moreover, it can also exploit document identifier clustering to compress the inverted file efficiently. Compared with other well-known compression methods, our method provides fast decoding speed and excellent compression. This method can also be used to support a self-indexing strategy. Therefore, our work provides a feasible way to build a fast and space-economical IRS.
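The recursive form of binary interpolative coding that this paper streamlines can be sketched as follows. This is the standard textbook formulation, not the paper's recursion-free variant, and bit packing is simplified to (offset, width) pairs for illustration: the encoder emits the middle ID of each sorted run relative to tight bounds, and the decoder recomputes the same bounds so only the offsets need storing.

```python
def encode(ids, lo, hi, out):
    # Binary interpolative coding of a sorted run of document IDs known to
    # lie in [lo, hi]: emit the middle ID relative to its tight bounds,
    # then recurse on the left and right halves with narrowed ranges.
    if not ids:
        return
    m = len(ids) // 2
    x = ids[m]
    a = lo + m                     # lower bound: m smaller IDs precede x
    b = hi - (len(ids) - m - 1)    # upper bound: the rest must follow x
    width = (b - a).bit_length()   # bits needed for a value in [0, b - a]
    out.append((x - a, width))     # simplified stand-in for bit packing
    encode(ids[:m], lo, x - 1, out)
    encode(ids[m + 1:], x + 1, hi, out)

def decode(n, lo, hi, it):
    # Inverse of encode(): recompute the same bounds at every step and
    # rebuild the sorted run in order.
    if n == 0:
        return []
    m = n // 2
    a = lo + m
    b = hi - (n - m - 1)
    offset, _width = next(it)
    x = a + offset
    left = decode(m, lo, x - 1, it)
    right = decode(n - m - 1, x + 1, hi, it)
    return left + [x] + right
```

Note how clustered IDs produce narrow [a, b] intervals and therefore small widths, which is exactly the clustering property the abstract describes.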

3.
Measuring effectiveness of information retrieval (IR) systems is essential for research and development and for monitoring search quality in dynamic environments. In this study, we employ new methods for automatic ranking of retrieval systems. In these methods, we merge the retrieval results of multiple systems using various data fusion algorithms, use the top-ranked documents in the merged result as the “(pseudo) relevant documents,” and employ these documents to evaluate and rank the systems. Experiments using Text REtrieval Conference (TREC) data provide statistically significant strong correlations with human-based assessments of the same systems. We hypothesize that the selection of systems that would return documents different from the majority could eliminate the ordinary systems from data fusion and provide better discrimination among the documents and systems. This could improve the effectiveness of automatic ranking. Based on this intuition, we introduce a new method for the selection of systems to be used for data fusion. For this purpose, we use the bias concept that measures the deviation of a system from the norm or majority and employ the systems with higher bias in the data fusion process. This approach provides even higher correlations with the human-based results. We demonstrate that our approach outperforms the previously proposed automatic ranking methods.
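A minimal sketch of the pseudo-relevance idea, not the authors' exact fusion algorithms or bias measure: fuse the runs (here by summing reciprocal ranks, a simple CombSum-style merge), treat the top of the fused list as pseudo-relevant, and rank each system by its agreement with those pseudo-judgments.

```python
from collections import defaultdict

def rank_systems(runs, depth=10):
    # runs: {system name: [doc IDs, ranked best-first]}
    # 1. Fuse all runs by summing reciprocal ranks per document.
    fused = defaultdict(float)
    for docs in runs.values():
        for rank, doc in enumerate(docs, 1):
            fused[doc] += 1.0 / rank
    # 2. The top `depth` fused documents act as pseudo-relevant judgments.
    top = sorted(fused.items(), key=lambda kv: -kv[1])[:depth]
    pseudo_rel = {doc for doc, _ in top}
    # 3. Score each system by the precision of its run against them.
    return sorted(((sum(d in pseudo_rel for d in docs) / len(docs), name)
                   for name, docs in runs.items()), reverse=True)
```

A system that retrieves documents no other system returns (like "C" in the test) falls to the bottom, which is the behavior the bias-based selection step is designed to reason about.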

4.
Most previous information retrieval (IR) models assume that the terms of queries and documents are statistically independent of each other. However, this conditional independence assumption is widely understood to be wrong, so we present a new method of incorporating term dependence into a probabilistic retrieval model, adapting a dependency-structured indexing system that uses a dependency parse tree and the Chow Expansion to compensate for the weakness of the assumption. In this paper, we describe a theoretical process for applying the Chow Expansion to general probabilistic models and to the state-of-the-art 2-Poisson model. Through experiments on document collections in English and Korean, we demonstrate that incorporating term dependences using the Chow Expansion improves the performance of probabilistic IR systems.

5.
The success of information retrieval depends on the ability to measure the effective relationship between a query and its response. If both are posed in natural language, one might expect that understanding the meaning of that language could not be avoided. The aim of this research is to demonstrate that it is perhaps unnecessary to be able to determine the meaning in the absolute sense; it may be sufficient to measure how far there is a conformity in meaning, and then only in the context of the set of documents in which the answer to a query is sought. Handling a particular language using a computer is made possible through replacing certain texts by special sets. A given text has a ‘syntactic trace’, the set of all the overlapping trigrams forming part of the text. When determining the effective relationship between a query and its answer, not only do their syntactic traces play a role, but so do the traces of all other documents in the set. This is known as the ‘information trace method’.
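The syntactic trace itself is easy to compute. A minimal sketch, using the Dice coefficient as one plausible pairwise conformity measure; the full information trace method additionally weighs the traces of the other documents in the set, which is omitted here:

```python
def trace(text):
    # The 'syntactic trace': the set of all overlapping character
    # trigrams forming part of the text.
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def conformity(query, doc):
    # Dice coefficient between two traces: a crude measure of conformity
    # in meaning that never interprets the language itself.
    q, d = trace(query), trace(doc)
    return 2 * len(q & d) / (len(q) + len(d)) if q and d else 0.0
```

Identical texts score 1.0, texts sharing no trigrams score 0.0, and paraphrases land in between, which is the graded "conformity in meaning" the abstract argues for.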

6.
7.
Forty-eight years ago Maron and Kuhns published their paper, “On Relevance, Probabilistic Indexing and Information Retrieval” (1960). This was the first paper to present a probabilistic approach to information retrieval, and perhaps the first paper on ranked retrieval. Although it is one of the most widely cited papers in the field of information retrieval, many researchers today may not be familiar with its influence. This paper describes the Maron and Kuhns article and the influence that it has had on the field of information retrieval.

8.
In this study, quantitative measures of the information content of textual material have been developed based upon analysis of the linguistic structure of the sentences in the text. It has been possible to measure such properties as: (1) the amount of information contributed by a sentence to the discourse; (2) the complexity of the information within the sentence, including the overall logical structure and the contributions of local modifiers; (3) the density of information based on the ratio of the number of words in a sentence to the number of information-contributing operators. Two contrasting types of texts were used to develop the measures. The measures were then applied to contrasting sentences within one type of text. The textual material was drawn from narrative patient records and from the medical research literature. Sentences from the records were analyzed by computer and those from the literature were analyzed manually, using the same methods of analysis. The results show that quantitative measures of properties of textual information can be developed which accord with intuitively perceived differences in the informational complexity of the material.
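Measure (3), the density ratio, is straightforward to operationalize. In the sketch below the operator inventory is a placeholder; the study derives information-contributing operators from a full linguistic analysis, not from a word list:

```python
def information_density(sentence, operators):
    # Measure (3): words per information-contributing operator.
    # `operators` stands in for the operator inventory identified by the
    # linguistic analysis; a simple token match is used here.
    words = [w.strip(".,;").lower() for w in sentence.split()]
    n_ops = sum(w in operators for w in words)
    return len(words) / n_ops if n_ops else float("inf")
```

A lower ratio means each operator carries fewer surrounding words, i.e. denser information.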

9.
We analyzed natural language document retrieval queries from the Thomas Cooper Library at the University of South Carolina in order to investigate the frequency of various types of ill-formed input, such as spelling errors, co-occurrence violations, conjunctions, ellipsis and missing or incorrect punctuation. The primary reason for analyzing ill-formed inputs was to determine whether there is a significant need to study ill-formed inputs in detail. After analyzing the queries, we found that most of the queries were sentence fragments and that many of them contained some type of ill-formed input. Conjunctions caused the most problems. The next most serious problem was caused by punctuation errors. Spelling errors occurred in a small number of the queries. The remaining types of ill-formed input considered, ellipsis and co-occurrence violations, were not found in the queries.

10.
Does human intellectual indexing have a continuing role to play in the face of increasingly sophisticated automatic indexing techniques? In this two-part essay, a computer scientist and long-time TREC participant (Pérez-Carballo) and a practitioner and teacher of human cataloging and indexing (Anderson) pursue this question by reviewing the opinions and research of leading experts on both sides of this divide. We conclude that human analysis should be used on a much more selective basis, and we offer suggestions on how these two types of indexing might be allocated to best advantage. Part I of the essay critiques the comparative research, then explores the nature of human analysis of messages or texts and efforts to formulate rules to make human practice more rigorous and predictable. We find that research comparing human versus automatic approaches has done little to change strongly held beliefs, in large part because many associated variables have not been isolated or controlled. Part II focuses on current methods in automatic indexing, its gradual adoption by major indexing and abstracting services, and ways for allocating human and machine approaches. Overall, we conclude that both approaches to indexing have been found to be effective by researchers and searchers, each with particular advantages and disadvantages. However, automatic indexing has the over-arching advantage of decreasing cost, as human indexing becomes ever more expensive.

11.
In Mongolian, two different alphabets are used, Cyrillic and Mongolian. In this paper, we focus solely on the Mongolian language using the Cyrillic alphabet, in which a content word can be inflected when concatenated with one or more suffixes. Identifying the original form of content words is crucial for natural language processing and information retrieval. We propose a lemmatization method for Mongolian. The advantage of our lemmatization method is that it does not rely on noun dictionaries, enabling us to lemmatize out-of-dictionary words. We also apply our method to indexing for information retrieval. We use newspaper articles and technical abstracts in experiments that show the effectiveness of our method. Our research is the first significant exploration of the effectiveness of lemmatization for information retrieval in Mongolian.
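In spirit, dictionary-free lemmatization is longest-match suffix stripping with a guard on the remaining stem. The suffixes and words in the sketch below are purely hypothetical placeholders, not actual Mongolian morphology, which involves a richer suffix inventory and vowel-harmony constraints:

```python
def lemmatize(word, suffixes, min_stem=3):
    # Longest-match suffix stripping with a minimum-stem guard so short
    # words are not truncated into nonsense. The suffix inventory is a
    # hypothetical placeholder for illustration.
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            return word[:len(word) - len(s)]
    return word
```

Because no dictionary lookup is involved, out-of-dictionary words are lemmatized by exactly the same rule, which is the advantage the abstract highlights.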

12.
In this article we discuss the need for distributed information retrieval systems. A number of possible configurations are presented. A general approach to the design of such systems is discussed. A prototype implementation is described, together with the experience gained from it.

13.
Modern information retrieval systems are designed to supply relevant information in response to requests received from the user population. In most retrieval environments the search requests consist of keywords, or index terms, interrelated by appropriate Boolean operators. Since it is difficult for untrained users to generate effective Boolean search requests, trained search intermediaries are normally used to translate original statements of user need into useful Boolean search formulations. Methods are introduced in this study which reduce the role of the search intermediaries by making it possible to generate Boolean search formulations completely automatically from natural language statements provided by the system patrons. Frequency considerations are used automatically to generate appropriate term combinations as well as Boolean connectives relating the terms. Methods are covered to produce automatic query formulations both in a standard Boolean logic system, as well as in an extended Boolean system in which the strict interpretation of the connectives is relaxed. Experimental results are supplied to evaluate the effectiveness of the automatic query formulation process, and methods are described for applying the automatic query formulation process in practice.
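One frequency heuristic in this spirit is sketched below. The thresholds, stopword list, and the shape of the output formulation are illustrative assumptions, not the paper's exact procedure: rare terms are discriminating enough to OR together, while common terms are only useful in AND combination.

```python
STOPWORDS = {"the", "of", "a", "an", "in", "on", "for", "to", "and"}

def boolean_query(statement, doc_freq, high=100):
    # Turn a natural-language statement into a Boolean formulation using
    # document frequencies: OR the rare (discriminating) terms, AND the
    # common ones. `high` is an illustrative frequency threshold.
    terms = [w.lower().strip(".,?") for w in statement.split()]
    terms = [t for t in terms if t and t not in STOPWORDS]
    rare = [t for t in terms if doc_freq.get(t, 0) < high]
    common = [t for t in terms if doc_freq.get(t, 0) >= high]
    clauses = []
    if rare:
        clauses.append("(" + " OR ".join(rare) + ")")
    if len(common) >= 2:
        clauses.append("(" + " AND ".join(common) + ")")
    return " OR ".join(clauses)
```

The extended Boolean system mentioned in the abstract would then relax the strict set semantics of these connectives when ranking.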

14.
Natural Language Processing (NLP) techniques have been successfully used to automatically extract information from unstructured text through a detailed analysis of its content, often to satisfy particular information needs. In this paper, an automatic concept map construction technique, Fuzzy Association Concept Mapping (FACM), is proposed for the conversion of abstracted short texts into concept maps. The approach consists of a linguistic module and a recommendation module. The linguistic module is a text mining method that does not require the user to have any prior knowledge of NLP techniques. It incorporates rule-based reasoning (RBR) and case-based reasoning (CBR) for anaphoric resolution, and aims at extracting the propositions in text so as to construct a concept map automatically. The recommendation module is built on fuzzy set theory. It is an interactive process which provides suggestions of propositions for further human refinement of the automatically generated concept maps. The suggested propositions are relationships among the concepts which are not explicitly found in the paragraphs. This technique helps to stimulate individual reflection and generate new knowledge. Evaluation was carried out using the Science Citation Index (SCI) abstract database and CNET News as test data, both well-known sources with assured text quality. Experimental results show that the automatically generated concept maps closely conform to the maps generated manually by domain experts, with proportionally small differences between them. The method provides users with the ability to convert short scientific texts into a structured format which can be easily processed by computer. Moreover, it provides knowledge workers with extra time to re-think their written text and to view their knowledge from another angle.

15.

16.
Direct end-user data entry and retrieval is a major factor in achieving an economical information retrieval system. To be effective, such a system would have to provide a thesaurus structure which leads novice end-users to browse subject areas before retrieval and yet provides control and coverage of terms in a domain. A faceted hierarchical thesaurus organization has been designed to accomplish this goal.

17.
A survey is given of the potential role of artificial intelligence in retrieval systems. Papers by Bush and Turing are used to introduce early ideas in the two fields, and definitions for artificial intelligence and information retrieval for the purposes of this paper are given. A simple model of an information retrieval system provides a framework for subsequent discussion of artificial intelligence concepts and their applicability in information retrieval. Concepts surveyed include pattern recognition, representation, problem solving and planning, heuristics, and learning. The paper concludes with an outline of areas for further research on artificial intelligence in information retrieval systems.

18.
Two probabilistic approaches to cross-lingual retrieval are in wide use today: those based on probabilistic models of relevance, as exemplified by INQUERY, and those based on language modeling. INQUERY, as a query net model, allows the easy incorporation of query operators, including a synonym operator, which has proven to be extremely useful in cross-language information retrieval (CLIR), in an approach often called structured query translation. In contrast, language models incorporate translation probabilities into a unified framework. We compare the two approaches on Arabic and Spanish data sets, using two kinds of bilingual dictionaries: one derived from a conventional dictionary, and one derived from a parallel corpus. We find that structured query processing gives slightly better results when queries are not expanded. On the other hand, when queries are expanded, language modeling gives better results, but only when using a probabilistic dictionary derived from a parallel corpus. We pursue two additional issues inherent in the comparison of structured query processing with language modeling. The first concerns query expansion, and the second is the role of translation probabilities. We compare conventional expansion techniques (pseudo-relevance feedback) with relevance modeling, a new IR approach which fits into the formal framework of language modeling. We find that relevance modeling and pseudo-relevance feedback achieve comparable levels of retrieval and that good translation probabilities confer a small but significant advantage.
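The language-modeling side of the comparison can be sketched as follows: each source-language query term is scored by summing translation probabilities against a linearly smoothed target-language document model. The smoothing weight and the tiny dictionaries in the test are illustrative assumptions, not the paper's experimental setup.

```python
import math

def translated_query_likelihood(query_terms, trans_probs, doc_lm, coll_lm, lam=0.5):
    # Language-modeling CLIR: score a source-language query against a
    # target-language document by summing, per query term, translation
    # probability times the smoothed document model.
    #   trans_probs: source term -> {target term: p(target | source)}
    #   doc_lm, coll_lm: target term -> maximum-likelihood probability
    score = 0.0
    for q in query_terms:
        p = sum(pt * (lam * doc_lm.get(t, 0.0) + (1 - lam) * coll_lm.get(t, 0.0))
                for t, pt in trans_probs.get(q, {}).items())
        score += math.log(p) if p > 0 else math.log(1e-12)  # floor for unseen terms
    return score
```

Structured query translation differs in that it would instead pool the term statistics of all translations into one synthetic "synonym" term before scoring.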

19.
Nowadays, access to information requires managing multimedia databases effectively, and so multi-modal retrieval techniques (particularly image retrieval) have become an active research direction. In the past few years, many content-based image retrieval (CBIR) systems have been developed. However, despite the progress achieved in CBIR, the retrieval accuracy of current systems is still limited and often worse than that of purely textual information retrieval systems. In this paper, we propose to combine content-based and text-based approaches to multi-modal retrieval in order to achieve better results and overcome the shortcomings of these techniques when taken separately. For this purpose, we use a medical collection that includes both images and unstructured text. We retrieve images from a CBIR system and textual information through a traditional information retrieval system. Then, we combine the results obtained from both systems in order to improve the final performance. Furthermore, we use the information gain (IG) measure to reduce and improve the textual information included in multi-modal information retrieval systems. We have carried out several experiments that combine this reduction technique with the merging of visual and textual information. The results obtained are highly promising and show the benefit obtained when textual information is used to improve conventional multi-modal systems.
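The IG-based reduction step selects the textual features that most reduce uncertainty about relevance. A generic sketch over a binary relevance label follows; the actual work applies this to the text fields of the medical collection, and the toy documents below are illustrative:

```python
import math

def entropy(pos, neg):
    # Binary entropy of a (positive, negative) label count.
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(docs, labels, term):
    # IG of a term w.r.t. a binary label: the entropy reduction obtained
    # by splitting the collection on presence/absence of the term.
    # docs: list of term sets; labels: parallel list of 0/1 labels.
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without = [l for d, l in zip(docs, labels) if term not in d]
    base = entropy(labels.count(1), labels.count(0))
    n = len(labels)
    rem = sum(len(s) / n * entropy(s.count(1), s.count(0))
              for s in (with_t, without) if s)
    return base - rem
```

Terms with near-zero IG are the candidates for removal, shrinking the textual index before it is merged with the visual results.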

20.
Users enter queries that are short as well as long. The aim of this work is to evaluate techniques that can enable information retrieval (IR) systems to automatically adapt to perform better on such queries. By adaptation we refer to (1) modifications to the queries via user interaction, and (2) detecting that the original query is not a good candidate for modification. We show that the former has the potential to improve mean average precision (MAP) of long and short queries by 40% and 30% respectively, and that simple user interaction can help towards this goal. We observed that after inspecting the options presented to them, users frequently did not select any. We present techniques in this paper to determine beforehand the utility of user interaction to avoid this waste of time and effort. We show that our techniques can provide IR systems with the ability to detect and avoid interaction for unpromising queries without a significant drop in overall performance.


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)  京ICP备09084417号