首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Direct end-user data entry and retrieval is a major factor in achieving an economical information retrieval system. To be effective, such a system would have to provide a thesaurus structure which leads novice end-users to browse subject areas before retrieval and yet provides control and coverage of terms in a domain. A faceted hierarchical thesaurus organization has been designed to accomplish this goal.  相似文献   

2.
Forty-eight years ago Maron and Kuhns published their paper, “On Relevance, Probabilistic Indexing and Information Retrieval” (1960). This was the first paper to present a probabilistic approach to information retrieval, and perhaps the first paper on ranked retrieval. Although it is one of the most widely cited papers in the field of information retrieval, many researchers today may not be familiar with its influence. This paper describes the Maron and Kuhns article and the influence that it has had on the field of information retrieval.  相似文献   

3.
4.
In Mongolian, two different alphabets are used, Cyrillic and Mongolian. In this paper, we focus solely on the Mongolian language using the Cyrillic alphabet, in which a content word can be inflected when concatenated with one or more suffixes. Identifying the original form of content words is crucial for natural language processing and information retrieval. We propose a lemmatization method for Mongolian. The advantage of our lemmatization method is that it does not rely on noun dictionaries, enabling us to lemmatize out-of-dictionary words. We also apply our method to indexing for information retrieval. We use newspaper articles and technical abstracts in experiments that show the effectiveness of our method. Our research is the first significant exploration of the effectiveness of lemmatization for information retrieval in Mongolian.  相似文献   

5.
6.
Most previous information retrieval (IR) models assume that terms of queries and documents are statistically independent from each other. However, conditional independence assumption is obviously and openly understood to be wrong, so we present a new method of incorporating term dependence into a probabilistic retrieval model by adapting a dependency structured indexing system using a dependency parse tree and Chow Expansion to compensate the weakness of the assumption. In this paper, we describe a theoretic process to apply the Chow Expansion to the general probabilistic models and the state-of-the-art 2-Poisson model. Through experiments on document collections in English and Korean, we demonstrate that the incorporation of term dependences using Chow Expansion contributes to the improvement of performance in probabilistic IR systems.  相似文献   

7.
Does human intellectual indexing have a continuing role to play in the face of increasingly sophisticated automatic indexing techniques? In this two-part essay, a computer scientist and long-time TREC participant (Pérez-Carballo) and a practitioner and teacher of human cataloging and indexing (Anderson) pursue this question by reviewing the opinions and research of leading experts on both sides of this divide. We conclude that human analysis should be used on a much more selective basis, and we offer suggestions on how these two types of indexing might be allocated to best advantage. Part one of the essay critiques the comparative research, then explores the nature of human analysis of messages or texts and efforts to formulate rules to make human practice more rigorous and predictable. We find that research comparing human vs automatic approaches has done little to change strongly held beliefs, in large part because many associated variables have not been isolated or controlled.Part II focuses on current methods in automatic indexing, its gradual adoption by major indexing and abstracting services, and ways for allocating human and machine approaches. Overall, we conclude that both approaches to indexing have been found to be effective by researchers and searchers, each with particular advantages and disadvantages. However automatic indexing has the over-arching advantage of decreasing cost, as human indexing becomes ever more expensive.  相似文献   

8.
This paper presents a size reduction method for the inverted file, the most suitable indexing structure for an information retrieval system (IRS). We notice that in an inverted file the document identifiers for a given word are usually clustered. While this clustering property can be used in reducing the size of the inverted file, good compression as well as fast decompression must both be available. In this paper, we present a method that can facilitate coding and decoding processes for interpolative coding using recursion elimination and loop unwinding. We call this method the unique-order interpolative coding. It can calculate the lower and upper bounds of every document identifier for a binary code without using a recursive process, hence the decompression time can be greatly reduced. Moreover, it also can exploit document identifier clustering to compress the inverted file efficiently. Compared with the other well-known compression methods, our method provides fast decoding speed and excellent compression. This method can also be used to support a self-indexing strategy. Therefore our research work in this paper provides a feasible way to build a fast and space-economical IRS.  相似文献   

9.
The fundamental idea of the work reported here is to extract index phrases from texts with the help of a single word concept dictionary and a thesaurus containing relations among concepts. The work is based on the fact, that, within every phrase, the single words the phrase is composed of are related in a certain well denned manner, the type of relations holding between concepts depending only on the concepts themselves. Therefore relations can be stored in a semantic network. The algorithm described extracts single word concepts from texts and combines them to phrases using the semantic relations between these concepts, which are stored in the network. The results obtained show that phrase extraction from texts by this semantic method is possible and offers many advantages over other (purely syntactic or statistic) methods concerning preciseness and completeness of the meaning representation of the text. But the results show, too, that some syntactic and morphologic “filtering” should be included for effectivity reasons.  相似文献   

10.
Does human intellectual indexing have a continuing role to play in the face of increasingly sophisticated automatic indexing techniques? In this two-part essay, a computer scientist and long-time TREC participant (Pérez-Carballo) and a practitioner and teacher of human cataloging and indexing (Anderson) pursue this question by reviewing the opinions and research of leading experts on both sides of this divide. We conclude that human analysis should be used on a much more selective basis, and we offer suggestions on how these two types indexing might be allocated to best advantage. Part I of the essay critiques the comparative research, then explores the nature of human analysis of messages or texts and efforts to formulate rules to make human practice more rigorous and predictable. We find that research comparing human versus automatic approaches has done little to change strongly held beliefs, in large part because many associated variables have not been isolated or controlled.Part II focuses on current methods in automatic indexing, its gradual adoption by major indexing and abstracting services, and ways for allocating human and machine approaches. Overall, we conclude that both approaches to indexing have been found to be effective by researchers and searchers, each with particular advantages and disadvantages. However, automatic indexing has the over-arching advantage of decreasing cost, as human indexing becomes ever more expensive.  相似文献   

11.
PERMDEX is a microcomputer program to assist in the creation of a permuted printed index which preserves the context of indexing paraphrases. Although much simpler than PRECIS, the microcomputer program was inspired by it, and uses role operators to permute terms through lead, qualifier, and display positions. Following a discussion of derivative vs assignment indexing, the use of roles, and the concept behind PRECIS, features of the program are described including indexer input and prompts, the shunting algorithm, and sorting and printing routines.  相似文献   

12.
This article is Part IV in a series of articles that report the results of a two year research program the purpose of which is to design and test intelligent information retrieval (IR) devices for undergraduates researching a social science term paper. The devices are task-facilitating, helping the undergraduate perform the task of researching and writing a term paper, and they provide a mold which takes the students’ amorphous conception of their information need, turning it into an effective query to the IR system. The present article reports the results of two studies which tested an uncertainty expansion IR device and an uncertainty reduction IR device in naturalistic settings. The devices are designed to be given at different stages of Kuhlthau’s information search process (ISP). In both studies, undergraduates were randomly assigned to either the test group, who received the device intervention, or a comparison group. In Study 1, we found that the comparison group received a higher mean mark than students in the uncertainty expansion group. In Study 2, we found that the uncertainty reduction group received a higher mean mark than the comparison group when the device was given later in the student’s ISP. We conclude that the timing of the device interventions is crucial to their potential efficacy.  相似文献   

13.
While test collections provide the cornerstone for Cranfield-based evaluation of information retrieval (IR) systems, it has become practically infeasible to rely on traditional pooling techniques to construct test collections at the scale of today’s massive document collections (e.g., ClueWeb12’s 700M+ Webpages). This has motivated a flurry of studies proposing more cost-effective yet reliable IR evaluation methods. In this paper, we propose a new intelligent topic selection method which reduces the number of search topics (and thereby costly human relevance judgments) needed for reliable IR evaluation. To rigorously assess our method, we integrate previously disparate lines of research on intelligent topic selection and deep vs. shallow judging (i.e., whether it is more cost-effective to collect many relevance judgments for a few topics or a few judgments for many topics). While prior work on intelligent topic selection has never been evaluated against shallow judging baselines, prior work on deep vs. shallow judging has largely argued for shallowed judging, but assuming random topic selection. We argue that for evaluating any topic selection method, ultimately one must ask whether it is actually useful to select topics, or should one simply perform shallow judging over many topics? In seeking a rigorous answer to this over-arching question, we conduct a comprehensive investigation over a set of relevant factors never previously studied together: 1) method of topic selection; 2) the effect of topic familiarity on human judging speed; and 3) how different topic generation processes (requiring varying human effort) impact (i) budget utilization and (ii) the resultant quality of judgments. Experiments on NIST TREC Robust 2003 and Robust 2004 test collections show that not only can we reliably evaluate IR systems with fewer topics, but also that: 1) when topics are intelligently selected, deep judging is often more cost-effective than shallow judging in evaluation reliability; and 2) topic familiarity and topic generation costs greatly impact the evaluation cost vs. reliability trade-off. Our findings challenge conventional wisdom in showing that deep judging is often preferable to shallow judging when topics are selected intelligently.  相似文献   

14.
Classaurus is a faceted hierarchic scheme of terms with vocabulary control features. It is a system of terms having separate hierarchic schedules of the Elementary Categories: Discipline, Entity, Property, and Action, together with their respective Species/Types, Parts and Special Modifiers. Also there are separate schedules for the Common Modifiers: Form, Time, Environment, and Place. Each of the terms in these hierarchic schedules is enriched with synonyms, quasi synonyms etc. The hierarchic schedules constituting the systematic part is supplemented by an alphabetical index of chain entries. Classaurus is used in the formulation of subject headings in general, and in particular, subject headings according to the Postulate based Permuted Subject Indexing (POPSI) language. For the construction of classaurus the POPSI language itself provides guidelines. A set of programs have been developed to construct a classaurus using as input, subject headings formulated according to POPSI language which are enriched with certain codes to denote the different Elementary Categories, their Species, Parts, Special Modifiers and other Common Modifiers of different kinds. The resulting classaurus has hierarchic schedules but terms in an array are arranged only alphabetically. The hierarchic schedules constitute the Systematic part of the classaurus. The system generates an alphabetic Index Part to the Systematic Part, in which for each term its broader terms are kept to its right hand side successively along with a code to denote the schedule to which the term belongs. To find out the position of a term in the Systematic Part, the whole entry for the term in the Alphabetic Part is taken and the sequence of the terms in it is reversed. Using the code for the schedule in the entry, the appropriate hierarchic schedule is selected. The schedule is then searched using the broader terms successively as keys until the term in question is reached, wherein all the hierarchically related terms could be found, including synonyms, quasi-synonyms etc. Both the Systematic Part and the Alphabetical Index Part are printed out for manual reference and also kept as direct access files for ondashline access and ondashth-spot updating and building up of the classaurus while inputting new subject headings formulated for this purpose.  相似文献   

15.
16.
In Information Retrieval (IR), the efficient indexing of terabyte-scale and larger corpora is still a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. Moreover, we evaluate these indexing strategies by implementing them in an existing IR framework, and performing experiments using the Hadoop MapReduce implementation, in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies, and for the most efficient strategy, we examine how it scales with respect to corpus size, and processing power. Our results attest to both the importance of minimising data transfer between machines for IO intensive tasks like indexing, and the suitability of the per-posting list MapReduce indexing strategy, in particular for indexing at a terabyte-scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing.  相似文献   

17.
Reflink is a system based on sophisticated information retrieval techniques which is housed in a small, inexpensive microcomputer. The paper describes the system, along with its predecessor, REFLES. Such systems are then placed in a conceptual framework involving stable and dynamic information, and several illustrations of multiple uses are presented.  相似文献   

18.
The paper discusses the notion of steps in indexing and reveals that the document-centered approach to indexing is prevalent and argues that the document-centered approach is problematic because it blocks out context-dependent factors in the indexing process. A domain-centered approach to indexing is presented as an alternative and the paper discusses how this approach includes a broader range of analyses and how it requires a new set of actions from using this approach; analysis of the domain, users and indexers. The paper concludes that the two-step procedure to indexing is insufficient to explain the indexing process and suggests that the domain-centered approach offers a guide for indexers that can help them manage the complexity of indexing.  相似文献   

19.
As an information medium, video offers many possible retrieval and browsing modalities, far more than text, image or audio. Some of these, like searching the text of the spoken dialogue, are well developed, others like keyframe browsing tools are in their infancy, and others not yet technically achievable. For those modalities for browsing and retrieval which we cannot yet achieve we can only speculate as to how useful they will actually be, but we do not know for sure. In our work we have created a system to support multiple modalities for video browsing and retrieval including text search through the spoken dialogue, image matching against shot keyframes and object matching against segmented video objects. For the last of these, automatic segmentation and tracking of video objects is a computationally demanding problem which is not yet solved for generic natural video material, and when it is then it is expected to open up possibilities for user interaction with objects in video, including searching and browsing. In this paper we achieve object segmentation by working in a closed domain of animated cartoons. We describe an interactive user experiment on a medium-sized corpus of video where we were able to measure users’ use of video objects versus other modes of retrieval during multiple-iteration searching. Results of this experiment show that although object searching is used far less than text searching in the first iteration of a user’s search it is a popular and useful search type once an initial set of relevant shots have been found.  相似文献   

20.
An ordering system for a global information network is necessary in order to enable the user to retrieve the particular information he is looking for. Classification has been one of the methods of ordering. The principle of traditional classification has been based on the idea of partitioning the universe of knowledge in mutually exclusive classes, i.e. subjects. A particular topic is defined by narrower classification within a class following the principle of ‘genusspecies’ relationship. Ranganathan's system of faceted classification has only replaced the classification of terms into subjects and sub-subjects by classification of terms into five ambiguous categories. Taube's system of coordinate indexing gives full freedom to the user to combine any number of terms of his choice. To be effective for social sciences such a system has to overcome some difficult problems of semantics. The system MANIS described here maintains the traditional classification and yet allows the user to combine terms of his choice, where the choice is restricted to the terms belonging to the system of traditional classification.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号