Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
2.
An integrated information retrieval system generally contains multiple databases that are inconsistent in terms of their content and indexing. This paper proposes a rough set-based transfer (RST) model for integration of the concepts of document databases using various indexing languages, so that users can search through the multiple databases using any of the current indexing languages. The RST model aims to effectively create meaningful transfer relations between the terms of two indexing languages, provided a number of documents are indexed with them in parallel. In our experiment, the indexing concepts of two databases respectively using the Thesaurus of Social Science (IZ) and the Schlagwortnormdatei (SWD) are integrated by means of the RST model. Finally, this paper compares the results achieved with a cross-concordance method, a conditional probability based method and the RST model.
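
The conditional probability based method mentioned as a comparison point can be sketched roughly as follows. This is a minimal illustration, not the RST model and not the authors' implementation: the function name `transfer_probabilities` and the toy term pairs are hypothetical, and the estimate assumed here is simply the share of documents carrying a source term that also carry a given target term.

```python
from collections import Counter
from itertools import product

def transfer_probabilities(parallel_docs):
    """Estimate P(target_term | source_term) from documents indexed in both vocabularies."""
    source_counts = Counter()
    pair_counts = Counter()
    for source_terms, target_terms in parallel_docs:
        source_set, target_set = set(source_terms), set(target_terms)
        source_counts.update(source_set)
        # Count every co-occurring (source term, target term) pair once per document.
        pair_counts.update(product(source_set, target_set))
    return {(s, t): n / source_counts[s] for (s, t), n in pair_counts.items()}

# Hypothetical toy data: each document carries terms from two indexing languages.
docs = [
    ({"unemployment"}, {"Arbeitslosigkeit"}),
    ({"unemployment", "youth"}, {"Arbeitslosigkeit", "Jugend"}),
    ({"youth"}, {"Jugend"}),
]
probs = transfer_probabilities(docs)
```

With this toy data, "unemployment" maps to "Arbeitslosigkeit" in every parallel document, so that pair scores higher than the incidental pairing with "Jugend".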

3.
The presence of clustering structure in the Cystic Fibrosis (CF) Document Collection is evaluated as a function of the exhaustivity of four subject representations and two citation representations. Experimental results show that for each representation the evidence for clustering structure diminishes as the exhaustivity of the representation decreases. Three of the four subject representations show no evidence of clustering structure at the least exhaustive representations. Although many documents have no references or citations, the citation representations demonstrate the presence of clustering structure over a wider range of exhaustivity levels than the subject representations. Both citation indexes show evidence of clustering structure at the least exhaustive representations. The structures imposed on the CF Document Collection by the subject and citation indexes satisfy the necessary condition for a meaningful clustering outcome.

4.
5.
6.
In Information Retrieval (IR), the efficient indexing of terabyte-scale and larger corpora is still a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. Moreover, we evaluate these indexing strategies by implementing them in an existing IR framework, and performing experiments using the Hadoop MapReduce implementation, in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies, and for the most efficient strategy, we examine how it scales with respect to corpus size, and processing power. Our results attest to both the importance of minimising data transfer between machines for IO intensive tasks like indexing, and the suitability of the per-posting list MapReduce indexing strategy, in particular for indexing at a terabyte-scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing.
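
As a rough illustration of the per-posting list idea, the sketch below simulates the map, shuffle and reduce phases in a single Python process. It is not Hadoop code and not the paper's implementation. It assumes that "per-posting list" means each map task emits one partial posting list per term for its shard of documents (rather than one record per token), which keeps the data passed between phases small. All function names and the toy shards are hypothetical.

```python
from collections import Counter, defaultdict

def map_shard(shard):
    """Map step: build one partial posting list per term for a shard of documents."""
    partial = defaultdict(list)
    for doc_id, text in shard:
        for term, tf in Counter(text.lower().split()).items():
            partial[term].append((doc_id, tf))
    return partial.items()

def reduce_term(partial_lists):
    """Reduce step: merge a term's partial posting lists, ordered by document id."""
    return sorted(posting for plist in partial_lists for posting in plist)

def build_index(shards):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for shard in shards:
        for term, plist in map_shard(shard):
            grouped[term].append(plist)
    return {term: reduce_term(plists) for term, plists in grouped.items()}

# Hypothetical toy corpus split into two shards (one per simulated map task).
shards = [
    [(1, "map reduce indexing"), (2, "indexing large corpora")],
    [(3, "reduce data transfer"), (4, "indexing indexing")],
]
index = build_index(shards)
```

Each posting is a `(doc_id, term_frequency)` pair; the term "indexing" ends up with postings from both shards merged into one list.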

7.
The designation of overlapping hierarchies in thesauri, first outlined in 1973, is suggested as a key element in progress towards a successful man-machine partnership. An updating, expansion and theoretical background of the 1973 proposal are given. The use of the UDC, both as a matrix and a searching aid, is postulated but is not essential. Means of distinguishing overlapping terms from other “related terms” are suggested, in order to make possible the accurate representation of all hierarchical relationships. At its largest, the result could be a “Universal Reference Vocabulary”, maintained on-line only and used to construct profiles before searching via natural language and/or class numbers. It is suggested that a computer program package for a small model area within Social Sciences should be given priority.

8.
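A Semiotic Linguistic Analysis of the Subject Indexing Process: Steps of the Subject Indexing Process   Total citations: 2 (self-citations: 0, citations by others: 2)
王知津 (Wang Zhijin), 黄欣 (Huang Xin). 《情报科学》 (Information Science), 2003, 21(6): 594-599
This paper describes the three steps and four elements of the subject indexing process, outlines the basics of semiotic linguistics, and focuses on the nature of the three steps of the subject indexing process: document analysis, subject description, and subject analysis.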

9.
An ordering system for a global information network is necessary in order to enable the user to retrieve the particular information he is looking for. Classification has been one of the methods of ordering. The principle of traditional classification has been based on the idea of partitioning the universe of knowledge into mutually exclusive classes, i.e. subjects. A particular topic is defined by narrower classification within a class following the principle of the ‘genus-species’ relationship. Ranganathan's system of faceted classification has only replaced the classification of terms into subjects and sub-subjects by classification of terms into five ambiguous categories. Taube's system of coordinate indexing gives full freedom to the user to combine any number of terms of his choice. To be effective for social sciences such a system has to overcome some difficult problems of semantics. The system MANIS described here maintains the traditional classification and yet allows the user to combine terms of his choice, where the choice is restricted to the terms belonging to the system of traditional classification.

10.
Does human intellectual indexing have a continuing role to play in the face of increasingly sophisticated automatic indexing techniques? In this two-part essay, a computer scientist and long-time TREC participant (Pérez-Carballo) and a practitioner and teacher of human cataloging and indexing (Anderson) pursue this question by reviewing the opinions and research of leading experts on both sides of this divide. We conclude that human analysis should be used on a much more selective basis, and we offer suggestions on how these two types of indexing might be allocated to best advantage. Part one of the essay critiques the comparative research, then explores the nature of human analysis of messages or texts and efforts to formulate rules to make human practice more rigorous and predictable. We find that research comparing human vs automatic approaches has done little to change strongly held beliefs, in large part because many associated variables have not been isolated or controlled. Part two focuses on current methods in automatic indexing, its gradual adoption by major indexing and abstracting services, and ways for allocating human and machine approaches. Overall, we conclude that both approaches to indexing have been found to be effective by researchers and searchers, each with particular advantages and disadvantages. However, automatic indexing has the over-arching advantage of decreasing cost, as human indexing becomes ever more expensive.

11.
Forty-eight years ago Maron and Kuhns published their paper, “On Relevance, Probabilistic Indexing and Information Retrieval” (1960). This was the first paper to present a probabilistic approach to information retrieval, and perhaps the first paper on ranked retrieval. Although it is one of the most widely cited papers in the field of information retrieval, many researchers today may not be familiar with its influence. This paper describes the Maron and Kuhns article and the influence that it has had on the field of information retrieval.

12.
Numerous feature-based models have been recently proposed by the information retrieval community. The capability of features to express different relevance facets (query- or document-dependent) can explain such a success story. Such models are most of the time supervised, thus requiring a learning phase. To leverage the advantages of feature-based representations of documents, we propose TournaRank, an unsupervised approach inspired by real-life game and sport competition principles. Documents compete against each other in tournaments using features as evidence of relevance. Tournaments are modeled as a sequence of matches, which involve pairs of documents playing in turn their features. Once a tournament has ended, documents are ranked according to the number of matches they won during the tournament. This principle is generic since it can be applied to any collection type. It also provides great flexibility since different alternatives can be considered by changing the tournament type, the match rules, the feature set, or the strategies adopted by documents during matches. TournaRank was evaluated on several collections to assess the model in different contexts and to compare it with related approaches such as Learning To Rank and fusion ones: the TREC Robust2004 collection for homogeneous documents, the TREC Web2014 (ClueWeb12) collection for heterogeneous web documents, and the LETOR3.0 collection for comparison with supervised feature-based models.
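
The tournament principle can be sketched as a round-robin in which every pair of documents plays one match. This is only an illustration under stated assumptions, not the TournaRank algorithm itself: the abstract leaves the tournament type, match rules and strategies open, so the rule used here (each feature is one round, the document with the higher value takes the round, and the document with more rounds wins the match) is hypothetical, as are the function names and feature values.

```python
from itertools import combinations

def play_match(features_a, features_b):
    """Return 1 if A wins the match, -1 if B wins, 0 on a draw (one round per feature)."""
    score = sum((a > b) - (a < b) for a, b in zip(features_a, features_b))
    return (score > 0) - (score < 0)

def tournament_rank(documents):
    """Rank documents by matches won in a round-robin tournament (ties broken by id)."""
    wins = {doc_id: 0 for doc_id in documents}
    for a, b in combinations(documents, 2):
        outcome = play_match(documents[a], documents[b])
        if outcome > 0:
            wins[a] += 1
        elif outcome < 0:
            wins[b] += 1
    return sorted(wins, key=lambda doc_id: (-wins[doc_id], doc_id))

# Hypothetical feature vectors (higher means more evidence of relevance).
docs = {
    "d1": (0.9, 0.2, 0.5),
    "d2": (0.4, 0.8, 0.6),
    "d3": (0.1, 0.1, 0.9),
}
ranking = tournament_rank(docs)
```

No training data is involved: the ranking depends only on the pairwise match outcomes, which is what makes the approach unsupervised.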

13.
PERMDEX is a microcomputer program to assist in the creation of a permuted printed index which preserves the context of indexing paraphrases. Although much simpler than PRECIS, the microcomputer program was inspired by it, and uses role operators to permute terms through lead, qualifier, and display positions. Following a discussion of derivative vs assignment indexing, the use of roles, and the concept behind PRECIS, features of the program are described including indexer input and prompts, the shunting algorithm, and sorting and printing routines.

14.
We present an efficient document clustering algorithm that uses a term frequency vector for each document instead of using a huge proximity matrix. The algorithm has the following features: (1) it requires a relatively small amount of memory and runs fast, (2) it produces a hierarchy in the form of a document classification tree and (3) the hierarchy obtained by the algorithm explicitly reveals a collection structure. We confirm these features and thus show the algorithm's feasibility through clustering experiments in which we use two collections of Japanese documents, the sizes of which are 83,099 and 14,701 documents. We also introduce an application of this algorithm to a document browser. This browser is used in our Japanese-to-English translation aid system. The browsing module of the system consists of a huge database of Japanese news articles and their English translations. The Japanese article collection is clustered into a hierarchy by our method. Since each node in the hierarchy corresponds to a topic in the collection, we can use the hierarchy to directly access articles by topic. A user can learn general translation knowledge of each topic by browsing the Japanese articles and their English translations. We also discuss techniques of presenting a large tree-formed hierarchy on a computer screen.

15.
16.
The notion of an “expert system” is briefly reviewed so as to discuss its significance and to raise a number of issues relevant to document handling. The paper is tutorial in nature and does not attempt to present any new research results, although some of the points might be new ones in this particular application context.

17.
18.
The text retrieval method using the latent semantic indexing (LSI) technique with truncated singular value decomposition (SVD) has been intensively studied in recent years. The SVD reduces the noise contained in the original representation of the term–document matrix and improves the information retrieval accuracy. Recent studies indicate that SVD is mostly useful for small homogeneous data collections. For large inhomogeneous datasets, the performance of the SVD based text retrieval technique may deteriorate. We propose to partition a large inhomogeneous dataset into several smaller ones with clustered structure, on which we apply the truncated SVD. Our experimental results show that the clustered SVD strategies may enhance the retrieval accuracy and reduce the computing and storage costs.

19.
Direct end-user data entry and retrieval is a major factor in achieving an economical information retrieval system. To be effective, such a system would have to provide a thesaurus structure which leads novice end-users to browse subject areas before retrieval and yet provides control and coverage of terms in a domain. A faceted hierarchical thesaurus organization has been designed to accomplish this goal.

20.
Classaurus is a faceted hierarchic scheme of terms with vocabulary control features. It is a system of terms having separate hierarchic schedules of the Elementary Categories: Discipline, Entity, Property, and Action, together with their respective Species/Types, Parts and Special Modifiers. Also there are separate schedules for the Common Modifiers: Form, Time, Environment, and Place. Each of the terms in these hierarchic schedules is enriched with synonyms, quasi-synonyms etc. The hierarchic schedules constituting the systematic part are supplemented by an alphabetical index of chain entries. Classaurus is used in the formulation of subject headings in general, and in particular, subject headings according to the Postulate based Permuted Subject Indexing (POPSI) language. For the construction of classaurus the POPSI language itself provides guidelines. A set of programs has been developed to construct a classaurus using as input subject headings formulated according to the POPSI language, which are enriched with certain codes to denote the different Elementary Categories, their Species, Parts, Special Modifiers and other Common Modifiers of different kinds. The resulting classaurus has hierarchic schedules but terms in an array are arranged only alphabetically. The hierarchic schedules constitute the Systematic Part of the classaurus. The system generates an alphabetic Index Part to the Systematic Part, in which for each term its broader terms are kept to its right hand side successively along with a code to denote the schedule to which the term belongs. To find out the position of a term in the Systematic Part, the whole entry for the term in the Alphabetic Part is taken and the sequence of the terms in it is reversed. Using the code for the schedule in the entry, the appropriate hierarchic schedule is selected. The schedule is then searched using the broader terms successively as keys until the term in question is reached, wherein all the hierarchically related terms could be found, including synonyms, quasi-synonyms etc. Both the Systematic Part and the Alphabetical Index Part are printed out for manual reference and also kept as direct access files for on-line access and on-the-spot updating and building up of the classaurus while inputting new subject headings formulated for this purpose.
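
The lookup procedure described above (take the index entry, reverse it, pick the schedule by its code, then descend through the broader terms) can be sketched as follows. The data layout is an assumption made for illustration: schedules are modelled as nested dictionaries and each index entry as a list of terms plus a schedule code. The names `locate`, `alphabetical_index` and `schedules`, the code "E", and the sample terms are hypothetical and do not come from the original programs.

```python
def locate(term, alphabetical_index, schedules):
    """Find a term in the Systematic Part via its chain entry in the Alphabetical Index Part."""
    entry_terms, schedule_code = alphabetical_index[term]
    # The entry lists the term first, then successively broader terms; reverse it
    # so the path runs from the broadest term down to the term itself.
    path = list(reversed(entry_terms))
    node = schedules[schedule_code]  # select the hierarchic schedule by its code
    for key in path:
        node = node[key]  # descend using each broader term as the next key
    return path, node

# Hypothetical Index Part entry: term, its broader terms to the right, and a schedule code.
alphabetical_index = {"Rice": (["Rice", "Cereals", "Crops"], "E")}

# Hypothetical Systematic Part: one nested schedule per code.
schedules = {
    "E": {"Crops": {"Cereals": {"Rice": {"Basmati rice": {}}, "Wheat": {}}}},
}

path, narrower = locate("Rice", alphabetical_index, schedules)
```

Here `path` gives the term's position in the schedule and `narrower` holds the subtree beneath it; a fuller model would also attach synonyms and quasi-synonyms to each node.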
