Collection selection is a crucial function, central to the effectiveness and efficiency of a federated information retrieval system. A variety of solutions have been proposed for collection selection adapting proven techniques used in centralised retrieval. This paper defines a new approach to collection selection that models the topical distribution in each collection. We describe an extended version of latent Dirichlet allocation that uses a hierarchical hyperprior to enable the different topical distributions found in each collection to be modelled. Under the model, resources are ranked based on the topical relationship between query and collection. By modelling collections in a low dimensional topic space, we can implicitly smooth their term-based characterisation with appropriate terms from topically related samples, thereby dealing with the problem of missing vocabulary within the samples. An important advantage of adopting this hierarchical model over current approaches is that the model generalises well to unseen documents given small samples of each collection. The latent structure of each collection can therefore be estimated well despite imperfect information for each collection such as sampled documents obtained through query-based sampling. Experiments demonstrate that this new, fully integrated topical model is more robust than current state of the art collection selection algorithms.

Entity Retrieval (ER), in contrast to classical search, aims at finding individual entities instead of relevant documents. Finding a list of entities therefore requires techniques different from those of classical search engines. In this paper, we present a model to describe entities more formally and show how an ER system can be built on top of it. We compare different approaches designed for finding entities in Wikipedia and report on results using standard test collections. An analysis of entity-centric queries reveals different aspects and problems related to ER and shows limitations of current systems performing ER with Wikipedia. It also indicates which approaches are suitable for which kinds of queries.

We evaluate five term scoring methods for automatic term extraction on four different types of text collections: personal document collections, news articles, scientific articles and medical discharge summaries. Each collection has its own use case: author profiling, boolean query term suggestion, personalized query suggestion and patient query expansion. The methods for term scoring that have been proposed in the literature were designed with a specific goal in mind. However, it is as yet unclear how these methods perform on collections with characteristics different from those they were designed for, and which method is the most suitable for a given (new) collection. In a series of experiments, we evaluate, compare and analyse the output of six term scoring methods for the collections at hand. We found that the most important factors in the success of a term scoring method are the size of the collection and the importance of multi-word terms in the domain. Larger collections lead to better terms; all methods are hindered by small collection sizes (below 1000 words). The most flexible method for the extraction of single-word and multi-word terms is pointwise Kullback–Leibler divergence for informativeness and phraseness. Overall, we have shown that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than their original design purpose.
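The pointwise Kullback–Leibler scoring mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the common formulation in which informativeness compares a term's probability in the collection with its probability in a background corpus, and phraseness compares a phrase's probability with the product of its words' unigram probabilities. The function names are hypothetical, and all probabilities are assumed to be non-zero (that is, already smoothed).

```python
import math
from typing import Sequence


def pointwise_kl(p: float, q: float) -> float:
    """Contribution of a single event to KL(p || q): p * log(p / q)."""
    return p * math.log(p / q)


def informativeness(p_foreground: float, p_background: float) -> float:
    """How much more typical a term is of the collection than of a
    background corpus (hypothetical helper; inputs must be non-zero)."""
    return pointwise_kl(p_foreground, p_background)


def phraseness(p_phrase: float, p_words: Sequence[float]) -> float:
    """How much more likely a multi-word term is than its words would be
    if they occurred independently (unigram-model assumption)."""
    p_independent = math.prod(p_words)
    return pointwise_kl(p_phrase, p_independent)


def term_score(p_foreground: float, p_background: float,
               p_words: Sequence[float]) -> float:
    """Simple unweighted sum of the two components; the weighting is an
    assumption made for this sketch."""
    return (informativeness(p_foreground, p_background)
            + phraseness(p_foreground, p_words))
```

A term that is equally likely in both corpora gets an informativeness of zero, and a phrase that occurs more often than its words' independent product predicts gets a positive phraseness.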

Academic health sciences libraries in the United States and Canada were surveyed regarding collection development trends, including their effect on approval plan and blanket order use, and use of outsourcing over the past four years. Results of the survey indicate that serials market forces, budgetary constraints, and growth in electronic resources purchasing have resulted in a decline in the acquisition of print items. As a result, approval plan use is being curtailed in many academic health sciences libraries. Although use of blanket orders is more stable, fewer than one-third of academic health sciences libraries report using them currently. The decline of print collections suggests that libraries should explore cooperative collection development of print materials to ensure access and preservation. The decline of approval plan use and the need for cooperative collection development may require additional effort for sound collection development. Libraries were also surveyed about their use of outsourcing. Some libraries reported outsourcing cataloging and shelf preparation of books, but none reported using outsourcing for resource selection. The reason given most often for outsourcing was that it resulted in cost savings. As expected, economic factors are driving both collection development and outsourcing practices.

The TREC 2009 web ad hoc and relevance feedback tasks used a new document collection, the ClueWeb09 dataset, which was crawled from the general web in early 2009. This dataset contains 1 billion web pages, a substantial fraction of which are spam: pages designed to deceive search engines so as to deliver an unwanted payload. We examine the effect of spam on the results of these two tasks. We show that a simple content-based classifier with minimal training is efficient enough to rank the “spamminess” of every page in the dataset using a standard personal computer in 48 hours, and effective enough to yield significant and substantive improvements in the fixed-cutoff precision (estP10) as well as rank measures (estR-Precision, StatMAP, MAP) of nearly all submitted runs. Moreover, using a set of “honeypot” queries the labeling of training data may be reduced to an entirely automatic process. The results of classical information retrieval methods are particularly enhanced by filtering, moving from among the worst to among the best.

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
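The three queries defined in the abstract above (document listing, top-k document retrieval, and document counting) can be pinned down with a brute-force reference version. The sketch below only fixes the semantics of each query by scanning every document; it is not the compressed index the paper develops, and the function names, the choice to count overlapping occurrences, and the tie-breaking rule are assumptions made for illustration.

```python
from typing import List


def count_occurrences(doc: str, pattern: str) -> int:
    """Number of (possibly overlapping) occurrences of pattern in doc."""
    count = 0
    start = doc.find(pattern)
    while start != -1:
        count += 1
        start = doc.find(pattern, start + 1)
    return count


def document_listing(docs: List[str], pattern: str) -> List[int]:
    """Indices of all documents in which the pattern appears."""
    return [i for i, doc in enumerate(docs) if pattern in doc]


def document_counting(docs: List[str], pattern: str) -> int:
    """Number of documents in which the pattern appears."""
    return len(document_listing(docs, pattern))


def top_k_documents(docs: List[str], pattern: str, k: int) -> List[int]:
    """Indices of the k documents where the pattern appears most often.
    Ties are broken by document index (an arbitrary choice)."""
    hits = [(count_occurrences(doc, pattern), i) for i, doc in enumerate(docs)]
    hits = [(freq, i) for freq, i in hits if freq > 0]
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in hits[:k]]
```

A linear scan like this takes time proportional to the total collection size per query, which is the cost that index structures for these problems aim to avoid.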

On Collection Size and Retrieval Effectiveness   Cited by: 3 (self-citations: 0, citations by others: 3)
The relationship between collection size and retrieval effectiveness is particularly important in the context of Web search. We investigate it first analytically and then experimentally, using samples and subsets of test collections. Different retrieval systems vary in how the score assigned to an individual document in a sample collection relates to the score it receives in the full collection; we identify four cases. We apply signal detection (SD) theory to retrieval from samples, taking into account the four cases and using a variety of shapes for relevant and irrelevant distributions. We note that the SD model subsumes several earlier hypotheses about the causes of the decreased precision in samples. We also discuss other models which contribute to an understanding of the phenomenon, particularly relating to the effects of discreteness. Different models provide complementary insights. Extensive use is made of test data, some from official submissions to the TREC-6 VLC track and some new, to illustrate the effects and test hypotheses. We empirically confirm predictions, based on SD theory, that P@n should decline when moving to a sample collection and that average precision and R-precision should remain constant. SD theory suggests the use of recall-fallout plots as operating characteristic (OC) curves. We plot OC curves of this type for a real retrieval system and query set and show that curves for sample collections are similar but not identical to the curve for the full collection.
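The decline of P@n on a sample collection can be illustrated with a toy calculation. The sketch below is not the paper's signal detection model: it assumes a single ranked list with binary relevance labels, and it uses "keep every second document" as a deterministic stand-in for random sampling. The data and function names are invented for illustration.

```python
from typing import List


def precision_at_n(relevance: List[int], n: int) -> float:
    """Fraction of the top-n ranked documents that are relevant (P@n)."""
    return sum(relevance[:n]) / n


def sample_ranking(relevance: List[int], step: int) -> List[int]:
    """Ranking restricted to a sample that keeps every step-th document,
    preserving the original order (a stand-in for random sampling)."""
    return relevance[::step]


# Toy ranking for the full collection: 1 = relevant, 0 = irrelevant.
full_ranking = [1, 1, 1, 1, 0, 0, 0, 0]
half_sample = sample_ranking(full_ranking, 2)

p4_full = precision_at_n(full_ranking, 4)
p4_sample = precision_at_n(half_sample, 4)
```

In this toy case the sample holds only half of the relevant documents, so filling the top four positions requires reaching further down the ranking and P@4 falls from 1.0 to 0.5.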

Museum education collections are inarguably a part of a museum's actual collection, just as are the research/permanent collections. However, past practices indicate that education collections are typically not given equal stature in museological terms. This paper argues that techniques and practices used with research/permanent collections should be applied to education collections, a viewpoint that has not yet been readily embraced. Several methods are addressed for upgrading an education collection to a level similar to a museum's permanent collection. The Lubbock Lake Landmark's education collection serves as a case study to demonstrate the need for the application of proper museological techniques to conform to best practices. A scope of collection was created, preventive conservation techniques were applied, a gap analysis was performed, and legal issues concerning the education collection were addressed.

Modern information retrieval (IR) test collections have grown in size, but the available manpower for relevance assessments has more or less remained constant. Hence, how to reliably evaluate and compare IR systems using incomplete relevance data, where many documents exist that were never examined by the relevance assessors, is receiving a lot of attention. This article compares the robustness of IR metrics to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs—the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task. Following previous work, we artificially reduce the original relevance data to simulate IR evaluation environments with extremely incomplete relevance data. We then investigate the effect of this reduction on discriminative power, which we define as the proportion of system pairs with a statistically significant difference for a given probability of Type I Error, and on Kendall’s rank correlation, which reflects the overall resemblance of two system rankings according to two different metrics or two different relevance data sets. According to these experiments, Q′, nDCG′ and AP′ proposed by Sakai are superior to bpref proposed by Buckley and Voorhees and to Rank-Biased Precision proposed by Moffat and Zobel. We also point out some weaknesses of bpref and Rank-Biased Precision by examining their formal definitions.
Noriko Kando
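Two of the quantities named in the abstract above, Rank-Biased Precision and Kendall's rank correlation, are short enough to sketch directly. The code assumes the standard definition RBP = (1 - p) * sum_i r_i * p^(i-1) with persistence parameter p, and the tau-a form of Kendall's correlation, in which tied pairs count as neither concordant nor discordant. It does not reproduce the article's experimental setup, and the function names are illustrative.

```python
from itertools import combinations
from typing import Sequence


def rank_biased_precision(relevance: Sequence[float], p: float = 0.8) -> float:
    """RBP = (1 - p) * sum over ranks i (1-based) of r_i * p**(i - 1).
    Unjudged documents can be passed as 0, which is how incomplete
    judgments lower the score."""
    return (1 - p) * sum(r * p ** i for i, r in enumerate(relevance))


def kendall_tau(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Kendall's tau-a between two scorings of the same systems, e.g. mean
    metric values under full and reduced relevance data. Needs >= 2 systems."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A tau of 1 means the two scorings order the systems identically, and a tau of -1 means one ordering is the exact reverse of the other.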


The significance of historical periodizations is explored for the purpose of collection development. As historians often rely on historical periods for discussion and research in their scholarship, an understanding of the nature of historical periods as a collection perspective and approach can be useful for the subject librarian engaged in or responsible for history of science collections. An examination of the last 20 years of ISIS bibliographies using historical periods as a chronological tool provides a useful framework and knowledge for research, scholarship, and collection management, especially selection for library collections, in the history of science. The data revealed periodizations and their emphases, their evolution over time, and possible scholarly directions pertinent to collection purposes. Informed knowledge of periodizations and how they may affect collections, acquisitions, and instruction enables librarians to interact effectively with specialists in the history of science.

The heart of the library lies in its collections, and collections have to be built continuously. Budgetary constraints perforce stress the need for a better-defined collection development policy, although the ultimate goal should be an improvement of library service rather than any reduction of library cost. A written collection development policy facilitates a consistent and balanced growth of library resources, and a dynamic policy is one that evolves as the institution grows. Such a policy is based on an understanding of the needs of the community it serves and seeks to define and delimit the goals and objectives of the institution. A collection development statement is not a substitute for book selection; it charts the forest but does not plant the trees. It should be used as a guidepost, not a crutch. Book selection requires judgment and the courage to choose. A sound collection development policy, on the other hand, provides the necessary rationale without which a collection may grow amoebalike, by means of pseudopodia.

《The Reference Librarian》2013,54(22):113-124
Information brokers can exercise a subtle but significant influence on a library's collection development practices through the reference tools they produce. Traditionally, reference sources' use in collection evaluation and their capacity to direct patron demand have helped shape collection building. New optical disc reference devices may have a more pronounced impact on general collections than their predecessors. The expense of the optical systems will likely reduce general collection budgets, while their potentially heavy usage could cause changes in collecting patterns. To ensure that these changes will not diminish the library's ability to serve its community, selectors will need to evaluate each system in full recognition of the broadest possible implications. Careful selection will require clearly established reference goals consistent with the library's mission.

《The Reference Librarian》2013,54(29):145-157
A recent survey revealed that very few libraries were doing any significant weeding in their reference collections. Yet weeding the reference collection may have as great an impact on the library as a whole as weeding the general collection. This article discusses some of the author's experiences weeding the reference collection at Georgia State University and the ways in which general weeding concerns, such as planning for staffing, establishing criteria, and formulating selection policies, may be applied to the reference collection. Reference materials differ in nature and use from materials in the general collection, and this article analyses the impact these differences have on the weeding effort.

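Guan Hanhua (关汉华), 《图书馆论坛》 (Library Tribune), 2007, 27(6): 180-182, 252
China is one of the countries that has preserved the richest holdings of books and records in the world, which owes much to the importance that successive dynasties attached to their written heritage: from the Han and Wei dynasties onward, a Director of the Imperial Library (秘书监) was appointed to manage it full time. The Ming dynasty both continued and changed this arrangement, placing the court collections under the Hanlin Academy's Archivists (翰林典籍). At first, because the court valued the collections, the Archivists made good use of the particular advantages of their position within the Hanlin Academy, and the work went well. From the mid-Ming onward, as the political situation deteriorated, the collections gradually fell into decline. The lessons of this experience remain highly instructive.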

This article examines methods of selection and acquisition for European (as opposed to Canadian) French-language print monographs for a research-level law library collection in North America based on the study of the practices and techniques locally developed in the Nahum Gelber Law Library, McGill University. These techniques represent a combination of non–approval plan–based methods: blanket order, standing orders, online slip service in WorldCat, use of other libraries’ acquisition lists, and firm orders. The methods described in this article are not exclusive to the selection in the subject area of law and could be applied in other academic disciplines.

For many research libraries, remote storage of collections is commonplace and an inevitable strategy for handling collection growth. Despite the prevalence of storage collections, little documentation of collection management strategies vis à vis storage exists in the literature. The processes employed at the University of Michigan Library for collection transfers are described, including an analysis of collection use data to develop criteria for storage. In addition, collection management issues which arise as storage collections mature are identified and questions for future storage programs raised.

This article reports the findings of a survey conducted in 1986 of music collections in academic libraries of the Association of Research Libraries. There were five sections in the survey instrument: (1) General Information; (2) Staffing and Location of the Music Collection; (3) Collection Development Policies and Procedures; (4) Collection Evaluation Policies and Procedures; and (5) Tables. These sections correspond to the major questions which initiated the survey: (1) Where did Louisiana State University (LSU) rank in music holdings and budget among the respondents? (2) In relation to collection size and budget, how did LSU compare in staffing, and where did the majority of other ARL libraries house their material? (3) How many libraries had collection development policies as guidelines for developing their collections? (4) How many libraries had formal collection evaluation procedures, and what bibliographic sources were used for this? (5) How did other institutions view the provision of chamber music "performance parts" and "elementary/secondary classroom materials" in a research collection, and what were the collection preferences for certain book, non-book, and score materials?

The explosion of content in distributed information retrieval (IR) systems requires new mechanisms in order to attain timely and accurate retrieval of unstructured text. This paper shows how to exploit locality by building, using, and searching partial replicas of text collections in a distributed IR system. In this work, a partial replica includes a subset of the documents from larger collection(s) and the corresponding inference network search mechanism. For each query, the distributed system determines whether a partial replica is a good match and, if so, searches it; otherwise it searches the original collection. We demonstrate the scenarios where partial replication performs better than systems that use caches which only store previous query and answer pairs. We first use logs from THOMAS and Excite to examine query locality using query similarity versus exact match. We show that searching replicas can improve locality (from 3 to 19%) over the exact match required by caching. Replicas increase locality because they satisfy queries which are distinct but return the same or very similar answers. We then present a novel inference network replica selection function. We vary its parameters and compare it to previous collection selection functions, demonstrating a configuration that directs most of the appropriate queries to replicas in a replica hierarchy. We then explore the performance of partial replication in a distributed IR system. We compare it with caching and partitioning. Our validated simulator shows that the increases in locality due to replication make it preferable to caching alone, and that even a small increase of 4% in locality translates into a performance advantage. We also show a hybrid system with caches and replicas that performs better than either on its own.
