Similar documents
20 similar documents found.
1.
We have conducted a study to: (1) verify the exhaustiveness of pooling for the purpose of constructing a large-scale test collection, and (2) examine whether a difference in the number of pool documents can affect the relative evaluation of IR systems. We carried out the experiments using search topics, their relevance assessments, and the search results that were submitted for both the pre-test and test of the first NTCIR Workshop. Our results verified the efficiency and the effectiveness of the pooling method, the exhaustiveness of the relevance assessments, and the reliability of the evaluation using the test collection based on the pooling method.
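The pooling procedure itself is not spelled out in the abstract; below is a minimal sketch of the standard top-k pooling approach used to build such judging pools, with the pool depth and run format chosen purely for illustration.

```python
from collections import defaultdict

def build_pool(runs, depth=100):
    """Merge the top-`depth` documents of every submitted run into a
    per-topic judging pool (standard TREC/NTCIR-style pooling).

    runs: iterable of dicts mapping topic_id -> ranked list of doc_ids.
    Returns: dict mapping topic_id -> set of doc_ids to be assessed.
    """
    pool = defaultdict(set)
    for run in runs:
        for topic_id, ranked_docs in run.items():
            pool[topic_id].update(ranked_docs[:depth])
    return pool

# Toy example: two runs for one topic, pooled to depth 3.
run_a = {"T001": ["d3", "d1", "d7", "d2"]}
run_b = {"T001": ["d1", "d9", "d3", "d5"]}
print(build_pool([run_a, run_b], depth=3))  # union of each run's top 3
```

Exhaustiveness in such a study corresponds to asking how many relevant documents fall outside the pool; a larger `depth` costs more assessments but misses fewer relevant documents.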

2.
Evaluation of search engines relies on assessments of search results for selected test queries, from which we would ideally like to draw conclusions in terms of relevance of the results for general (e.g., future, unknown) users. In practice, however, most evaluation scenarios only allow us to conclusively determine the relevance towards the particular assessor that provided the judgments. A factor that cannot be ignored when extending conclusions made from assessors towards users is the possible disagreement on relevance, assuming that a single gold truth label does not exist. This paper presents and analyzes the predicted relevance model (PRM), which allows predicting a particular result's relevance for a random user, based on an observed assessment and knowledge on the average disagreement between assessors. With the PRM, existing evaluation metrics designed to measure binary assessor relevance can be transformed into more robust and effectively graded measures that evaluate relevance towards a random user. It also leads to a principled way of quantifying multiple graded or categorical relevance levels for use as gains in established graded relevance measures, such as normalized discounted cumulative gain, which nowadays often use heuristic and data-independent gain values. Given a set of test topics with graded relevance judgments, the PRM allows evaluating systems on different scenarios, such as their capability of retrieving top results, or how well they are able to filter out non-relevant ones. Its use in actual evaluation scenarios is illustrated on several information retrieval test collections.
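The abstract does not give the PRM's estimation details; as a minimal sketch of the end point it describes, the snippet below plugs per-label relevance probabilities in as gains in nDCG. The label-to-gain mapping shown is invented for illustration, not taken from the paper.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the usual log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(ranked_gains, ideal_gains):
    """nDCG: DCG of the ranking divided by DCG of the ideal ordering.
    `ideal_gains` should cover all judged documents for the topic."""
    ideal = sorted(ideal_gains, reverse=True)[:len(ranked_gains)]
    denom = dcg(ideal)
    return dcg(ranked_gains) / denom if denom > 0 else 0.0

# Hypothetical gains: each assessor label is mapped to the estimated
# probability that a random user would call the result relevant.
gain = {"non-relevant": 0.05, "partially relevant": 0.40, "highly relevant": 0.90}
labels_in_rank_order = ["highly relevant", "non-relevant", "partially relevant"]
gains = [gain[label] for label in labels_in_rank_order]
print(ndcg(gains, gains))
```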

3.
The most common approach to measuring the effectiveness of Information Retrieval systems is by using test collections. The Contextual Suggestion (CS) TREC track provides an evaluation framework for systems that recommend items to users given their geographical context. The specific nature of this track allows the participating teams to identify candidate documents either from the Open Web or from the ClueWeb12 collection, a static version of the web. In the judging pool, the documents from the Open Web and the ClueWeb12 collection are distinguished. Hence, each system submission should be based only on one resource, either the Open Web (identified by URLs) or ClueWeb12 (identified by ids). To achieve reproducibility, ranking web pages from ClueWeb12 should be the preferred method for scientific evaluation of CS systems, but it has been found that the systems that build their suggestion algorithms on top of input taken from the Open Web consistently achieve higher effectiveness. Because most of the systems take a rather similar approach to making CSs, this raises the question whether systems built by researchers on top of ClueWeb12 are still representative of those that would work directly on industry-strength web search engines. Do we need to sacrifice reproducibility for the sake of representativeness? We study the difference in effectiveness between Open Web systems and ClueWeb12 systems by analyzing the relevance assessments of documents identified from both the Open Web and ClueWeb12. Then, we identify documents that overlap between the relevance assessments of the Open Web and ClueWeb12, observing a dependency between the relevance assessments and whether the document was taken from the Open Web or from ClueWeb12. After that, we identify documents from the relevance assessments of the Open Web which exist in the ClueWeb12 collection but do not exist in the ClueWeb12 relevance assessments. We use these documents to expand the ClueWeb12 relevance assessments. Our main findings are twofold. First, our empirical analysis of the relevance assessments of two years of the CS track shows that Open Web documents receive better ratings than ClueWeb12 documents, especially if we look at the documents in the overlap. Second, our approach for selecting candidate documents from the ClueWeb12 collection based on information obtained from the Open Web takes a step towards partially bridging the gap in effectiveness between Open Web and ClueWeb12 systems, while at the same time achieving reproducible results on a well-known, representative sample of the web.
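The abstract does not say how Open Web judgments are mapped onto ClueWeb12 documents; one plausible mechanism is URL-based matching, sketched below. The normalisation rule and the data layout are assumptions for illustration, not the paper's procedure.

```python
def normalize(url):
    """Crude URL normalisation for matching (assumption: compare host and
    path after stripping the scheme and any trailing slash)."""
    return url.lower().split("://")[-1].rstrip("/")

def expand_qrels(openweb_qrels, clueweb_url_to_id, clueweb_qrels):
    """Copy Open Web judgments onto ClueWeb12 ids that are not yet judged.

    openweb_qrels: {(topic, url): grade}
    clueweb_url_to_id: {url: clueweb12_doc_id}
    clueweb_qrels: {(topic, doc_id): grade}, updated in place.
    """
    lookup = {normalize(u): doc_id for u, doc_id in clueweb_url_to_id.items()}
    for (topic, url), grade in openweb_qrels.items():
        doc_id = lookup.get(normalize(url))
        if doc_id is not None and (topic, doc_id) not in clueweb_qrels:
            clueweb_qrels[(topic, doc_id)] = grade
    return clueweb_qrels
```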

4.
Content-based image retrieval (CBIR) algorithms have been seen as a promising access method for digital photograph collections. Unfortunately, we have very little evidence of the usefulness of these algorithms in real user needs and contexts. In this paper, we introduce a test collection for the evaluation of CBIR algorithms. In the test collection, the performance testing is based on photograph similarity perceived by end-users in the context of realistic illustration tasks and environment. The building process and the characteristics of the resulting test collection are outlined, including a typology of similarity criteria expressed by the subjects judging the similarity of photographs. A small-scale study on the consistency of similarity assessments is presented. A case evaluation of two CBIR algorithms is reported. The results show a clear correlation between the subjects' similarity assessments and the functioning of the feature parameters of the tested algorithms.

5.
The influential Text REtrieval Conference (TREC) has always relied upon specialist assessors or occasionally participating groups to create relevance judgements for the tracks that it runs. Recently, however, crowdsourcing has been championed as a cheap, fast and effective alternative to traditional TREC-like assessments. In 2010, TREC tracks experimented with crowdsourcing for the very first time. In this paper, we report our successful experience in creating relevance assessments for the TREC Blog track 2010 top news stories task using crowdsourcing. In particular, we crowdsourced both real-time newsworthiness assessments for news stories as well as traditional relevance assessments for blog posts. We conclude that crowdsourcing appears to be not only a feasible, but also a cheap and fast means to generate relevance assessments. Furthermore, we detail our experiences running the crowdsourced evaluation of the TREC Blog track, discuss the lessons learned, and provide best practices.

6.
Past research has identified many different types of relevance in information retrieval (IR). So far, however, most evaluation of IR systems has been through batch experiments conducted with test collections containing only expert, topical relevance judgements. Recently, there has been some movement away from this traditional approach towards interactive, more user-centred methods of evaluation. However, these are expensive for evaluators in terms of both time and resources. This paper describes a new evaluation methodology, using a task-oriented test collection, which combines the advantages of traditional non-interactive testing with a more user-centred emphasis. The main features of a task-oriented test collection are the adoption of the task, rather than the query, as the primary unit of evaluation and the naturalistic character of the relevance judgements.

7.
For system-based information retrieval evaluation, constructing a test collection remains a costly task. Producing relevance judgments is an expensive, time-consuming task which has to be performed by human assessors. For a large collection, it is not viable to assess the relevancy of every single document in the corpus against each topic. In an experimental environment, partial judgments created on the basis of a pooling method therefore substitute for a complete assessment of documents for relevancy. Due to the increasing number of documents, topics, and retrieval systems, the need to perform low-cost evaluations while obtaining reliable results is essential. Researchers are seeking techniques to reduce the cost of the experimental IR evaluation process by reducing the number of relevance judgments to be performed, or even eliminating them, while still obtaining reliable results. In this paper, various state-of-the-art approaches to performing low-cost retrieval evaluation are discussed under each of the following categories: selecting the best sets of documents to be judged; calculating evaluation measures that are robust to incomplete judgments; statistical inference of evaluation metrics; inference of judgments on relevance; query selection; techniques to test the reliability of the evaluation and the reusability of the constructed collections; and other alternative methods to pooling. This paper is intended to link the reader to the corpus of ‘must read’ papers in the area of low-cost evaluation of IR systems.
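The paper is a survey, so it does not commit to a single algorithm; as one concrete example of an evaluation measure designed to tolerate incomplete judgments, here is a minimal sketch of bpref (a widely used measure of this kind, chosen here purely for illustration).

```python
def bpref(ranking, judged_relevant, judged_nonrelevant):
    """Sketch of bpref: for each retrieved judged-relevant document,
    penalise by the fraction of judged non-relevant documents ranked
    above it. Unjudged documents are ignored, which is what makes the
    measure usable when judgments are incomplete."""
    R, N = len(judged_relevant), len(judged_nonrelevant)
    if R == 0:
        return 0.0
    nonrel_above = 0
    total = 0.0
    for doc in ranking:
        if doc in judged_nonrelevant:
            nonrel_above += 1
        elif doc in judged_relevant:
            if N == 0:
                total += 1.0
            else:
                total += 1.0 - min(nonrel_above, R) / min(R, N)
    return total / R

# Toy example: d4 is judged non-relevant and ranked above d2 and d3.
print(bpref(["d1", "d4", "d2", "d3"],
            judged_relevant={"d1", "d2", "d3"},
            judged_nonrelevant={"d4"}))
```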

8.
This study examines the use of an ontology as a search tool. Sixteen subjects created queries using the Concept-based Information Retrieval Interface (CIRI) and a regular baseline IR interface. The simulated work task method was used to make the searching situations realistic. Subjects' search experiences, queries and search results were examined. The numbers of search concepts and keys, as well as their overlap in the queries, were investigated. The effectiveness of the CIRI and baseline queries was compared. An Ontology Index (OI) was calculated for all search tasks, and the correlation between the OI and the overlap of search concepts and keys in queries was investigated. The number of search keys and concepts was higher in CIRI queries than in baseline interface queries. The overlap of search keys was also higher among CIRI users than among baseline users. Both of these findings are due to CIRI's expansion feature. There was no clear correlation between the OI and the overlap of search concepts and keys. The search results were evaluated with generalised precision and recall, and with relevance scores based on individual relevance assessments. The baseline interface queries performed better in all comparisons, but the difference was statistically significant only in the relevance scores based on individual relevance assessments.

9.
Research on a user-satisfaction-based model for evaluating the quality of library electronic resources
This paper proposes a user-satisfaction-based model for evaluating the quality of library electronic resources and uses the model to design a questionnaire for a satisfaction survey of users of the Sun Yat-sen University Library. Building on correlation and regression analyses of how the four independent variables in the model relate to, and influence, perceived user value and user satisfaction, the model is revised; quadrant analysis is then applied to provide reference suggestions for improving the electronic resource collections of the Sun Yat-sen University Library.

10.
This paper proposes a method, based on Dempster-Shafer (D-S) evidence theory, for evaluating the overall priority of books being considered for acquisition. The method first establishes a framework for the evaluation process and a model of three types of demand for a book; then, for each candidate book, the degree of each of the three types of demand is treated as evidence in D-S theory, and experts' uncertain priority judgements are described by basic probability assignment values; finally, an evidence combination algorithm for the case of unequal evidence weights is proposed to complete the overall priority evaluation.
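The abstract does not reproduce the combination algorithm itself; as background, here is a minimal sketch of the classical Dempster rule of combination for two mass functions. The weighted variant for unequal evidence weights mentioned in the abstract is not shown, and the toy masses are invented.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Classical Dempster rule: combine two basic probability assignments.

    m1, m2: dicts mapping frozenset (focal element) -> mass, each summing to 1.
    Returns the combined assignment, normalised by the non-conflicting mass.
    """
    combined = {}
    conflict = 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Toy example over the frame {high, medium, low} acquisition priority.
m1 = {frozenset({"high"}): 0.6, frozenset({"high", "medium"}): 0.4}
m2 = {frozenset({"high"}): 0.5, frozenset({"medium", "low"}): 0.5}
print(dempster_combine(m1, m2))
```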

11.
Research on a comprehensive evaluation model for enterprise knowledge management based on artificial neural networks
Wang Yue. 《图书情报工作》 (Library and Information Service), 2011, 55(18): 79-82
To overcome the randomness of the comprehensive knowledge management evaluation process and the subjective uncertainty of expert assessors, artificial neural network techniques are applied to enterprise knowledge management evaluation: evaluation indicators and a BP (back-propagation) network structure are designed, and a multi-indicator comprehensive evaluation model is proposed. A simulated example is used to comprehensively assess the knowledge management of enterprises in China's electronics industry; comparison with expert assessments verifies the effectiveness of the model and suggests a possible avenue for the sound evaluation of enterprise knowledge management.

12.
Many queries have multiple interpretations; they are ambiguous or underspecified. This is especially true in the context of Web search. To account for this, much recent research has focused on creating systems that produce diverse ranked lists. In order to validate these systems, several new evaluation measures have been created to quantify diversity. Ideally, diversity evaluation measures would distinguish between systems by the amount of diversity in the ranked lists they produce. Unfortunately, diversity is also a function of the collection over which the system is run and a system's performance at ad-hoc retrieval. A ranked list built from a collection that does not cover multiple subtopics cannot be diversified; neither can a ranked list that contains no relevant documents. To ensure that we are assessing systems by their diversity, we develop (1) a family of evaluation measures that take into account the diversity of the collection and (2) a meta-evaluation measure that explicitly controls for performance. We demonstrate experimentally that our new measures can achieve substantial improvements in sensitivity to diversity without reducing discriminative power.
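The abstract does not define its new measures; as a baseline illustration of how diversity is typically quantified over a ranked list, here is a minimal sketch of subtopic recall at depth k (a standard diversity measure, not the one proposed in the paper).

```python
def subtopic_recall(ranking, doc_subtopics, k=10):
    """Fraction of a topic's known subtopics covered by the top-k documents.

    ranking: ranked list of doc ids.
    doc_subtopics: dict doc_id -> set of subtopic ids the document covers.
    """
    all_subtopics = set().union(*doc_subtopics.values()) if doc_subtopics else set()
    if not all_subtopics:
        return 0.0
    covered = set()
    for doc in ranking[:k]:
        covered |= doc_subtopics.get(doc, set())
    return len(covered & all_subtopics) / len(all_subtopics)

# Toy example: the top-2 documents cover two of the three subtopics.
doc_subtopics = {"d1": {"s1"}, "d2": {"s1", "s2"}, "d3": {"s3"}}
print(subtopic_recall(["d1", "d2"], doc_subtopics, k=2))  # 2/3
```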

13.
A study of the Canadian Conspectus online database explored the reliability of the data and assessed the validity of the Conspectus methodology. A cross-library analysis demonstrated that the assessments for existing collection strength (ECS) and current collecting intensity (CCI) for the 51 subdivisions of psychology were not representative of the libraries' collection sizes or their materials budgets. The CCI assessments for each library for the subject subdivision for memory (BF370-BF395) were also examined. The results of correlation analyses demonstrated that the library's collection size and its materials budget were better predictors of the level of current collecting intensity (as measured by the number of recent acquisitions on memory) than the CCI assessment. Findings drawn from the literature of psychophysics and opinion measurement, as well as from relevance research, demonstrate problems with biasing effects in the category scaling methodology used in the Conspectus.

14.
[Purpose/Significance] Information retrieval deals with the uncertainty of relevance, yet at the technical level uncertainty is usually converted into deterministic processing, and little attention is paid to the uncertain semantics present in information content. In some retrieval application scenarios this can significantly affect retrieval results, so targeted methods for handling such uncertain semantics need to be considered. [Method/Process] A representation of uncertain semantics based on D-S evidence theory is proposed, together with a retrieval model that fuses these uncertain semantic features with textual and topical features; experiments on the proposed model are carried out using publicly available datasets. [Result/Conclusion] The evidence-interval concept in D-S theory can describe this kind of uncertainty, and multi-source evidence fusion can combine the uncertain semantic features with textual and topical features; suitable parameters are obtained through model training, which improves retrieval results. The model is, in theory, inclusive and extensible, and fusing other retrieval methods on the basis of this model is a topic for further research.
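The abstract does not specify the fusion function; a minimal sketch of one common choice, a weighted linear combination of the three per-document feature scores, is shown below. The weights and score names are assumptions; in practice they would be learned from training data, as the abstract indicates.

```python
def fused_score(text_score, topic_score, uncertainty_score,
                w_text=0.5, w_topic=0.3, w_uncertainty=0.2):
    """Hypothetical linear fusion of per-document feature scores.
    The weights here are placeholders, to be tuned or learned."""
    return (w_text * text_score
            + w_topic * topic_score
            + w_uncertainty * uncertainty_score)

# Rank documents by the fused score (toy feature values).
docs = {"d1": (0.8, 0.6, 0.3), "d2": (0.5, 0.9, 0.7)}
ranked = sorted(docs, key=lambda d: fused_score(*docs[d]), reverse=True)
print(ranked)
```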

15.
《资料收集管理》2013,38(3-4):49-61
A regular, systematic collection assessment program is essential to a well-managed collection development operation. But too often librarians lack the knowledge and skills to plan and conduct such assessments. To help librarians develop their professional competence in collection evaluation, this article recommends a collection assessment training manual. Such a manual should provide not only the practical techniques and procedures necessary to conduct, analyze, and report the assessment activities and results, but also the broader rationale needed to develop tailor-made evaluation programs to meet a variety of assessment objectives. The article also recommends and discusses a number of collection-centered and client-centered measurement techniques that should be included in the manual, as well as planning and reporting considerations and suggestions for format. Developing an assessment manual requires considerable effort, but the effort is well invested when the results are better collection development decisions.

16.
In this paper we evaluate the application of data fusion or meta-search methods, combining different algorithms and XML elements, to content-oriented retrieval of XML structured data. The primary approach is the combination of probabilistic methods using logistic regression and the Okapi BM-25 algorithm for estimating document or XML element relevance, in conjunction with Boolean approaches for some query elements. In the evaluation we use the INEX XML test collection to examine the relative performance of individual algorithms and elements, and compare these to the performance of the data fusion approaches.
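The abstract names the component algorithms but not the fusion operator; as an illustration, here is a minimal sketch of CombSUM-style fusion over min-max-normalised scores, a common meta-search baseline. The paper's actual combination method may differ, and the run contents below are invented.

```python
def min_max_normalize(scores):
    """Map one run's scores into [0, 1] so different scoring scales are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def comb_sum(runs):
    """CombSUM: sum each document's (or element's) normalised scores across runs.

    runs: list of dicts doc_id -> raw score, e.g. one from BM25 and one
    from a logistic-regression ranker.
    """
    fused = {}
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

bm25_run = {"e1": 12.3, "e2": 8.1, "e3": 2.4}
logreg_run = {"e2": 0.91, "e1": 0.55, "e4": 0.40}
print(comb_sum([bm25_run, logreg_run]))
```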

17.
Measuring Search Engine Quality
The effectiveness of twenty public search engines is evaluated using TREC-inspired methods and a set of 54 queries taken from real Web search logs. The World Wide Web is taken as the test collection and a combination of crawler and text retrieval system is evaluated. The engines are compared on a range of measures derivable from binary relevance judgments of the first seven live results returned. Statistical testing reveals a significant difference between engines and high intercorrelations between measures. Surprisingly, given the dynamic nature of the Web and the time elapsed, there is also a high correlation between results of this study and a previous study by Gordon and Pathak. For nearly all engines, there is a gradual decline in precision at increasing cutoff after some initial fluctuation. Performance of the engines as a group is found to be inferior to the group of participants in the TREC-8 Large Web task, although the best engines approach the median of those systems. Shortcomings of current Web search evaluation methodology are identified and recommendations are made for future improvements. In particular, the present study and its predecessors deal with queries which are assumed to derive from a need to find a selection of documents relevant to a topic. By contrast, real Web search reflects a range of other information need types which require different judging and different measures.
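The measures used are described only as "derivable from binary relevance judgments of the first seven live results returned"; as one concrete instance of such a measure, here is a minimal sketch of precision at cutoff k (the study's other measures are not reproduced).

```python
def precision_at_k(ranking, relevant, k):
    """Precision@k: fraction of the first k results judged relevant.
    Missing results beyond the end of the ranking count as non-relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k

# Toy example with a cutoff of 7, mirroring the "first seven results".
print(precision_at_k(["d1", "d2", "d3", "d4", "d5", "d6", "d7"],
                     relevant={"d1", "d3", "d6"}, k=7))  # 3/7
```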

18.
Scaling Up the TREC Collection
Due to the popularity of Web search engines, a large proportion of real text retrieval queries are now processed over collections measured in tens or hundreds of gigabytes. A new Very Large test Collection (VLC) has been created to support qualification, measurement and comparison of systems operating at this level and to permit the study of the properties of very large collections. The VLC is an extension of the well-known TREC collection and has been distributed under the same conditions. A simple set of efficiency and effectiveness measures has been defined to encourage comparability of reporting. The 20 gigabyte first edition of the VLC and a representative 10% sample have been used in a special interest track of the 1997 Text Retrieval Conference (TREC-6). The unaffordable cost of obtaining complete relevance assessments over collections of this scale is avoided by concentrating on early precision and relying on the core TREC collection to support detailed effectiveness studies. Results obtained by TREC-6 VLC track participants are presented here. All groups observed a significant increase in early precision as collection size increased. Explanatory hypotheses are advanced for future empirical testing. A 100 gigabyte second edition of the VLC (VLC2) has recently been compiled and distributed for use in TREC-7 in 1998.

19.
Primary and secondary school library collections are an important curriculum resource. To adapt to the new situation, new tasks and new requirements facing primary and secondary school libraries, and to improve their capacity to serve education and teaching, a school-based collection evaluation mechanism should be established. It should take as its basis the curriculum plans and curriculum standards, the 《中小学图书馆(室)规程》 (Regulations for Primary and Secondary School Libraries/Reading Rooms), and the recommended catalogues issued by education authorities; in light of the actual teaching practice of each school, it should explore acquisition criteria, collection size, collection quality, collection structure and collection use.

20.
Research on cross-language information retrieval (CLIR) has typically been restricted to settings using binary relevance assessments. In this paper, we present evaluation results for dictionary-based CLIR using graded relevance assessments in a best match retrieval environment. A text database containing newspaper articles and a related set of 35 search topics were used in the tests. First, monolingual baseline queries were automatically formed from the topics. Secondly, source language topics (in English, German, and Swedish) were automatically translated into the target language (Finnish), using structured target queries. The effectiveness of the translated queries was compared to that of the monolingual queries. Thirdly, pseudo-relevance feedback was used to expand the original target queries. CLIR performance was evaluated using three relevance thresholds: stringent, regular, and liberal. When the regular or liberal threshold was used, reasonable performance was achieved. With the stringent threshold, equally high performance could not be achieved. At all relevance thresholds, the performance of the translated queries was successfully raised by pseudo-relevance feedback based query expansion. However, the performance at the stringent threshold, relative to the other thresholds, could not be raised by this method.
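Evaluating at "stringent, regular, and liberal" thresholds amounts to binarising graded judgments at different minimum grades before computing binary effectiveness measures; a minimal sketch is shown below. The 0-3 grade scale and the threshold values are assumptions for illustration.

```python
def binarize_qrels(graded_qrels, min_grade):
    """Turn graded judgments into binary relevance at a given threshold.

    graded_qrels: dict (topic, doc) -> integer grade (assumed 0-3 here).
    Returns the set of (topic, doc) pairs counted as relevant.
    """
    return {key for key, grade in graded_qrels.items() if grade >= min_grade}

qrels = {("T1", "d1"): 3, ("T1", "d2"): 1, ("T1", "d3"): 2}
liberal = binarize_qrels(qrels, min_grade=1)    # all three documents
regular = binarize_qrels(qrels, min_grade=2)    # d1 and d3
stringent = binarize_qrels(qrels, min_grade=3)  # only d1
print(len(liberal), len(regular), len(stringent))
```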
