Found 20 similar documents (search time: 15 ms)
1.
Qinglei Wang Yanan Qian Ruihua Song Zhicheng Dou Fan Zhang Tetsuya Sakai Qinghua Zheng 《Information Retrieval》2013,16(4):484-503
Web search queries are often ambiguous or faceted, and the task of identifying the major underlying senses and facets of queries has received much attention in recent years. We refer to this task as query subtopic mining. In this paper, we propose to use surrounding text of query terms in top retrieved documents to mine subtopics and rank them. We first extract text fragments containing query terms from different parts of documents. Then we group similar text fragments into clusters and generate a readable subtopic for each cluster. Based on the cluster and the language model trained from a query log, we calculate three features and combine them into a relevance score for each subtopic. Subtopics are finally ranked by balancing relevance and novelty. Our evaluation experiments with the NTCIR-9 INTENT Chinese Subtopic Mining test collection show that our method significantly outperforms a query log based method proposed by Radlinski et al. (2010) and a search result clustering based method proposed by Zeng et al. (2004) in terms of precision, I-rec, D-nDCG and D#-nDCG, the official evaluation metrics used at the NTCIR-9 INTENT task. Moreover, our generated subtopics are significantly more readable than those generated by the search result clustering method.
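The final step of the abstract above, ranking by balancing relevance and novelty, can be illustrated with a greedy, MMR-style selection. This is only a sketch under stated assumptions: the abstract does not give the paper's actual scoring formula, so the linear trade-off, the `trade_off` parameter, and the names `rank_subtopics`, `relevance`, and `similarity` are all hypothetical.

```python
def rank_subtopics(relevance, similarity, trade_off=0.7):
    """Greedy MMR-style ranking (an assumed stand-in, not the paper's formula).

    relevance:  dict mapping subtopic -> relevance score
    similarity: function (subtopic, subtopic) -> similarity in [0, 1]
    trade_off:  weight on relevance; (1 - trade_off) penalizes redundancy
    """
    remaining = set(relevance)
    ranked = []
    while remaining:
        def score(candidate):
            # Redundancy is the highest similarity to anything already ranked.
            redundancy = max((similarity(candidate, s) for s in ranked), default=0.0)
            return trade_off * relevance[candidate] - (1 - trade_off) * redundancy

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


# Toy data: "a" and "b" are near-duplicates, "c" is less relevant but novel.
toy_relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
toy_pairs = {frozenset(("a", "b")): 0.95}

def toy_similarity(x, y):
    return toy_pairs.get(frozenset((x, y)), 0.1)

toy_ranking = rank_subtopics(toy_relevance, toy_similarity, trade_off=0.5)
```

With equal weight on relevance and novelty, the novel subtopic "c" is promoted above the near-duplicate "b", even though "b" has the higher relevance score.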
2.
Search engine results are often biased towards a certain aspect of a query or towards a certain meaning for ambiguous query terms. Diversification of search results offers a way to supply the user with a better balanced result set, increasing the probability that a user finds at least one document suiting her information need. In this paper, we present a reranking approach based on minimizing variance of Web search results to improve topic coverage in the top-k results. We investigate two different document representations as the basis for reranking: smoothed language models and topic models derived by Latent Dirichlet allocation. To evaluate our approach we selected 240 queries from Wikipedia disambiguation pages. This provides us with ambiguous queries together with a community generated balanced representation of their (sub)topics. For these queries we crawled two major commercial search engines. In addition, we present a new evaluation strategy based on Kullback-Leibler divergence and Wikipedia. We evaluate this method using the TREC sub-topic evaluation on the one hand, and manually annotated query results on the other hand. Our results show that minimizing variance in search results by reranking relevant pages significantly improves topic coverage in the top-k results with respect to Wikipedia, and gives a good overview of the overall search result. Moreover, latent topic models achieve competitive diversification with significantly less reranking. Finally, our evaluation reveals that our automatic evaluation strategy using Kullback-Leibler divergence correlates well with α-nDCG scores used in manual evaluation efforts.
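As a rough illustration of a Kullback-Leibler-based coverage check like the one mentioned in the abstract above, the sketch below compares the topic distribution of a result list against a reference distribution. The direction of the divergence, the additive smoothing, and all names (`kl_divergence`, `reference`, `balanced`, `skewed`) are assumptions for illustration; the abstract does not specify these details.

```python
import math


def kl_divergence(p, q, epsilon=1e-9):
    """KL(p || q) in nats for two topic distributions given as dicts.

    A small epsilon is added to every topic and both distributions are
    renormalized, so topics missing from q do not cause division by zero.
    """
    topics = sorted(set(p) | set(q))
    p_smoothed = [p.get(t, 0.0) + epsilon for t in topics]
    q_smoothed = [q.get(t, 0.0) + epsilon for t in topics]
    p_total, q_total = sum(p_smoothed), sum(q_smoothed)
    return sum(
        (a / p_total) * math.log((a / p_total) / (b / q_total))
        for a, b in zip(p_smoothed, q_smoothed)
    )


# Hypothetical reference: two senses of an ambiguous query, equally weighted.
reference = {"animal": 0.5, "car": 0.5}
balanced = {"animal": 0.5, "car": 0.5}  # result list covering both senses
skewed = {"animal": 0.9, "car": 0.1}    # result list biased to one sense

balanced_divergence = kl_divergence(reference, balanced)
skewed_divergence = kl_divergence(reference, skewed)
```

A lower divergence indicates that the result list covers the reference topics more evenly; here the skewed list scores worse than the balanced one.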
3.
While past research has shown that learning outcomes can be influenced by the amount of effort students invest during the learning process, there has been little research into this question for scenarios where people use search engines to learn. In fact, learning-related tasks represent a significant fraction of the time users spend using Web search, so methods for evaluating and optimizing search engines to maximize learning are likely to have broad impact. Thus, we introduce and evaluate a retrieval algorithm designed to maximize educational utility for a vocabulary learning task, in which users learn a set of important keywords for a given topic by reading representative documents on diverse aspects of the topic. Using a crowdsourced pilot study, we compare the learning outcomes of users across four conditions corresponding to rankings that optimize for different levels of keyword density. We find that adding keyword density to the retrieval objective gave significant learning gains on some topics, with higher levels of keyword density generally corresponding to more time spent reading per word, and stronger learning gains per word read. We conclude that our approach to optimizing search ranking for educational utility leads to retrieved document sets that ultimately may result in more efficient learning of important concepts.
4.
We present a novel approach to re-ranking a document list that was retrieved in response to a query so as to improve precision
at the very top ranks. The approach is based on utilizing a second list that was retrieved in response to the query by using,
for example, a different retrieval method and/or query representation. In contrast to commonly-used methods for fusion of retrieved lists that rely solely on retrieval scores (ranks) of documents, our approach also exploits inter-document-similarities between the lists—a potentially rich source of additional information. Empirical evaluation shows that our methods are effective
in re-ranking TREC runs; the resultant performance also favorably compares with that of a highly effective fusion method.
Furthermore, we show that our methods can potentially help to tackle a long-standing challenge, namely, integration of document-based
and cluster-based retrieved results.
5.
Oren Kurland 《Information Retrieval》2009,12(4):437-460
To obtain high precision at top ranks by a search performed in response to a query, researchers have proposed a cluster-based
re-ranking paradigm: clustering an initial list of documents that are the most highly ranked by some initial search, and using
information induced from these (often called) query-specific clusters for re-ranking the list. However, results concerning the effectiveness of various automatic cluster-based re-ranking methods have been inconclusive. We show that using query-specific clusters for automatic re-ranking
of top-retrieved documents is effective with several methods in which clusters play different roles, among which is the smoothing of document language models. We do so by adapting previously-proposed cluster-based retrieval approaches, which are based on (static) query-independent
clusters for ranking all documents in a corpus, to the re-ranking setting wherein clusters are query-specific. The best performing
method that we develop outperforms both the initial document-based ranking and some previously proposed cluster-based re-ranking
approaches; furthermore, this algorithm consistently outperforms a state-of-the-art pseudo-feedback-based approach. In further
exploration we study the performance of cluster-based smoothing methods for re-ranking with various (soft and hard) clustering
algorithms, and demonstrate the importance of clusters in providing context from the initial list through a comparison to
using single documents to this end.
6.
Bing and Google customize their results to target people with different geographic locations and languages but, despite the importance of search engines for web users and webometric research, the extent and nature of these differences are unknown. This study compares the results of seventeen random queries submitted automatically to Bing for thirteen different English geographic search markets at monthly intervals. Search market choice alters a small majority of the top 10 results but less than a third of the complete sets of results. Variation in the top 10 results over a month was about the same as variation between search markets but variation over time was greater for the complete results sets. Most worryingly for users, there were almost no ubiquitous authoritative results: only one URL was always returned in the top 10 for all search markets and points in time, and Wikipedia was almost completely absent from the most common top 10 results. Most importantly for webometrics, results from at least three different search markets should be combined to give more reliable and comprehensive results, even for queries that return fewer than the maximum number of URLs.
7.
Olivier Chapelle Shihao Ji Ciya Liao Emre Velipasaoglu Larry Lai Su-Lin Wu 《Information Retrieval》2011,14(6):572-592
We study the problem of web search result diversification in the case where intent based relevance scores are available. A
diversified search result will hopefully satisfy the information need of users who may have different intents. In this
context, we first analyze the properties of an intent-based metric, ERR-IA, to measure relevance and diversity altogether.
We argue that this is a better metric than some previously proposed intent aware metrics and show that it has a better correlation
with abandonment rate. We then propose an algorithm to rerank web search results based on optimizing an objective function
corresponding to this metric and evaluate it on shopping related queries.
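For readers unfamiliar with ERR-IA, the sketch below computes it in the way the metric is commonly defined: Expected Reciprocal Rank (ERR) is computed separately for each intent and the per-intent values are averaged, weighted by the intent probabilities. The abstract does not state the formula, so the gain mapping (2^g - 1) / 2^g_max and the function and variable names used here are assumptions.

```python
def err(grades, max_grade):
    """Expected Reciprocal Rank for one intent.

    grades: relevance grade of each ranked document, top rank first.
    Each grade g is mapped to a stop probability (2**g - 1) / 2**max_grade.
    """
    score = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, grade in enumerate(grades, start=1):
        p_stop = (2 ** grade - 1) / 2 ** max_grade
        score += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return score


def err_ia(grades_by_intent, intent_probs, max_grade):
    """Intent-aware ERR: per-intent ERR weighted by P(intent | query)."""
    return sum(
        prob * err(grades_by_intent[intent], max_grade)
        for intent, prob in intent_probs.items()
    )


# Toy query with two equally likely intents and binary grades.
intent_probs = {"buy": 0.5, "review": 0.5}
diversified = {"buy": [1, 0], "review": [0, 1]}  # one document per intent
one_sided = {"buy": [1, 1], "review": [0, 0]}    # both documents serve "buy"

diversified_score = err_ia(diversified, intent_probs, max_grade=1)
one_sided_score = err_ia(one_sided, intent_probs, max_grade=1)
```

In this toy case the diversified ranking scores 0.375 and the one-sided ranking 0.3125, which reflects how the metric rewards covering several intents near the top of the list.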
8.
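Research on a Case-Oriented Tacit Knowledge Mining Method  Total citations: 1 (self-citations: 0, citations by others: 1)
Cases are a knowledge representation of past experience and an important form in which organizations preserve tacit knowledge; mining tacit knowledge from cases is an important part of knowledge management. Case representation is the first and key step in case mining. This paper proposes an ontology-based case representation model that can describe cases accurately and organize them clearly, based on the common vocabulary in the ontology and the multiple relations among concepts, and that offers high extensibility and flexibility. To improve the efficiency of mining tacit knowledge from cases, the paper introduces the concept of a predicate path graph and its related theory, together with Ex-Apriori, a multidimensional association rule mining algorithm based on the predicate path graph that needs only a single scan of the case base. Finally, the effectiveness of the method is verified by building a small case base of mobile phone repairs.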
9.
Streaming data poses a variety of new and interesting challenges for information retrieval and text analysis. Unlike static
document collections, which are typically analyzed and indexed off-line to support ad-hoc queries, streaming data often must
be analyzed on the fly and acted on as the data passes through the analysis system. Speech is one example of streaming data
that is a challenge to exploit, yet has significant potential to provide value in a knowledge management system. We are specifically
interested in techniques that analyze streaming data and automatically find collateral information, or information that clarifies, expands, and generally enhances the value of the streaming data. We present a system that
analyzes a data stream and automatically finds documents related to the current topic of discussion in the data stream. Experimental
results show that the system generates result lists with an average precision at 10 hits of better than 60%. We also present
a hit-list re-ranking technique based on named entity analysis and automatic text categorization that can improve the search
results by 6%–12%.
10.
Cluster-based and passage-based document retrieval paradigms were shown to be effective. While the former are based on utilizing
query-related corpus context manifested in clusters of similar documents, the latter address the fact that a document can
be relevant even if only a very small part of it contains query-pertaining information. Hence, cluster-based approaches could
be viewed as based on “expanding” the document representation, while passage-based approaches can be thought of as utilizing
a “contracted” document representation. We present a study of the relative benefits of using each of these two approaches,
and of the potential merits of their integration. To that end, we devise two methods that integrate whole-document-based,
cluster-based and passage-based information. The methods are applied for the re-ranking task, that is, re-ordering documents in an initially retrieved list so as to improve precision at the very top ranks. Extensive
empirical evaluation attests to the potential merits of integrating these information types. Specifically, the resultant performance
substantially transcends that of the initial ranking; and, is often better than that of a state-of-the-art pseudo-feedback-based
query expansion approach.
11.
R W Wender E L Fruehauf M S Vent C D Wilson 《Bulletin of the Medical Library Association》1977,65(3):338-341
Part II of this study of the needs of clinicians for continuing medical education (CME) examines the results of a questionnaire sent to Oklahoma physicians to determine if they would request formal CME courses in the same subject areas in which they had previously requested information from librarians. The degree of correlation between literature search requests and responses to the questionnaire confirms that the analysis of library information requests may be one approach to determining CME needs.
12.
Most search engines display some document metadata, such as title, snippet and URL, in conjunction with the returned hits
to aid users in determining documents. However, metadata is usually fragmented pieces of information that, even when combined,
does not provide an overview of a returned document. In this paper, we propose a mechanism of enriching metadata of the returned
results by incorporating automatically extracted document keyphrases with each returned hit. We hypothesize that keyphrases
of a document can better represent the major theme in that document. Therefore, by examining the keyphrases in each returned
hit, users can better predict the content of documents and the time spent on downloading and examining the irrelevant documents
will be reduced substantially.
13.
João Palotti Allan Hanbury Henning Müller Charles E. Kahn Jr. 《Information Retrieval》2016,19(1-2):189-224
The internet is an important source of medical knowledge for everyone, from laypeople to medical professionals. We investigate how these two extremes, in terms of user groups, have distinct needs and exhibit significantly different search behaviour. We make use of query logs in order to study various aspects of these two kinds of users. The logs from America Online, Health on the Net, Turning Research Into Practice and American Roentgen Ray Society (ARRS) GoldMiner were divided into three sets: (1) laypeople, (2) medical professionals (such as physicians or nurses) searching for health content and (3) users not seeking health advice. Several analyses are made focusing on discovering how users search and what they are most interested in. One outcome of our analysis is a classifier, which we built, to infer user expertise. We show the results and analyse the feature set used to infer expertise. We conclude that medical experts are more persistent, interacting more with the search engine. Also, our study reveals that, contrary to what is stated in much of the literature, the main focus of users, both laypeople and professionals, is on disease rather than symptoms. The results of this article, especially through the classifier built, could be used to detect specific user groups and then adapt search results to the user group.
14.
15.
16.
New communication technologies have increased Europe's importation of foreign, especially U.S., broadcasting programs. This essay addresses the question and protection of cultural identity from a European perspective. Four subgoals of diversity are outlined as yardsticks for the appropriateness of protection. Various activities designed to deregulate European broadcasting may potentially increase or reduce diversity. The economic and political pressure to weaken the European trusteeship model of broadcasting is likely to be successful.
17.
Blog feed search aims to identify a blog feed of recurring interest to users on a given topic. A blog feed, the retrieval
unit for blog feed search, comprises blog posts of diverse topics. This topical diversity of blog feeds often causes performance
deterioration of blog feed search. To alleviate the problem, this paper proposes several approaches based on passage retrieval,
widely regarded as effective to handle topical diversity at document level in ad-hoc retrieval. We define the global and local
evidence for blog feed search, which correspond to the document-level and passage-level evidence for passage retrieval, respectively,
and investigate their influence on blog feed search, in terms of both initial retrieval and pseudo-relevance feedback. For
initial retrieval, we propose a retrieval framework to integrate global evidence with local evidence. For pseudo-relevance
feedback, we gather feedback information from the local evidence of the top K ranked blog feeds to capture diverse and accurate information related to a given topic. Experimental results show that our
approaches using local evidence consistently and significantly outperform traditional ones.
18.
Business librarians can benefit from a deeper understanding of science and engineering resources, especially when researching technological topics, products, or industries. The authors introduce a variety of technical publications and databases useful for business research. Categories of resources described include technical encyclopedias and handbooks, periodical indexes and full-text databases, patents, technical reports, product literature, preprint and open-access repositories, conference proceedings, and dissertations. A case study based on MP3 audio technology demonstrates the types of business information that can be uncovered through a careful search of technical sources.
19.
Collection selection is a crucial function, central to the effectiveness and efficiency of a federated information retrieval
system. A variety of solutions have been proposed for collection selection adapting proven techniques used in centralised
retrieval. This paper defines a new approach to collection selection that models the topical distribution in each collection.
We describe an extended version of latent Dirichlet allocation that uses a hierarchical hyperprior to enable the different
topical distributions found in each collection to be modelled. Under the model, resources are ranked based on the topical
relationship between query and collection. By modelling collections in a low dimensional topic space, we can implicitly smooth
their term-based characterisation with appropriate terms from topically related samples, thereby dealing with the problem
of missing vocabulary within the samples. An important advantage of adopting this hierarchical model over current approaches
is that the model generalises well to unseen documents given small samples of each collection. The latent structure of each
collection can therefore be estimated well despite imperfect information for each collection such as sampled documents obtained
through query-based sampling. Experiments demonstrate that this new, fully integrated topical model is more robust than current
state of the art collection selection algorithms.
20.