Many enterprise employees may publish content outside their corporate intranet, making the Web a valuable source for identifying company experts. In this article, we thoroughly investigate the usefulness of Web search engines (WSEs) for expert search. In particular, we claim that the ranking of documentary expertise evidence provided by a WSE should also give an indication of the importance of such evidence. To investigate this, we mimic the rankings of seven different WSEs by trying to reproduce their underlying ranking mechanisms in order to search for candidate experts in the TREC CERC collection. Experimental results show that our approach is effective for expert search, and can significantly improve an intranet-based expert search engine. Moreover, when the mimicking of WSEs is further improved by training, expert search performance is also generally enhanced. Finally, we show that WSEs can be mimicked as effectively using only titles and snippets instead of the full content of WSEs’ results, while drastically reducing network costs.

Document similarity search (i.e. query by example) aims to retrieve a ranked list of documents similar to a query document in a text corpus or on the Web. Most existing approaches to similarity search first compute the pairwise similarity score between each document and the query using a retrieval function or similarity measure (e.g. Cosine), and then rank the documents by the similarity scores. In this paper, we propose a novel retrieval approach based on manifold-ranking of document blocks (i.e. blocks of coherent text about a subtopic) to re-rank a small set of documents initially retrieved by some existing retrieval function. The proposed approach can make full use of the intrinsic global manifold structure of the document blocks by propagating the ranking scores between the blocks on a weighted graph. First, the TextTiling algorithm and the VIPS algorithm are employed to segment text documents and web pages, respectively, into blocks. Then, each block is assigned a ranking score by the manifold-ranking algorithm. Lastly, a document gets its final ranking score by fusing the scores of its blocks. Experimental results on the TDT data and the ODP data demonstrate that the proposed approach can significantly improve retrieval performance over baseline approaches. The document block is shown to be a better unit than the whole document in the manifold-ranking process.
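The propagation step described above can be illustrated with a minimal sketch. It assumes the commonly used manifold-ranking update f <- alpha * S * f + (1 - alpha) * y with a symmetrically normalised similarity matrix S; the abstract does not give the exact formulation, and the fusion rule used here (a plain sum of block scores per document) is a placeholder assumption, as are all function and variable names.

```python
import math

def manifold_rank(weights, prior, alpha=0.5, iterations=100):
    """Iterate f <- alpha * S f + (1 - alpha) * y on a weighted block graph.

    weights: symmetric n x n list of lists with zero diagonal (block similarities).
    prior:   list y of length n; 1.0 for query blocks, 0.0 for all other blocks.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    # Symmetric normalisation S = D^{-1/2} W D^{-1/2}; isolated blocks keep zero rows.
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if degree[i] > 0 and degree[j] > 0:
                s[i][j] = weights[i][j] / math.sqrt(degree[i] * degree[j])
    scores = list(prior)
    for _ in range(iterations):
        scores = [
            alpha * sum(s[i][j] * scores[j] for j in range(n)) + (1 - alpha) * prior[i]
            for i in range(n)
        ]
    return scores

def fuse_block_scores(block_scores, block_to_doc):
    """Sum block scores per document (an assumed fusion rule, not the paper's)."""
    doc_scores = {}
    for score, doc in zip(block_scores, block_to_doc):
        doc_scores[doc] = doc_scores.get(doc, 0.0) + score
    return doc_scores

# Toy graph: block 0 is the query block, blocks 1-2 belong to document "A",
# block 3 belongs to document "B". Similarities are made up for illustration.
weights = [
    [0.0, 0.9, 0.0, 0.1],
    [0.9, 0.0, 0.8, 0.0],
    [0.0, 0.8, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0],
]
block_scores = manifold_rank(weights, prior=[1.0, 0.0, 0.0, 0.0])
doc_scores = fuse_block_scores(block_scores, ["query", "A", "A", "B"])
```

In this toy graph, block 2 has no direct link to the query block but still receives a score through block 1, which is the effect that propagation on the graph is meant to capture.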

Statistical language models have been successfully applied to many information retrieval tasks, including expert finding: the process of identifying experts given a particular topic. In this paper, we introduce and detail language modeling approaches that integrate the representation, association and search of experts using various textual data sources into a generative probabilistic framework. This provides a simple, intuitive, and extensible theoretical framework to underpin research into expertise search. To demonstrate the flexibility of the framework, two search strategies to find experts are modeled that incorporate different types of evidence extracted from the data, before being extended to also incorporate co-occurrence information. The models proposed are evaluated in the context of enterprise search systems within an intranet environment, where it is reasonable to assume that the list of experts is known, and that data to be mined is publicly accessible. Our experiments show that excellent performance can be achieved by using these models in such environments, and that this theoretical and empirical work paves the way for future principled extensions.
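As a rough illustration of how a generative framework can rank experts, the sketch below scores each candidate by summing, over associated documents, the query likelihood of the document times the strength of the document-candidate association. This document-centric form and the Jelinek-Mercer smoothing are assumptions chosen for illustration; the abstract does not specify the two strategies, and all names and data are hypothetical.

```python
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_counts, collection_size, lam=0.5):
    """p(q|d) under a unigram model with Jelinek-Mercer smoothing (assumed)."""
    doc_counts = Counter(doc_terms)
    prob = 1.0
    for term in query_terms:
        p_doc = doc_counts[term] / len(doc_terms) if doc_terms else 0.0
        p_coll = collection_counts[term] / collection_size
        prob *= (1 - lam) * p_doc + lam * p_coll
    return prob

def rank_experts(query_terms, documents, associations, lam=0.5):
    """Score each candidate by sum over documents of p(q|d) * p(d|candidate)."""
    collection_counts = Counter(t for terms in documents.values() for t in terms)
    collection_size = sum(collection_counts.values())
    scores = {}
    for candidate, doc_weights in associations.items():
        scores[candidate] = sum(
            query_likelihood(query_terms, documents[d], collection_counts,
                             collection_size, lam) * weight
            for d, weight in doc_weights.items()
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: two documents, each associated with one candidate.
documents = {
    "d1": ["language", "model", "retrieval"],
    "d2": ["protein", "folding", "simulation"],
}
associations = {"alice": {"d1": 1.0}, "bob": {"d2": 1.0}}
ranking = rank_experts(["language", "model"], documents, associations)
```

Smoothing with collection statistics keeps a candidate's score above zero even when an associated document misses a query term.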

With ever more information available to end users, search engines have become the most powerful tools for obtaining useful information scattered across the Web. However, even the most renowned search engines commonly return result sets containing pages of little use to the user. Research on semantic search aims to improve traditional information search and retrieval methods, in which the basic relevance criteria rely primarily on the presence of query keywords within the returned pages. This work explores different semantics-based relevance ranking approaches that are considered appropriate for retrieving relevant information. In this paper, various pilot projects and their outcomes are investigated with respect to the methodologies adopted and their most distinctive ranking characteristics. An overview of selected approaches is presented, together with a comparison by means of classification criteria. With the help of this comparison, some common concepts and outstanding features are identified.

In this paper, the scalability and quality of the contextual document clustering (CDC) approach are demonstrated for large data sets using the whole Reuters Corpus Volume 1 (RCV1) collection. CDC is a form of distributional clustering, which automatically discovers contexts of narrow scope within a document corpus. These contexts act as attractors for clustering documents that are semantically related to each other. Once clustered, the documents are organized into a minimum spanning tree so that the topical similarity of adjacent documents within this structure can be assessed. The pre-defined categories from three different document category sets are used to assess the quality of CDC in terms of its ability to group and structure semantically related documents given the contexts. Quality is evaluated based on two factors: the category overlap between adjacent documents within a cluster, and how well a representative document categorizes all the other documents within a cluster. As the RCV1 collection was collated in a time-ordered fashion, it was possible to assess the stability of clusters formed from documents within one time interval when presented with new unseen documents at subsequent time intervals. We demonstrate that CDC is a powerful and scalable technique with the ability to create stable clusters of high quality. Additionally, to our knowledge this is the first time that a collection as large as RCV1 has been analyzed in its entirety using a static clustering approach.

This paper describes an applied document filtering system embedded in an operational watch center that monitors disease outbreaks worldwide. At the time of writing, the system effectively supported monitoring of 23 geographic regions by filtering documents from several thousand daily news sources in 11 different languages. The paper describes the filtering algorithm and statistical procedures for estimating Precision and Recall in an operational environment, summarizes operational performance data, and suggests lessons learned for other applications of document filtering technology. Overall, the results are interpreted as supporting the general utility of document filtering and information retrieval technology, and the paper offers recommendations for future applications of this technology.
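The abstract does not say which statistical procedures are used. As one plausible illustration only, the sketch below estimates Precision from a random sample of delivered documents that an analyst has judged, with a normal-approximation confidence interval. The function name, the sampling setup, and the choice of interval are all assumptions.

```python
import math

def estimate_precision(relevant_in_sample, sample_size, z=1.96):
    """Estimate precision from a judged random sample of delivered documents.

    Returns (point estimate, lower bound, upper bound) using a normal
    approximation; z=1.96 corresponds to a roughly 95% interval.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = relevant_in_sample / sample_size
    half_width = z * math.sqrt(p * (1 - p) / sample_size)
    # Clip the interval to the valid probability range.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical example: 80 of 100 sampled documents were judged relevant.
precision, lower, upper = estimate_precision(80, 100)
```

Estimating Recall in the same way would additionally require sampling from the documents that the filter rejected, which is usually the harder part in an operational setting.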

This paper presents a semantically rich document representation model for automatically classifying financial documents into predefined categories using deep learning. The model architecture consists of two main modules: document representation and document classification. In the first module, a document is enriched with semantics using background knowledge provided by an ontology and through the acquisition of its relevant terminology. Terminology integrated into the ontology extends the capabilities of semantically rich document representations with in-depth coverage of concepts, thereby capturing the whole conceptualization involved in documents. The semantically rich representations obtained from the first module serve as input to the document classification module, which aims to find the most appropriate category for each document through deep learning. Three different deep learning networks, each belonging to a different category of machine learning techniques, are used for ontological document classification with a real-life ontology. Multiple simulations are carried out with various deep neural network configurations, and our findings reveal that a feedforward network with three hidden layers and 1024 neurons obtains the highest document classification performance on the INFUSE dataset. The performance in terms of F1 score is further increased by almost five percentage points, to 78.10%, for the same network configuration when the relevant terminology integrated into the ontology is applied to enrich the document representation. Furthermore, we conducted a comparative performance evaluation using various state-of-the-art document representation approaches and classification techniques, including shallow and conventional machine learning classifiers.

In this paper, we present a novel clustering algorithm to generate a number of candidate clusters from other web search results. The candidate clusters yield a connective relation among the clusters, and this relation is semantic. Moreover, the algorithm has the following attractive properties: (1) it can be applied to multilingual web documents, (2) it improves the clustering performance of any search engine, (3) its unsupervised learning can automatically identify potentially relevant knowledge without using any corpus, and (4) clustering results are generated on the fly and fitted into search engines.

We demonstrate effective new methods of document ranking based on lexical cohesive relationships between query terms. The proposed methods rely solely on the lexical relationships between the original query terms, and do not involve query expansion or relevance feedback. Two types of lexical cohesive relationship between query terms are used in document ranking: the short-distance collocation relationship between query terms, and the long-distance relationship, determined by the collocation of query terms with other words. The methods are evaluated on TREC corpora, and show improvements over baseline systems.
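To make the short-distance collocation idea concrete, the sketch below counts pairs of distinct query terms that occur within a fixed window of each other in a document. The window size, the counting rule, and the function name are illustrative assumptions; the abstract does not describe how the methods turn such evidence into a ranking score.

```python
def count_short_distance_collocations(doc_terms, query_terms, window=3):
    """Count pairs of distinct query terms at most `window` positions apart."""
    query_set = set(query_terms)
    hits = [(pos, term) for pos, term in enumerate(doc_terms) if term in query_set]
    count = 0
    for a in range(len(hits)):
        for b in range(a + 1, len(hits)):
            pos_a, term_a = hits[a]
            pos_b, term_b = hits[b]
            if pos_b - pos_a > window:
                break  # hits are sorted by position, so later ones are farther away
            if term_a != term_b:
                count += 1
    return count

# Hypothetical example document and query.
doc = "the patent retrieval system supports retrieval of old patent filings".split()
pairs = count_short_distance_collocations(doc, ["patent", "retrieval"], window=3)
```

A document in which the query terms repeatedly appear close together would receive a higher count than one in which they are scattered.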

A comparative study of two types of patent retrieval tasks, technology survey and invalidity search, using the NTCIR-3 and -4 test collections is described, with a focus on pseudo-feedback effectiveness and different retrieval models. Invalidity searches are peculiar to patent retrieval tasks and feature small numbers of relevant documents and long queries. Different behaviors of effectiveness are observed when applying different retrieval models and pseudo-feedback. These different behaviors are analyzed in terms of the “weak cluster hypothesis”, i.e., terminological cohesiveness through relevant documents.

Most document clustering algorithms operate in a high-dimensional bag-of-words space. The inherent presence of noise in such a representation degrades the performance of most of these approaches. In this paper we investigate an unsupervised dimensionality reduction technique for document clustering. This technique is based on the assumption that terms co-occurring in the same context with the same frequencies are semantically related. On the basis of this assumption we first find term clusters using a classification version of the EM algorithm. Documents are then represented in the space of these term clusters and a multinomial mixture model (MM) is used to build document clusters. We empirically show on four document collections, Reuters-21578, Reuters RCV2-French, 20Newsgroups and WebKB, that this new text representation noticeably increases the performance of the MM model. By relating the proposed approach to the Probabilistic Latent Semantic Analysis (PLSA) model we further propose an extension of the latter in which an extra latent variable allows the model to co-cluster documents and terms simultaneously. We show on these four datasets that the proposed extended version of the PLSA model produces statistically significant improvements with respect to two clustering measures over all variants of the original PLSA and the MM models.

Large-scale web search engines are composed of multiple data centers that are geographically distant from each other. Typically, a user query is processed in a data center that is geographically close to the origin of the query, over a replica of the entire web index. Compared to a centralized, single-center search engine, this architecture offers lower query response times as the network latencies between the users and data centers are reduced. However, it does not scale well with increasing index sizes and query traffic volumes because queries are evaluated on the entire web index, which has to be replicated and maintained in all data centers. As a remedy to this scalability problem, we propose a document replication framework in which documents are selectively replicated on data centers based on regional user interests. Within this framework, we propose three different document replication strategies, each optimizing a different objective: reducing the potential search quality loss, the average query response time, or the total query workload of the search system. For all three strategies, we consider two alternative types of capacity constraints on index sizes of data centers. Moreover, we investigate the performance impact of query forwarding and result caching. We evaluate our strategies via detailed simulations, using a large query log and a document collection obtained from the Yahoo! web search engine.

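From the perspective of innovation capability structure, this paper deconstructs the boundary-spanning search behavior of Chinese firms along organizational and technological dimensions and analyzes its differentiated effects on organizational ambidextrous capabilities. The findings are as follows. In practice, the boundary-spanning search of Chinese firms comprises science-and-technology-driven and market-driven search on the organizational dimension, and generic-technology-oriented and product-technology-oriented search on the technological dimension. Search behaviors on different dimensions affect ambidextrous capabilities differently: science-and-technology-driven and generic-technology-oriented search positively affect exploration capability, whereas market-driven, generic-technology-oriented, and product-technology-oriented search positively promote exploitation capability. Departing from the theoretical presupposition of March (1991) about ambidexterity, the study finds that the two capabilities are not absolutely mutually exclusive or irreconcilable: organizational exploration capability has a significant positive effect on exploitation capability. The conclusions extend the research perspective and dimensional scope of organizational search theory and enrich the empirical findings on organizational ambidexterity.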

Despite a number of studies looking at Web experience and Web searching tactics and behaviours, the specific relationships between experience and cognitive search strategies have not been widely researched. This study investigates how the cognitive search strategies of 80 participants might vary with Web experience as they engaged in two researcher-defined tasks and two participant-defined information seeking tasks. Each of the two researcher-defined tasks and participant-defined tasks included a directed search task and a general-purpose browsing task. While there were almost no significant performance differences between experience levels on any of the four tasks, there were significant differences in the use of cognitive search strategies. Participants with higher levels of Web experience were more likely to use “Parallel player”, “Parallel hub-and-spoke”, “Known address search domain” and “Known address” strategies, whereas participants with lower levels of Web experience were more likely to use “Virtual tourist”, “Link-dependent”, “To-the-point”, “Sequential player”, “Search engine narrowing”, and “Broad first” strategies. The patterns of use and differences between researcher-defined and participant-defined tasks and between directed search tasks and general-purpose browsing tasks are also discussed, although the distribution of search strategies by Web experience was not statistically significant for each individual task.

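Song Jing, Sun Yonglei, Xie Yongping. Science Research Management (《科研管理》), 2015, 36(11): 47-54
This paper explores the main structural types of firms' networked search in the context of cooperative innovation networks, and the mechanisms through which they operate. First, drawing on the research methods of grounded theory and three rounds of in-depth interviews, it finds that knowledge search, relationship search, and routine search are the main structural types of firms' network search. Second, a questionnaire survey of, among others, the cooperation networks of high-tech firms in the Xi'an High-tech Zone is used to test these mechanisms empirically. The results show that a moderate level of relationship search helps firms achieve higher cooperative innovation performance; that knowledge search not only improves cooperative innovation performance but also partially mediates the relation between relationship search and cooperative innovation performance; and that the positive relation between routine search and cooperative innovation performance is empirically supported, with routine search also positively moderating the effect of knowledge search on cooperative innovation performance.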

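A Preliminary Study of the Structure of Chinese Search Engines   Cited by: 4 (self-citations: 0, citations by others: 4)
Zhu Hua. Information Science (《情报科学》), 2001, 19(11): 1210-1212
With the further development of the Internet, the rapid growth of Chinese-language information online has drawn increasing attention to Chinese search engines. This paper gives a preliminary analysis of the structure of Chinese search engines, dividing it into four main modules: a web page collection module, a web page indexing module, a query module, and a user interface. It also explains the working principles and techniques of each module.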

To improve search engine effectiveness, we have observed an increased interest in gathering additional feedback about users’ information needs that goes beyond the queries they type in. Adaptive search engines use explicit and implicit feedback indicators to model users or search tasks. In order to create appropriate models, it is essential to understand how users interact with search engines, including the determining factors of their actions. Using eye tracking, we extend this understanding by analyzing the sequences and patterns with which users evaluate the query results returned to them when using Google. We find that the query result abstracts are viewed in the order of their ranking in only about one fifth of the cases, and only an average of about three abstracts per result page are viewed at all. We also compare search behavior variability with respect to different classes of users and different classes of search tasks to reveal whether user models or task models may be greater predictors of behavior. We discover that gender and task significantly influence the different kinds of search behaviors discussed here. The results suggest improvements to query-based search interface designs with respect to both their use of space and workflow.

Previous studies have repeatedly demonstrated that the relevance of a citing document is related to the number of times the source document is cited. Despite the ease with which electronic documents would permit the incorporation of this information into citation-based document search and retrieval systems, the possibilities of repeated citations remain untapped. Part of this under-utilization may be due to the fact that very little is known regarding the pattern of repeated citations in scholarly literature or how this pattern may vary as a function of journal, academic discipline or self-citation. The current research addresses these unanswered questions in order to facilitate the future incorporation of repeated citation information into document search and retrieval systems. Using data mining of electronic texts, the citation characteristics of nine different journals, covering three different academic fields (economics, computing, and medicine & biology), were characterized. It was found that the frequency (f) with which a reference is cited N or more times within a document is consistent across the sampled journals and academic fields. Self-citation causes an increase in frequency, and this effect becomes more pronounced for large N. The objectivity, automatability, and insensitivity of repeated citations to journal and discipline present powerful opportunities for improving citation-based document search.
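The quantity f described above can be sketched as follows. The example assumes that f is the fraction of a document's references cited N or more times in its body, and that in-text citations appear as single bracketed numbers such as [3]. Both are assumptions made for illustration, since the abstract does not give the exact definition or the citation formats handled. Grouped citations such as [1,2] are not parsed by this sketch.

```python
import re
from collections import Counter

def repeated_citation_frequency(text, max_n=3):
    """Fraction of cited references that are cited N or more times, N = 1..max_n."""
    counts = Counter(int(num) for num in re.findall(r"\[(\d+)\]", text))
    total = len(counts)
    if total == 0:
        return {n: 0.0 for n in range(1, max_n + 1)}
    return {
        n: sum(1 for c in counts.values() if c >= n) / total
        for n in range(1, max_n + 1)
    }

# Hypothetical body text: reference 1 is cited three times, 2 and 3 once each.
body = "Early work [1] and [2]. Later results [1] and [3] confirm this [1]."
freq = repeated_citation_frequency(body)
```

A retrieval system could use such per-reference counts as a weight on citation links, so that a source cited several times in a document counts for more than one cited in passing.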

