Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
Rocchio relevance feedback and latent semantic indexing (LSI) are well-known extensions of the vector space model for information retrieval (IR). This paper analyzes the statistical relationship between these extensions. The analysis focuses on each method’s basis in least-squares optimization. Noting that LSI and Rocchio relevance feedback both alter the vector space model in a way that is in some sense least-squares optimal, we ask: what is the relationship between LSI’s and Rocchio’s notions of optimality? What does this relationship imply for IR? Using an analytical approach, we argue that Rocchio relevance feedback is optimal if we understand retrieval as a simplified classification problem. On the other hand, LSI’s motivation comes to the fore if we understand it as a biased regression technique, where projection onto a low-dimensional orthogonal subspace of the documents reduces model variance.
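For readers unfamiliar with Rocchio relevance feedback, the textbook update moves the query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant ones. The sketch below is illustrative only: the function name `rocchio` and the default weights `alpha`, `beta`, and `gamma` are assumptions, not values taken from the abstract.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a Rocchio-adjusted query vector.

    query       -- list of term weights
    relevant    -- list of document vectors judged relevant
    nonrelevant -- list of document vectors judged non-relevant
    The alpha/beta/gamma defaults are illustrative assumptions.
    """
    def centroid(vectors):
        # Mean vector; an empty set contributes a zero vector.
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel_centroid = centroid(relevant)
    non_centroid = centroid(nonrelevant)
    return [
        alpha * q + beta * r - gamma * n
        for q, r, n in zip(query, rel_centroid, non_centroid)
    ]


# Toy example over a three-term vocabulary, with all weights set to 1.
new_query = rocchio(
    [1.0, 0.0, 0.0],
    relevant=[[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]],
    nonrelevant=[[0.0, 0.0, 1.0]],
    alpha=1.0, beta=1.0, gamma=1.0,
)
print(new_query)  # [2.0, 0.5, -0.5]
```

Some implementations clip negative weights to zero after the update; this sketch leaves them unchanged.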

2.
In Information Retrieval (IR), the efficient indexing of terabyte-scale and larger corpora is still a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. Moreover, we evaluate these indexing strategies by implementing them in an existing IR framework and performing experiments using the Hadoop MapReduce implementation, in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies and, for the most efficient strategy, how it scales with respect to corpus size and processing power. Our results attest to both the importance of minimising data transfer between machines for IO-intensive tasks like indexing, and the suitability of the per-posting list MapReduce indexing strategy, in particular for indexing at terabyte scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing.
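As a rough illustration of how an inverted index can be expressed as map and reduce steps, the sketch below simulates both phases in a single Python process. It is a generic example, not the paper's per-posting list strategy and not Hadoop code; the function names `map_document`, `reduce_postings`, and `build_index` are hypothetical.

```python
from collections import Counter, defaultdict


def map_document(doc_id, text):
    """Map step: emit one (term, (doc_id, term_frequency)) pair per distinct term."""
    counts = Counter(text.lower().split())
    for term, tf in counts.items():
        yield term, (doc_id, tf)


def reduce_postings(term, postings):
    """Reduce step: merge all postings for a term into one sorted posting list."""
    return term, sorted(postings)


def build_index(docs):
    """Simulate the shuffle phase by grouping mapper output by term in memory."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_document(doc_id, text):
            grouped[term].append(posting)
    return dict(reduce_postings(term, plist) for term, plist in grouped.items())


index = build_index({1: "to be or not to be", 2: "to index"})
print(index["to"])  # [(1, 2), (2, 1)]
```

Emitting one pair per distinct term per document, rather than one per token, already cuts the data passed between the two phases, which is the kind of transfer cost the abstract identifies as important.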

3.
4.
This paper proposes two approaches to text summarization: a modified corpus-based approach (MCBA) and an LSA-based T.R.M. approach (LSA + T.R.M.). The first is a trainable summarizer that takes several features into account, including position, positive keyword, negative keyword, centrality, and resemblance to the title, to generate summaries. Two new ideas are exploited: (1) sentence positions are ranked to emphasize the significance of different sentence positions, and (2) the score function is trained by a genetic algorithm (GA) to obtain a suitable combination of feature weights. The second uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map. We evaluate LSA + T.R.M. both with single documents and at the corpus level to investigate the competence of LSA in text summarization. The two approaches were measured at several compression rates on a corpus of 100 political articles. At a compression rate of 30%, the average f-measures were 49% for MCBA, 52% for MCBA + GA, and 44% and 40% for LSA + T.R.M. at the single-document and corpus levels, respectively.

5.
To achieve high performance, previous work on FAQ retrieval used high-level knowledge bases or handcrafted rules. However, constructing these knowledge bases and rules is time-consuming and laborious, and it must be repeated whenever the application domain changes. To overcome this problem, we propose a high-performance FAQ retrieval system that uses only users’ query logs as knowledge sources. At indexing time, the proposed system efficiently clusters users’ query logs using classification techniques based on latent semantic analysis. At retrieval time, the proposed system smooths FAQs using the query log clusters. In the experiments, the proposed system outperformed conventional information retrieval systems in FAQ retrieval. Based on various experiments, we found that the proposed system could alleviate critical lexical disagreement problems in short document retrieval. In addition, we believe that the proposed system is more practical and reliable than previous FAQ retrieval systems because it uses only data-driven methods without high-level knowledge sources.

6.
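Liao Wenbin (廖文彬), 《科技广场》, 2009, (11): 48-49
Because the singular values of a matrix have a number of useful properties, they are widely used in many fields. However, traditional methods for computing singular values rarely give good results, which makes singular values considerably harder to apply in practice. This paper uses sampling-based estimation to estimate the singular values of large-scale matrices, exploring a method for computing the singular values of large matrices.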

7.
8.
9.
The paper discusses the notion of steps in indexing and shows that the document-centered approach to indexing is prevalent. It argues that this approach is problematic because it blocks out context-dependent factors in the indexing process. A domain-centered approach to indexing is presented as an alternative, and the paper discusses how this approach includes a broader range of analyses and requires a new set of actions: analysis of the domain, the users, and the indexers. The paper concludes that the two-step procedure is insufficient to explain the indexing process and suggests that the domain-centered approach offers a guide for indexers that can help them manage the complexity of indexing.

10.
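Co-word cluster analysis groups a discipline's subject terms by clustering and thereby supports analysis of the discipline's structure. Clustering without cluster centers produces cluster partitions that differ somewhat from the way subject terms are actually distributed around research topics, and this has a substantial negative effect on cluster analysis. Assigning a core term to each cluster and including the core terms in the co-word matrix helps define each cluster's concept correctly, helps analyze the relationships between clusters, and can even correct some problems in the clustering algorithm. The author's contribution is to resolve the problems of center-less clustering by assigning core terms to clusters.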
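The co-word matrix that this kind of analysis starts from can be built by counting how often two keywords appear on the same paper. The following is a minimal sketch under that assumption; the function name `coword_counts` and the sample keyword lists are hypothetical, and the clustering and core-term steps described in the abstract are not shown.

```python
from collections import Counter
from itertools import combinations


def coword_counts(keyword_lists):
    """Count keyword co-occurrences across papers.

    Each entry of keyword_lists holds the keywords of one paper. The result
    maps an alphabetically ordered keyword pair to the number of papers in
    which both keywords appear, i.e. the upper triangle of a co-word matrix.
    """
    counts = Counter()
    for keywords in keyword_lists:
        # Deduplicate within a paper, then count every unordered pair once.
        for first, second in combinations(sorted(set(keywords)), 2):
            counts[(first, second)] += 1
    return counts


papers = [
    ["retrieval", "indexing", "clustering"],
    ["retrieval", "indexing"],
    ["clustering", "retrieval"],
]
matrix = coword_counts(papers)
print(matrix[("indexing", "retrieval")])  # 2
```

Storing only ordered pairs keeps the structure sparse, which is convenient when most keyword pairs never co-occur.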

11.
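Co-word cluster analysis of master's theses in information science in China   Cited by: 18 (self-citations: 0, citations by others: 18)
Using 624 information science master's theses indexed in the 《CNKI中国优秀硕士学位论文全文数据库》 (CNKI full-text database of outstanding Chinese master's theses), the study performs co-word cluster analysis on high-frequency keywords, examines the relationships among them, and identifies the research hotspots of information science master's theses.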

12.
13.
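Wang Yan (王艳), 《科技广场》, 2009, (5): 79-81
In wireless sensor networks (WSNs), cluster-based routing protocols perform well in energy saving and data fusion, but the reliability and security of network communication depend heavily on the cluster heads, which creates a serious security risk. To address this problem, the paper proposes a new intrusion-detection-based security system architecture for wireless sensor networks and describes the operating principles of a WSN security system built on that architecture.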

14.
15.
This paper proposes a novel hierarchical learning strategy to deal with the data sparseness problem in semantic relation extraction by modeling the commonality among related classes. For each class in the hierarchy, whether manually predefined or automatically clustered, a discriminative function is determined in a top-down way. Because an upper-level class normally has many more positive training examples than a lower-level class, its discriminative function can be determined more reliably and can guide the learning of the lower-level discriminative function, which might otherwise suffer from limited training data. Two classifier learning approaches, the simple perceptron algorithm and state-of-the-art Support Vector Machines, are applied using the hierarchical learning strategy. Moreover, several kinds of class hierarchies, either manually predefined or automatically clustered, are explored and compared. Evaluation on the ACE RDC 2003 and 2004 corpora shows that the hierarchical learning strategy substantially improves performance on the least- and medium-frequency relations.

16.
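Digital image watermarking, as a potentially effective way to protect the copyright of image resources in digital libraries, has attracted growing attention from researchers. To protect image copyright effectively, this paper reviews the current state and development trends of image copyright protection and then proposes a new, simple, and convenient image watermarking method in the SVD domain. The method extracts the watermark without the original image, which makes it practical. Experimental results show that the watermark is highly transparent (imperceptible) and that the method is robust to common image-processing attacks such as JPEG compression, brightness adjustment, contrast adjustment, and added noise. The method can therefore be applied effectively to the copyright protection of image resources.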

17.
This paper presents a size reduction method for the inverted file, the most suitable indexing structure for an information retrieval system (IRS). We notice that in an inverted file the document identifiers for a given word are usually clustered. While this clustering property can be used to reduce the size of the inverted file, good compression and fast decompression must both be available. In this paper, we present a method that facilitates the coding and decoding processes of interpolative coding using recursion elimination and loop unwinding. We call this method unique-order interpolative coding. It can calculate the lower and upper bounds of every document identifier for a binary code without a recursive process, so decompression time can be greatly reduced. Moreover, it can also exploit document identifier clustering to compress the inverted file efficiently. Compared with other well-known compression methods, our method provides fast decoding and excellent compression. It can also be used to support a self-indexing strategy. Our work therefore provides a feasible way to build a fast and space-economical IRS.
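To make the "lower and upper bounds of every document identifier" concrete, the sketch below computes them with the standard recursive formulation of binary interpolative coding. It is not the paper's unique-order variant, which removes the recursion; the function name `interpolative_bounds` and the sample posting list are assumptions made for illustration.

```python
def interpolative_bounds(doc_ids, lo, hi):
    """Return (doc_id, lower, upper) triples in recursive coding order.

    doc_ids must be strictly increasing and lie within [lo, hi]. Each
    identifier would be written as a binary code relative to its range, so a
    range where lower == upper costs zero bits.
    """
    triples = []

    def recurse(left, right, range_lo, range_hi):
        if left > right:
            return
        mid = (left + right) // 2
        # (mid - left) identifiers must fit below the middle one and
        # (right - mid) identifiers above it, which tightens its range.
        lower = range_lo + (mid - left)
        upper = range_hi - (right - mid)
        triples.append((doc_ids[mid], lower, upper))
        recurse(left, mid - 1, range_lo, doc_ids[mid] - 1)
        recurse(mid + 1, right, doc_ids[mid] + 1, range_hi)

    recurse(0, len(doc_ids) - 1, lo, hi)
    return triples


bounds = interpolative_bounds([3, 8, 9, 11, 12, 13, 17], 1, 20)
print(bounds[0])  # (11, 4, 17)
```

In this example the clustered identifiers 11, 12, and 13 leave identifier 12 with the range 12 to 12, so it needs no bits at all, which is how clustering translates into compression.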

18.
GRANT is an expert system for finding sources of funding given research proposals. Its search method, constrained spreading activation, makes inferences about the goals of the user and thus finds information that the user did not explicitly request but that is likely to be useful. The architecture of GRANT and the implementation of constrained spreading activation are described, and GRANT's performance is evaluated.
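The general idea of constrained spreading activation can be sketched as a breadth-first traversal of a concept graph that stops spreading when a constraint is violated. The abstract does not say which constraints GRANT applies, so the distance and fan-out limits below, the function name `spread_activation`, and the toy graph are all assumptions.

```python
from collections import deque


def spread_activation(graph, start, max_distance=2, max_fan_out=3):
    """Return {node: distance} for nodes activated from start.

    graph maps a node to a list of neighbouring nodes. Activation stops at
    nodes that are max_distance links from the start (distance constraint)
    and does not pass through non-start nodes with more than max_fan_out
    neighbours (fan-out constraint). Both constraints are illustrative.
    """
    activated = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        distance = activated[node]
        neighbours = graph.get(node, [])
        if distance >= max_distance:
            continue  # distance constraint reached
        if node != start and len(neighbours) > max_fan_out:
            continue  # highly connected node: too general to spread through
        for neighbour in neighbours:
            if neighbour not in activated:
                activated[neighbour] = distance + 1
                queue.append(neighbour)
    return activated


concepts = {
    "proposal": ["immunology", "mice"],
    "immunology": ["agency_a"],
    "mice": ["topic_1", "topic_2", "topic_3", "topic_4"],
    "agency_a": ["agency_b"],
}
print(spread_activation(concepts, "proposal"))
```

Here `agency_a` is reached even though the proposal never mentions it, while the very general node `mice` is activated but does not pass activation on.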

19.
20.
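National and corporate strategies for the standardization of mobile communication technology   Cited by: 4 (self-citations: 0, citations by others: 4)
In the development of mobile communications, competition over technical standards is the most important aspect of competition in the modern mobile communications market. The history of mobile communication technology standardization shows that national and corporate strategies influence and interact with each other, with national strategy playing the key role. Through theoretical analysis and a comparative study of international experience, this paper derives the policy implications for China's national strategy on 3G technology standardization.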
