1.
This paper reports our experimental investigation into the use of more realistic concepts, as opposed to simple keywords, for document retrieval, and into reinforcement learning for improving document representations so that useful documents are retrieved for relevant queries. The framework was based on the theory of Formal Concept Analysis (FCA) and lattice theory. The features or concepts of each document (and query), formulated according to FCA, are represented in a separate concept lattice and are weighted separately with respect to the individual documents they represent. The document retrieval process is viewed as a continuous conversation between queries and documents, during which documents are allowed to learn a set of significant concepts that aid their retrieval. The learning strategy was based on relevance feedback information that strengthens the similarity of relevant documents and weakens that of non-relevant documents. Test results obtained on the Cranfield collection show a significant increase in average precision as the system learns from experience.
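The FCA machinery behind this abstract can be made concrete with a small sketch. A formal concept of a document-term context is a pair (extent, intent) that is closed in both directions. The brute-force enumeration below is only an illustration for tiny contexts (real FCA systems use algorithms such as NextClosure); the variable names are illustrative, not from the paper.

```python
from itertools import combinations

def formal_concepts(context):
    # context: dict mapping each document to its set of attribute terms.
    # A formal concept is a pair (extent, intent) such that the intent is
    # exactly the set of attributes shared by every document in the extent,
    # and the extent is exactly the set of documents having every attribute
    # in the intent. Enumerating all attribute subsets and closing them
    # yields every concept of the context.
    docs = list(context)
    attrs = sorted(set().union(*context.values()))

    def extent(intent):
        return {d for d in docs if intent <= context[d]}

    def intent_of(ext):
        return set(attrs) if not ext else set.intersection(*(context[d] for d in ext))

    concepts = set()
    for r in range(len(attrs) + 1):
        for subset in combinations(attrs, r):
            ext = extent(set(subset))
            concepts.add((frozenset(ext), frozenset(intent_of(ext))))
    return concepts
```

Each concept would correspond to a node in the document's concept lattice, to which the paper then attaches per-document weights.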
5.
Text categorization is an important research area that has been receiving much attention due to the growth of on-line information and of the Internet. Automated text categorization is generally cast as a multi-class classification problem. Much previous work focused on binary document classification problems. Support vector machines (SVMs) excel at binary classification, but the elegant theory behind the large-margin hyperplane cannot be easily extended to multi-class text classification. In addition, training time and scaling are also important concerns. On the other hand, techniques that extend naturally to multi-class classification are generally not as accurate as SVMs. This paper presents a simple and efficient solution to multi-class text categorization. Classification problems are first formulated as optimization via discriminant analysis. Text categorization is then cast as the problem of finding coordinate transformations that reflect the inherent similarity in the data. Whereas most previous approaches decompose a multi-class classification problem into multiple independent binary classification tasks, the proposed approach enables direct multi-class classification. Using the generalized singular value decomposition (GSVD), a coordinate transformation that reflects the inherent class structure indicated by the generalized singular values is identified. Extensive experiments demonstrate the efficiency and effectiveness of the proposed approach.
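The discriminant-analysis formulation can be illustrated with a classical linear discriminant sketch. This is a hedged approximation: the paper solves the criterion via GSVD, which also handles singular scatter matrices, whereas this toy version regularizes the within-class scatter and solves the generalized eigenproblem directly; all names are illustrative.

```python
import numpy as np

def lda_transform(X, y, k):
    # Classical linear discriminant analysis: find a projection that
    # maximizes between-class scatter relative to within-class scatter.
    # (The paper's GSVD route optimizes the same criterion without
    # assuming invertibility; this dense sketch adds a small ridge.)
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mu).reshape(-1, 1)
        Sb += len(Xc) * (diff @ diff.T)
    # Generalized eigenproblem Sb v = lambda Sw v, regularized for stability;
    # keep the k leading discriminant directions.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:k]]
```

Projecting documents with the returned matrix yields coordinates in which class structure is emphasized, which is the role the generalized singular values play in the paper's direct multi-class setting.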
6.
Text mining techniques for patent analysis
Patent documents contain important research results. However, they are lengthy and rich in technical terminology, so that analyzing them takes a great deal of human effort. Automatic tools for assisting patent engineers or decision makers in patent analysis are in great demand. This paper describes a series of text mining techniques that conform to the analytical process used by patent analysts. These techniques include text segmentation, summary extraction, feature selection, term association, cluster generation, topic identification, and information mapping. Both efficiency and effectiveness are considered in the design of these techniques. Important features of the proposed methodology include a rigorous approach to verifying the usefulness of segment extracts as document surrogates, a corpus- and dictionary-free algorithm for keyphrase extraction, an efficient co-word analysis method that can be applied to large volumes of patents, and an automatic procedure for creating generic cluster titles to ease result interpretation. An evaluation of these techniques was conducted. The results confirm that the machine-generated summaries preserve more important content words for classification than some other sections. To demonstrate feasibility, the proposed methodology was applied to a real-world patent set for domain analysis and mapping, which shows that the approach is more effective than existing classification systems. The attempt in this paper to automate the whole process not only helps create final patent maps for topic analysis, but also facilitates or improves other patent analysis tasks such as patent classification, organization, knowledge sharing, and prior-art searches.
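The co-word analysis step mentioned above reduces, at its core, to counting how often two keyphrases co-occur within the same patent. A minimal sketch, assuming each document is already a list of extracted keyphrases (the paper adds normalization and indexing for large patent sets):

```python
from collections import Counter
from itertools import combinations

def coword_matrix(documents):
    # documents: iterable of keyphrase lists, one per patent.
    # Count, for every unordered pair of keyphrases, the number of
    # patents in which both occur; strongly associated pairs are the
    # raw material for cluster generation and topic identification.
    pairs = Counter()
    for terms in documents:
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs
```

Sorting each document's term set makes every pair canonical, so `(a, b)` and `(b, a)` accumulate into one counter entry.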
7.
《Information processing & management》2005,41(1):75-95
This paper proposes two approaches to text summarization: a modified corpus-based approach (MCBA) and an LSA-based T.R.M. approach (LSA + T.R.M.). The first is a trainable summarizer, which takes into account several features, including position, positive keywords, negative keywords, centrality, and resemblance to the title, to generate summaries. Two new ideas are exploited: (1) sentence positions are ranked to emphasize the significance of different sentence positions, and (2) the score function is trained by a genetic algorithm (GA) to obtain a suitable combination of feature weights. The second approach uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus, and uses semantic sentence representations to construct a semantic text relationship map. We evaluate LSA + T.R.M. both on single documents and at the corpus level to investigate the competence of LSA for text summarization. The two approaches were measured at several compression rates on a data corpus of 100 political articles. At a compression rate of 30%, the average F-measures were 49% for MCBA, 52% for MCBA + GA, and 44% and 40% for LSA + T.R.M. at the single-document and corpus levels, respectively.
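The LSA step can be sketched in a few lines: the SVD of a term-by-sentence matrix projects each sentence into a low-dimensional semantic space. The scoring rule below (norm of the semantic vector) is one common LSA salience heuristic, not necessarily the paper's exact rule, which builds a text relationship map on top of the sentence vectors.

```python
import numpy as np

def lsa_sentence_scores(term_sentence, k=2):
    # term_sentence: terms x sentences count matrix.
    # The SVD factors the matrix into semantic dimensions; keeping the
    # top-k singular triplets gives each sentence a k-dimensional
    # semantic representation, and its norm serves as a salience score.
    U, s, Vt = np.linalg.svd(term_sentence, full_matrices=False)
    sem = s[:k, None] * Vt[:k]          # k x n_sentences semantic coords
    return np.linalg.norm(sem, axis=0)
```

A summarizer at a 30% compression rate would simply keep the top 30% of sentences by this score, in document order.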
9.
朱大明 《中国科技期刊研究》2017,28(3):251-256
[Purpose] To provide a holistic conceptual framework and theoretical reference for citing references, reviewing cited content, and standardizing citation formats in scientific editing and publishing, as well as for citation analysis and academic evaluation of scientific journals. [Methods] By analyzing the notion of scientific citation, a conceptual model of scientific citation is proposed, consisting of three elements: the scientific work, its references, and the cited content. [Results] Based on this model, the interactions among scientific works, references, and cited content are systematically analyzed from the perspective of scientific editing. [Conclusions] Analysis of the conceptual model reveals problems and deficiencies in scientific editing and publishing and in citation analysis, yields several useful insights, and suggests research topics that deserve attention.
10.
As a new form of regional innovation development, science and technology innovation corridors play an important role in concentrating innovation factors and building regional innovation networks. Building such corridors is becoming a strategic priority for many regions of China as they explore innovation practice and regional governance. However, the meaning and boundaries of these corridors remain to be clarified. Comparing them with other forms of regional innovation development, this study examines their essential nature and argues that a science and technology innovation corridor is a special regional innovation development model that carries the innovation process through to completion. Analyzing domestic and international examples from a knowledge perspective, the study further holds that the geographic space occupied by the organizations participating in innovation activities constitutes a corridor's physical boundary, that the innovation activities themselves form its virtual boundary, and that scientific and technological innovation is its development orientation. The study offers conceptual guidance to regions planning or currently building such corridors, promoting development with high efficiency, high resource utilization, and high returns.
11.
Research on text mining and a model for Chinese text mining
Text mining, also called text data mining or knowledge discovery in text, is the process of discovering implicit, previously unknown, and potentially useful patterns in large collections of text. This paper first gives an overview of text mining, including its definition, characteristics, and research status. It then analyzes the state of Chinese text mining research in China, pointing out the main problems in current research and the main research directions. Finally, a unified Chinese text mining model, UCTMF, is proposed. The model is hierarchical, open, and extensible, and provides a basic architectural framework for Chinese text mining systems.
13.
This paper studies the efficiency of the inducement created by government research funds that support basic research. Because basic research projects have the character of public goods, they should be supported by government research funds, and the level of funding should be set so as to induce the optimal allocation of basic research resources, rather than by the social value of the individual project. Policies should therefore be formulated to ensure the efficient allocation of basic research resources.
14.
With the transformation of government functions and the deepening reform of the science and technology program management system, science and technology evaluation is receiving increasing attention from management departments at all levels, and evaluation work at various levels is gradually being carried out. However, understandings of science and technology evaluation still differ considerably within the field; some even regard the concept as vague. Exploring the basic meaning of science and technology evaluation will help China design its evaluation institutions and build its evaluation system.
15.
This study used a 2 (text difficulty) × 3 (signaling type) experimental design to examine the effect of text signals on English reading comprehension. The results show that: (1) text signals had a significant main effect on the comprehension and retention of English text information; (2) there was a significant interaction between signaling and text difficulty: the signaling effect was not significant for easy texts but was significant for difficult texts; and (3) with difficult material, participants scored highest on comprehension and retention under full signaling, next highest under no signaling, and lowest under partial signaling.
16.
Building on an analysis of existing text representation methods, this paper proposes a hierarchical text representation structured by paragraphs, sentences, and words, called the text space model. On top of this model, it develops a similar-text computation algorithm that takes the paragraph as its basic unit, in order to detect similar texts. A test collection was built and detection experiments were run on it; the results show that the method finds similar texts effectively.
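The paragraph-level matching described in this abstract can be sketched with bag-of-words cosine similarity. This is a hedged simplification: the original model also layers sentences and words, while the sketch below keeps only the paragraph level, and the tokenization (whitespace split) is illustrative.

```python
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def doc_similarity(paras_a, paras_b):
    # Compare each paragraph of document A against its best match in
    # document B and average the scores; a high value flags document B
    # as a candidate similar text.
    vecs_a = [Counter(p.split()) for p in paras_a]
    vecs_b = [Counter(p.split()) for p in paras_b]
    best = [max(cosine(va, vb) for vb in vecs_b) for va in vecs_a]
    return sum(best) / len(best)
```

A detection threshold on the returned score (e.g. flagging pairs above 0.8) would complete the pipeline; the threshold value is an assumption, not from the paper.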
20.
Yogesh Sankarasubramaniam, Krishnan Ramanathan, Subhankar Ghosh 《Information processing & management》2014
Automatic text summarization has been an active field of research for many years. Several approaches have been proposed, ranging from simple position and word-frequency methods to learning and graph-based algorithms. The advent of human-generated knowledge bases like Wikipedia offers a further possibility for text summarization: they can be used to understand the input text in terms of salient concepts from the knowledge base. In this paper, we study a novel approach that leverages Wikipedia in conjunction with graph-based ranking. Our approach is to first construct a bipartite sentence-concept graph, and then rank the input sentences using iterative updates on this graph. We consider several models for the bipartite graph, and derive convergence properties under each model. We then take up personalized and query-focused summarization, where the sentence ranks additionally depend on user interests and queries, respectively. Finally, we present a Wikipedia-based multi-document summarization algorithm. An important feature of the proposed algorithms is that they enable real-time incremental summarization: users can first view an initial summary, and then request additional content if interested. We evaluate the performance of our proposed summarizer using the ROUGE metric, and the results show that leveraging Wikipedia can significantly improve summary quality. We also present results from a user study, which suggests that incremental summarization can help readers better understand news articles.
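The iterative updates on the bipartite sentence-concept graph can be sketched with a mutual-reinforcement (HITS-style) iteration: a sentence is important if it covers important concepts, and a concept is important if important sentences mention it. This is a generic sketch of iterative bipartite ranking, not the paper's exact update rule or one of its specific graph models.

```python
import numpy as np

def rank_sentences(A, iters=50):
    # A[i, j] = weight of the edge between sentence i and concept j in
    # the bipartite sentence-concept graph (e.g. how strongly sentence i
    # evokes Wikipedia concept j). Scores are refined by alternating
    # updates with normalization until they stabilize.
    s = np.ones(A.shape[0])
    c = np.ones(A.shape[1])
    for _ in range(iters):
        c = A.T @ s
        c /= np.linalg.norm(c) or 1.0
        s = A @ c
        s /= np.linalg.norm(s) or 1.0
    return s
```

An incremental summarizer could emit sentences in decreasing score order, showing the top few first and appending more on request, which matches the interaction pattern the abstract describes.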