Found 20 similar documents (search time: 734 ms)
1.
2.
Research on a Document Semantic Classification Model Based on a Concept Vector Space    Total citations: 1 (self: 0, others: 1)
3.
This paper describes a method, using Genetic Programming, to automatically determine term weighting schemes for the vector
space model. Based on a set of queries and their human determined relevant documents, weighting schemes are evolved which
achieve a high average precision. In Information Retrieval (IR) systems, useful information for term weighting schemes is
available from the query, individual documents and the collection as a whole.
We evolve term weighting schemes in both local (within-document) and global (collection-wide) domains which interact with
each other correctly to achieve a high average precision. These weighting schemes are tested on well-known test collections
and are compared to the traditional tf-idf weighting scheme and to the BM25 weighting scheme using standard IR performance metrics.
Furthermore, we show that the global weighting schemes evolved on small collections also increase average precision on larger
TREC data. These global weighting schemes are shown to adhere to Luhn’s resolving power as both high and low frequency terms
are assigned low weights. However, the local weightings evolved on small collections do not perform as well on large collections.
We conclude that in order to evolve improved local (within-document) weighting schemes it is necessary to evolve these on
large collections.
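For reference, the tf-idf baseline that the evolved weighting schemes are compared against can be sketched as follows. The toy corpus and tokenization are illustrative assumptions, not data from the paper.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute classic tf-idf weights, tf * log(N / df), for a small
    corpus of pre-tokenized documents."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # within-document term frequency
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["term", "weighting", "scheme"],
        ["vector", "space", "model", "weighting"],
        ["genetic", "programming", "term"]]
w = tfidf_weights(docs)
```

A term occurring in only one of the three documents (e.g. "scheme") gets the full idf log(3), while terms shared across documents are discounted.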
4.
Most text-similarity computation methods model documents as term-frequency vectors in a Vector Space Model (VSM) weighted with TF-IDF. Drawing on the characteristics of scientific journal articles and patent documents, this paper improves the TF-IDF computation by replacing raw word-frequency counts with frequency counts of scientific and technical terms, and proposes a similarity measure tailored to scientific literature. The method first preprocesses the documents with natural language processing techniques, automatically extracts scientific and technical terms, and builds a vector space model using the term-weighting formula proposed in the paper, which is then used to compute similarities between journal articles and patent documents. Experiments on real journal and patent data show that the proposed method outperforms the traditional TF-IDF computation.
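The VSM comparison step underlying such similarity methods can be sketched as cosine similarity over term-count vectors. The term lists below are illustrative; the term extraction itself (the paper's contribution) is assumed to have happened upstream.

```python
import math
from collections import Counter

def cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two documents represented as
    lists of (already extracted) terms."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

sim = cosine_similarity(["patent", "term", "extraction"],
                        ["term", "extraction", "method"])
```

With two of three terms shared and unit counts, the similarity is 2/3.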
5.
6.
7.
8.
A standard approach to Information Retrieval (IR) is to model text as a bag of words. Alternatively, text can be modelled
as a graph, whose vertices represent words, and whose edges represent relations between the words, defined on the basis of
any meaningful statistical or linguistic relation. Given such a text graph, graph theoretic computations can be applied to measure various properties of the graph, and hence of the text. This work
explores the usefulness of such graph-based text representations for IR. Specifically, we propose a principled graph-theoretic
approach of (1) computing term weights and (2) integrating discourse aspects into retrieval. Given a text graph, whose vertices
denote terms linked by co-occurrence and grammatical modification, we use graph ranking computations (e.g. PageRank Page et al.
in The pagerank citation ranking: Bringing order to the Web. Technical report, Stanford Digital Library Technologies Project,
1998) to derive weights for each vertex, i.e. term weights, which we use to rank documents against queries. We reason that our
graph-based term weights do not necessarily need to be normalised by document length (unlike existing term weights) because
they are already scaled by their graph-ranking computation. This is a departure from existing IR ranking functions, and we
experimentally show that it performs comparably to a tuned ranking baseline, such as BM25 (Robertson et al. in NIST Special
Publication 500-236: TREC-4, 1995). In addition, we integrate into ranking graph properties such as the average path length or clustering coefficient, which
represent different aspects of the topology of the graph, and by extension of the document represented as a graph. Integrating
such properties into ranking allows us to consider issues such as discourse coherence, flow and density during retrieval.
We experimentally show that this type of ranking performs comparably to BM25, and can even outperform it, across different
TREC (Voorhees and Harman in TREC: Experiment and evaluation in information retrieval, MIT Press, 2005) datasets and evaluation measures.
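The graph-ranking idea can be sketched in a few lines: build a term co-occurrence graph and run PageRank power iteration to obtain term weights. The toy graph below is illustrative; the paper's actual graphs link terms by co-occurrence and grammatical modification.

```python
def pagerank(graph, damping=0.85, iters=50):
    """Power-iteration PageRank over a term graph.

    graph: dict mapping each term to the set of terms it links to.
    Returns a dict of graph-based term weights (used in place of
    frequency-based weights)."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # Mass received from every node m that links to n.
            incoming = sum(rank[m] / len(graph[m])
                           for m in graph if n in graph[m])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new
    return rank

# Toy graph: "retrieval" co-occurs with both other terms.
g = {"retrieval": {"term", "graph"},
     "term": {"retrieval"},
     "graph": {"retrieval"}}
weights = pagerank(g)
```

The best-connected term ("retrieval") receives the highest weight, and the weights sum to one, which is the built-in scaling the abstract appeals to when arguing that document-length normalization becomes unnecessary.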
9.
In this article, we introduce an out-of-the-box automatic term weighting method for information retrieval. The method is based on measuring the degree to which the frequency of occurrence of a term in a document diverges from independence. Divergence from independence has a well-established underlying statistical theory. It provides a plain, mathematically tractable, and nonparametric way of term weighting, and moreover it requires no term frequency normalization. Besides its sound theoretical background, the results of experiments performed on TREC test collections show that its performance is in general comparable to that of state-of-the-art term weighting methods. It is a simple but powerful baseline alternative to the state-of-the-art methods in both its theoretical and practical aspects.
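One common instantiation of a divergence-from-independence weight is the standardized residual of the observed term frequency against its expectation under independence. The formula below is an illustrative sketch of that idea, not necessarily the paper's exact variant.

```python
import math

def dfi_weight(tf, doc_len, term_total, corpus_len):
    """Standardized divergence from independence (illustrative form).

    Under independence between terms and documents, the expected
    frequency of a term in a document is proportional to the document's
    share of the corpus: e = term_total * doc_len / corpus_len.
    The weight is the standardized residual (tf - e) / sqrt(e); note
    that no explicit term-frequency normalization is applied."""
    e = term_total * doc_len / corpus_len
    return (tf - e) / math.sqrt(e)
```

A term occurring exactly as often as independence predicts gets weight zero; over-represented terms get positive weights that grow with the residual.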
10.
11.
Starting from the practical needs of information analysis, this study performs term extraction, rare-term identification, and field comparison on 5,405 patent records related to electric vehicles. The results show that key-phrase extraction is feasible; the terms extracted by mutual information occur in documents whose average length is closer to the collection's average document length; the terms in the abstract and the First Claim field differ somewhat but are equally important for classification and clustering; and the rare-term identification algorithm can discover correspondences between rare terms and high-frequency terms. The conclusions provide results and methods for patent text mining and patent information analysis, and supply reference terminology for information-analysis work.
12.
Text Categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has been an application domain for many learning approaches, which have proved effective. Nevertheless, TC poses many challenges to machine learning. In this paper, we suggest, for text categorization, integrating external WordNet lexical information to supplement training data for a semi-supervised clustering algorithm that can learn from both training and test documents to classify new unseen documents. This algorithm is the Semi-Supervised Fuzzy c-Means (ssFCM). Our experiments use the Reuters-21578 collection and consist of binary classifications for categories selected from the 115 TOPICS classes of the Reuters collection. Using the Vector Space Model, each document is represented by its original feature vector augmented with an external feature vector generated using WordNet. We verify experimentally that integrating WordNet helps ssFCM improve its performance, effectively addresses the classification of documents into categories with few training documents, and does not interfere with the use of training data.
13.
This article presents FIDJI, a question-answering (QA) system for French. FIDJI combines syntactic information with traditional
QA techniques such as named entity recognition and term weighting; it does not require any pre-processing other than classical
search engine indexing. Among other uses of syntax, we experiment with validating answers across different documents, as well
as with specific techniques for answering different types of questions (e.g., yes/no or list questions). We present several
experiments that show the benefits of syntactic analysis and of multi-document validation. Different types of questions and
corpora are tested, and their specificities are discussed. Links with result aggregation are also discussed.
14.
This paper studies methods for extracting scientific and technical terms, beyond the author-supplied keywords, from the full text of research papers. Because of indexing effects, taking only a paper's keywords as candidate terms limits both the size and the quality of a term bank, so terms must also be extracted from the paper text. Most existing term extraction methods emphasize termhood measures while neglecting unithood measures. To address this, building on the C-value algorithm, this paper proposes Chinese term-formation rules for generating candidate terms and a unithood measure of the internal association strength of a term, enabling the extraction of Chinese scientific terms from paper text. The method is validated on term extraction in the field of information resource management; the experimental results show that it extracts domain terms effectively and with high precision.
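The C-value termhood measure that this line of work builds on can be sketched as follows. The `+1` in the length factor, which lets single-word terms score above zero, is a common variant; the paper's added term-formation rules and unithood measure are not shown.

```python
import math

def c_value(term, freq, longer_terms):
    """C-value termhood score (after Frantzi & Ananiadou).

    term: candidate term as a tuple of words; freq: its frequency;
    longer_terms: {term_tuple: freq} for longer candidate terms.
    Nested occurrences inside longer terms are discounted so that
    fragments of a larger term do not score spuriously high."""
    nested = [f for t, f in longer_terms.items()
              if len(t) > len(term) and any(
                  t[i:i + len(term)] == term
                  for i in range(len(t) - len(term) + 1))]
    length_factor = math.log2(len(term) + 1)  # +1: common single-word variant
    if not nested:
        return length_factor * freq
    return length_factor * (freq - sum(nested) / len(nested))
```

For example, "information retrieval" occurring 10 times but nested 4 times inside "information retrieval system" is credited with an effective frequency of 6.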
15.
Document clustering of scientific texts using citation contexts    Total citations: 3 (self: 0, others: 3)
Document clustering has many important applications in the area of data mining and information retrieval. Many existing document
clustering techniques use the “bag-of-words” model to represent the content of a document. However, this representation is
only effective for grouping related documents when these documents share a large proportion of lexically equivalent terms.
In other words, instances of synonymy between related documents are ignored, which can reduce the effectiveness of applications
using a standard full-text document representation. To address this problem, we present a new approach for clustering scientific
documents, based on the utilization of citation contexts. A citation context is essentially the text surrounding the reference
markers used to refer to other scientific works. We hypothesize that citation contexts will provide relevant synonymous and
related vocabulary which will help increase the effectiveness of the bag-of-words representation. In this paper, we investigate
the power of these citation-specific word features, and compare them with the original document’s textual representation in
a document clustering task on two collections of labeled scientific journal papers from two distinct domains: High Energy
Physics and Genomics. We also compare these text-based clustering techniques with a link-based clustering algorithm which
determines the similarity between documents based on the number of co-citations, that is in-links represented by citing documents
and out-links represented by cited documents. Our experimental results indicate that the use of citation contexts, when combined
with the vocabulary in the full-text of the document, is a promising alternative means of capturing critical topics covered
by journal articles. More specifically, this document representation strategy when used by the clustering algorithm investigated
in this paper, outperforms both the full-text clustering approach and the link-based clustering technique on both scientific
journal datasets.
16.
Eva D’hondt Suzan Verberne Nelleke Oostdijk Jean Beney Cornelius Koster Lou Boves 《Information Retrieval》2014,17(5-6):520-544
In this paper, we quantify the existence of concept drift in patent data and examine its impact on classification accuracy. When developing algorithms for classifying incoming patent applications with respect to their category in the International Patent Classification (IPC) hierarchy, a temporal mismatch between training data and incoming documents may deteriorate classification results. We measure the effect of this temporal mismatch and aim to tackle it by optimal selection of training data. To illustrate the various aspects of concept drift at the IPC class level, we first perform quantitative analyses on a subset of English abstracts extracted from patent documents in the CLEF-IP 2011 patent corpus. In a series of classification experiments, we then show the impact of temporal variation on the classification accuracy of incoming applications. We further examine which training-data selection method, combined with our classification approach, yields the best classifier, and how combining different text representations may improve patent classification. We found that using the most recent data is a better strategy than static sampling, but that extending a set of recent training data with older documents does not harm classification performance. In addition, we confirm previous findings that using 2-skip-2-grams on top of the bag of unigrams structurally improves patent classification. Our work is an important contribution to research into concept drift for text classification and to the practice of classifying incoming patent applications.
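The 2-skip-2-gram features mentioned above can be generated in a few lines. This is a generic sketch of k-skip-bigrams, not the authors' exact implementation.

```python
def skip_bigrams(tokens, max_skip=2):
    """Generate k-skip-2-grams: ordered word pairs with at most
    max_skip intervening words. With max_skip=0 this reduces to
    ordinary bigrams."""
    grams = []
    for i, w in enumerate(tokens):
        # Partner index j may be up to max_skip positions beyond i+1.
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            grams.append((w, tokens[j]))
    return grams

grams = skip_bigrams(["a", "b", "c", "d"])
```

For four tokens this yields six pairs, including the long-range pair ("a", "d") that a plain bigram model would miss.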
17.
Yu Zhang Min Wang Florian Gottwalt Morteza Saberi Elizabeth Chang 《Journal of Informetrics》2019,13(2):616-634
As the volume of scientific articles has grown rapidly over the last decades, evaluating their impact becomes critical for tracing valuable and significant research output. Many studies have proposed ranking methods that estimate the prestige of academic papers using bibliometric techniques. However, the weight of the links in bibliometric networks has rarely been considered for article ranking in the existing literature, and such incomplete treatment can lead to biased ranking results. This study therefore introduces W-Rank, a novel scientific article ranking algorithm built on a weighting scheme that assigns weights to the links of the citation network and the authorship network by measuring citation relevance and author contribution. Combining the weighted bibliometric networks with a propagation algorithm, W-Rank obtains article rankings that are more reasonable than those of existing PageRank-based methods. Experiments are conducted on both the arXiv hep-th and Microsoft Academic Graph datasets to verify W-Rank and compare it with three renowned article ranking algorithms. The experimental results show that the proposed weighting scheme helps W-Rank achieve more accurate rankings and, from certain perspectives, outperform the other algorithms.
18.
A Fast Chinese Text Classification Technique Based on a Topic-Term Lexicon    Total citations: 1 (self: 0, others: 1)
To address the automatic classification of Chinese text, a new algorithm is proposed. Its basic idea is to build a weighted table of classification topic terms, organized as a keyword tree; strings from the document to be classified are then matched against the table using hashing with a longest-match-first policy, the weights of successful matches are summed per category, and the category with the largest weight sum is taken as the classification result. The algorithm sidesteps the difficulty of Chinese word segmentation and its influence on classification results. Theoretical analysis and experimental results show that the technique achieves high classification accuracy and time efficiency, with overall performance on par with current mainstream techniques.
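The longest-match-first scoring described above can be sketched with a plain dictionary lookup. The toy lexicons are illustrative; the paper uses a keyword tree plus hashing, so this sketch reproduces the scoring, not the speed.

```python
def classify(text, lexicons):
    """Greedy longest-match classification against weighted topic
    lexicons: {category: {keyword: weight}}. Scans the text, always
    trying the longest matching keyword first, sums matched weights
    per category, and returns the highest-scoring category."""
    scores = {c: 0.0 for c in lexicons}
    max_len = max(len(k) for lex in lexicons.values() for k in lex)
    i = 0
    while i < len(text):
        matched = 1                       # advance one char if no match
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            hit = False
            for c, lex in lexicons.items():
                if piece in lex:
                    scores[c] += lex[piece]
                    hit = True
            if hit:
                matched = length          # consume the matched keyword
                break
        i += matched
    return max(scores, key=scores.get)

lexicons = {"sports": {"football": 2.0},
            "tech": {"foot": 1.0, "software": 3.0}}
label = classify("football software", lexicons)
```

Longest-match-first matters here: "football" is consumed whole (scoring for "sports") rather than as the shorter "foot", so no segmentation step is needed.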
19.
To investigate time-lag issues in fusing multi-source scientific literature for detecting emerging disciplinary topics, this paper designs a time-lag computation scheme for multi-source scientific literature. First, disciplinary topics are extracted from four kinds of scientific-literature datasets, similarities between topics are computed, and a similarity matrix is constructed. Next, the Hungarian optimal matching algorithm is used to find the optimal pairing that minimizes similarity loss. Finally, a linear-equation model is fitted to estimate the degree of time lag. Using 337,790 abstracts in the field of agriculture from 2009-2016 as experimental data, 250 topics were extracted from funded-project texts, 260 from patent documents, 260 from journal papers, and 240 from conference papers, and the time-lag computation scheme was applied to them. The results show that journal papers lag funded-project texts and conference papers by one year, and patent documents lag journal papers by one year. Combined with earlier findings on other disciplines, this verifies the feasibility and effectiveness of the scheme and offers new ideas for designing fusion strategies for multi-source scientific literature.
20.
Markus Schedl 《Information Retrieval》2012,15(3-4):183-217
Different term weighting techniques such as TF·IDF or BM25 have been used intensively for manifold text-based information retrieval tasks. Their use for modeling term profiles of named entities, and the subsequent calculation of similarities between these named entities, has been studied to a much smaller extent. The recent trend of microblogging has made available massive amounts of information about almost every topic around the world; microblogs therefore represent a valuable source for text-based named entity modeling. In this paper, we present a systematic and comprehensive evaluation of different term weighting measures, normalization techniques, query schemes, index term sets, and similarity functions for the task of inferring similarities between named entities from data extracted from microblog posts. We analyze several thousand combinations of choices for the above-mentioned dimensions, which influence the similarity calculation process, and investigate how they impact the quality of the similarity estimates. Evaluation is performed on three real-world data sets: two collections of microblogs related to music artists and one related to movies. For the music collections, we present results of genre classification experiments using genre information from allmusic.com as benchmark. For the movie collection, we present results of multi-class classification experiments using categories from IMDb as benchmark. We show that microblogs can indeed be exploited to model named entity similarity with remarkable accuracy, provided the correct settings for the analyzed aspects are used. We further compare the results to those obtained when using Web pages as the data source.