1.
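[Purpose/Significance] Traditional methods extract a document's topics mainly from its keywords, abstract, or full text, so the extracted topics are incomplete or noisy. Starting from the semantics of the document content and incorporating citation content allows the topics of multiple documents to be extracted more accurately. [Method/Process] This paper proposes a multi-document topic extraction algorithm for scientific literature based on semantics and citation weighting. It builds a Labeled-LDA topic model from the documents' citation content and keywords to form document-topic probability vectors, clusters the documents with K-means, and extracts the topic content of each cluster. [Result/Conclusion] Data from the PubMed biomedical database are used to test the method's reliability; the results show that it extracts the topics of multiple documents accurately and comprehensively.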
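The clustering step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the document-topic probability vectors have already been produced by a topic model (the Labeled-LDA step is not shown), and the function name `kmeans` and the seeding rule are choices made here for the example.

```python
def kmeans(vectors, k, iters=20):
    """Plain k-means on dense vectors; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each document goes to its nearest centroid.
        for i, v in enumerate(vectors):
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical document-topic probabilities for four documents, two topics.
doc_topic = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
labels = kmeans(doc_topic, k=2)
```

Documents that share a label form one cluster, from which the per-cluster topic content would then be read off.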
2.
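Network public-opinion information is free-form and unordered, so research on how to organize it is of considerable importance. This article proposes a method for organizing such information: using field weighting, it clusters the information with a single-pass incremental algorithm to achieve topic-oriented organization. Fields that are highly expressive of a news topic or event are weighted to highlight that topic or event, and the unordered information is then clustered in an unsupervised, automated way to discover hot topics and thereby achieve topic detection. Experimental results show that every resulting cluster is based on a topic or event and can represent one topic, with F-measure values above 85%, which further indicates the effectiveness of the method.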
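A single-pass incremental clustering with field weighting might look roughly like the sketch below. The field names, the weights in `FIELD_WEIGHTS`, the cosine similarity, and the `threshold` value are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter

# Hypothetical field weights: title terms count more than body terms.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def weighted_vector(doc):
    """Build a field-weighted term-frequency vector for one document."""
    vec = Counter()
    for field, text in doc.items():
        w = FIELD_WEIGHTS.get(field, 1.0)
        for term in text.lower().split():
            vec[term] += w
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(docs, threshold=0.3):
    """Assign each document to the most similar cluster, or open a new one."""
    centroids, labels = [], []
    for doc in docs:
        vec = weighted_vector(doc)
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            centroids[best].update(vec)  # Counter.update adds the weights
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return labels


docs = [
    {"title": "earthquake rescue", "body": "rescue teams arrive after earthquake"},
    {"title": "earthquake aftershock", "body": "earthquake aftershock damage reported"},
    {"title": "stock market rally", "body": "shares rise as market rallies"},
]
labels = single_pass(docs)
```

Each document is seen once and in arrival order, which is what makes the scheme incremental; the price is that the result depends on that order and on the threshold.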
3.
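Subject-term indexing and subject-concept indexing methods for Chinese full-text indexing  Total citations: 20, self-citations: 3, citations by others: 17
Chinese full-text indexing is receiving increasing attention. This paper studies three issues. The first is the weighting of subject terms in full-text indexing, which takes five factors into account. The second is a new method that uses a hierarchical concept dictionary to improve the quality of subject-term indexing. The third is subject-concept indexing, in which the subject concepts used for full-text indexing are generated by three different methods. Experiments within a limited scope show that the methods have some theoretical and practical value.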
4.
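A method for measuring the recommendable quality of books based on the reader-borrowing bipartite network, and personalized book recommendation services  Total citations: 1, self-citations: 0, citations by others: 1
This paper first proposes using the characteristics of readers' borrowing behavior to judge the recommendable quality of books. Building on the bipartite network formed by reader-book borrowing relations, it designs an iterative algorithm for measuring that quality, thereby providing good candidate items for personalized book recommendation services. On this basis, and drawing on the book category hierarchy, methods for extracting and processing the semantic information of titles, and a weighted-XML-model representation of user profiles together with its weight diffusion strategy, it proposes three forms of personalized library book recommendation service: topic-specific book recommendation, revision-type recommendation based on currently borrowed books, and new-book recommendation. Finally, the paper describes the related test experiments and their results.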
5.
6.
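[Purpose/Significance] To help readers quickly grasp how an event unfolded from the massive volume of microblog posts that a hot event generates, and to improve the accuracy and readability of microblog event summaries, this paper proposes a multi-model method, based on event elements, for extracting timeline summaries of hot microblog events. [Method/Process] Taking the characteristics of microblog text into account, the method combines a topic model (LDA) with a mutual-information maximum entropy model (MaRxEnt-MI) to extract summary keywords, filters microblogs by propagation value and topic relevance, and generates a timeline summary in the form of time, summary keywords, and summary microblogs. [Result/Conclusion] On a manually annotated test set, the F-value improves by 8%-13% over the traditional TextRank method, and internal testing shows a clear improvement in summary readability. The number of experimental texts and test sets and the diversity of events need to be expanded further, and more weighting-strategy models should be considered to improve summary accuracy. The experimental results and test feedback show that the method meets users' information needs for hot-event summaries well and improves the accuracy of microblog summary extraction.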
7.
8.
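Ma Xiangdong 《现代图书情报技术》 1996, 12(3): 37-41
This paper comprehensively analyzes and compares the traditional weighted retrieval method with the Boolean retrieval method, identifies the shortcomings of weighted retrieval, and proposes a feasible improvement that largely combines the advantages of traditional weighted retrieval and Boolean retrieval.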
9.
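Information extraction is an important way to obtain valuable information from massive numbers of web pages, and judging the topic relevance of a target page's content is a key step in improving extraction efficiency and accuracy. Current relevance judgment relies mainly on manual screening and document training, which suffer from low efficiency and repeated training. This paper instead introduces a topic description model for the extraction task and uses it to judge the topic relevance of web page content. Starting from the task's topic description model, the method computes the markup-based weighted frequency of the model's keywords to represent page content quantitatively, then analyzes how the weighted keyword frequencies vary with respect to the topic description model to judge topic relevance. Finally, comparative results from applying the method to defense product information extraction show that it greatly improves the efficiency and accuracy of web information extraction.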
10.
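[Purpose/Significance] Existing keyword extraction methods are ill-suited to social Q&A community text, which is short, colloquial, and sparse. They also rarely consider how the degree of user attention affects word importance, so they cannot extract keywords from such text effectively. This paper therefore proposes a multi-attribute weighted keyword extraction method for social Q&A communities. [Method/Process] The method improves traditional TF-IDF by introducing an adjustment function and part of speech, and measures word weight comprehensively by linearly weighting and fusing four user-attention attributes: the numbers of answers, followers, views, and comments. [Result/Conclusion] Experiments show that the method extracts keywords from social Q&A community text more effectively.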
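The linear weighted fusion of attention attributes can be illustrated with a small sketch. Everything specific here is an assumption: the coefficients in `ALPHA`, the max-normalization, the smoothed idf, and the way the fused attention scales TF-IDF are illustrative choices, and the paper's adjustment function and part-of-speech factor are omitted because the abstract does not define them.

```python
import math

# Hypothetical fusion coefficients for the four attention attributes.
ALPHA = {"answers": 0.4, "follows": 0.3, "views": 0.1, "comments": 0.2}


def attention(post, maxima):
    """Linear weighted fusion of max-normalised attention attributes."""
    return sum(a * post[k] / maxima[k] for k, a in ALPHA.items() if maxima[k])


def keyword_scores(posts):
    """Score each term by TF-IDF scaled by the attention of its posts."""
    n = len(posts)
    maxima = {k: max(p[k] for p in posts) for k in ALPHA}
    df = {}
    for p in posts:
        for t in set(p["tokens"]):
            df[t] = df.get(t, 0) + 1
    scores = {}
    for p in posts:
        boost = 1.0 + attention(p, maxima)
        for t in set(p["tokens"]):
            tf = p["tokens"].count(t) / len(p["tokens"])
            idf = math.log((n + 1) / (df[t] + 1)) + 1.0  # smoothed idf
            scores[t] = scores.get(t, 0.0) + tf * idf * boost
    return scores


posts = [
    {"tokens": ["python", "list"], "answers": 10, "follows": 10, "views": 100, "comments": 5},
    {"tokens": ["java", "list"], "answers": 0, "follows": 0, "views": 0, "comments": 0},
]
scores = keyword_scores(posts)
```

With these inputs, "python" and "java" have identical TF-IDF, so any difference in their scores comes only from the attention paid to the posts that contain them.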
11.
In this article, we introduce an out-of-the-box automatic term weighting method for information retrieval. The method is based on measuring the degree of divergence from independence of terms from documents in terms of their frequency of occurrence. Divergence from independence has a well-established underlying statistical theory. It provides a plain, mathematically tractable, and nonparametric way of term weighting; moreover, it requires no term frequency normalization. Besides its sound theoretical background, the results of the experiments performed on TREC test collections show that its performance is comparable to that of the state-of-the-art term weighting methods in general. It is a simple but powerful baseline alternative to the state-of-the-art methods with its theoretical and practical aspects.
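One way to read "divergence from independence" is to compare a term's observed count in a document with the count expected if terms and documents were independent. The sketch below uses a standardized difference wrapped in a logarithm; this particular measure and the function name `dfi_weight` are assumptions for illustration, and the article may define its measures differently.

```python
import math


def dfi_weight(tf, doc_len, coll_tf, coll_len):
    """Weight a term in a document by its divergence from independence.

    tf: term count in the document; doc_len: document length in tokens;
    coll_tf: term count in the whole collection; coll_len: collection length.
    """
    expected = coll_tf * doc_len / coll_len  # count expected under independence
    if tf <= expected:
        return 0.0  # the term occurs no more often than chance predicts
    return math.log2(1.0 + (tf - expected) / math.sqrt(expected))


# A term seen 5 times in a 100-token document, 50 times in 10,000 tokens overall.
w = dfi_weight(tf=5, doc_len=100, coll_tf=50, coll_len=10_000)
```

The expected count already accounts for document length, which is consistent with the abstract's remark that no separate term frequency normalization is needed, and the formula has no tunable parameters.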
12.
This paper describes a method, using Genetic Programming, to automatically determine term weighting schemes for the vector
space model. Based on a set of queries and their human determined relevant documents, weighting schemes are evolved which
achieve a high average precision. In Information Retrieval (IR) systems, useful information for term weighting schemes is
available from the query, individual documents and the collection as a whole.
We evolve term weighting schemes in both local (within-document) and global (collection-wide) domains which interact with
each other correctly to achieve a high average precision. These weighting schemes are tested on well-known test collections
and are compared to the traditional tf-idf weighting scheme and to the BM25 weighting scheme using standard IR performance metrics.
Furthermore, we show that the global weighting schemes evolved on small collections also increase average precision on larger
TREC data. These global weighting schemes are shown to adhere to Luhn’s resolving power as both high and low frequency terms
are assigned low weights. However, the local weightings evolved on small collections do not perform as well on large collections.
We conclude that in order to evolve improved local (within-document) weighting schemes it is necessary to evolve these on
large collections.
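For reference, the two baseline schemes named in this abstract can be written down compactly. This is a sketch of common textbook forms, not the exact variants used in the paper: the idf in `bm25` uses the non-negative "+1" form found in some implementations, and the defaults `k1=1.2` and `b=0.75` are conventional values assumed here.

```python
import math


def tf_idf(tf, df, n_docs):
    """Classic tf-idf: raw term frequency times log inverse document frequency."""
    return tf * math.log(n_docs / df)


def bm25(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 weight of one term in one document."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)  # global component
    norm = k1 * (1.0 - b + b * doc_len / avg_len)  # length normalisation
    return idf * tf * (k1 + 1.0) / (tf + norm)  # local, saturating component


# Doubling tf doubles tf-idf but raises BM25 by much less.
a = tf_idf(10, 5, 1000), tf_idf(20, 5, 1000)
b = bm25(10, 5, 1000, 100, 100), bm25(20, 5, 1000, 100, 100)
```

In both functions the idf factor is the global (collection-wide) part and the tf factor is the local (within-document) part, which mirrors the local and global domains the abstract evolves separately.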
13.
14.
15.
《Journal of Informetrics》2020,14(4):101076
The effective representation of the relationship between documents and their contents is crucial for improving the performance of text classification. Term weighting is a preprocessing step that aims to represent text documents better in the vector space by assigning proper weights to terms. Because the calculation of appropriate weight values directly affects classification performance, term weighting remains an important sub-area of text classification research. In this study, we propose a novel term weighting strategy (MONO) that uses the non-occurrence information of terms more effectively than existing term weighting approaches. The proposed strategy also performs intra-class document scaling to better represent the distinguishing capabilities of terms that occur in different numbers of documents within the same number of classes. Based on the MONO strategy, two novel supervised term weighting schemes, TF-MONO and SRTF-MONO, are proposed for text classification. The schemes were tested with two classifiers, SVM and KNN, on three datasets: Reuters-21578, 20-Newsgroups, and WebKB. Their classification performance was compared with that of five existing term weighting schemes: TF-IDF, TF-IDF-ICF, TF-RF, TF-IDF-ICSDF, and TF-IGM. The results for the seven schemes show that SRTF-MONO generally outperformed the other schemes on all three datasets. Moreover, TF-MONO showed promising Micro-F1 and Macro-F1 results compared with the five benchmark term weighting methods, especially on the Reuters-21578 and 20-Newsgroups datasets.
16.
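Improving the relevance of the results returned by information retrieval systems has long been an important research topic in information retrieval. This paper first introduces the concept of query term occurrence information, then gives a formal representation of the query term occurrence weight and combines it with the BM25 model. Two methods are used to compute the query term occurrence weight: linear weighting and factor weighting. Experiments on the GOV2 dataset show that, with either method, adding the query term occurrence weight effectively improves the relevance of the retrieval results. For the TREC 2005 queries, MAP improves by 15.78% and p@10 by 3468%. The method described in this paper has been applied in the TREC 2009 WebTrack.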
17.
18.
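Research on methods for measuring the core publishers of disciplinary literature  Total citations: 2, self-citations: 0, citations by others: 2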
19.
We investigate the effect of feature weighting on document clustering, including a novel investigation of Okapi BM25 feature
weighting. Using eight document datasets and 17 well-established clustering algorithms we show that the benefit of tf-idf weighting over tf weighting is heavily dependent on both the dataset being clustered and the algorithm used. In addition, binary weighting
is shown to be consistently inferior to both tf-idf weighting and tf weighting. We investigate clustering using both BM25 term saturation in isolation and BM25 term saturation with idf, confirming that both are superior to their non-BM25 counterparts under several common clustering quality measures. Finally,
we investigate estimation of the k1 BM25 parameter when clustering. Our results indicate that typical values of k1 from other IR tasks are not appropriate for clustering; k1 needs to be higher.
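The role of k1 can be seen by looking at BM25 term saturation in isolation, without idf or length normalization. The sketch below uses the standard saturation form; the two k1 values compared are picked purely for illustration and are not taken from the paper.

```python
def bm25_saturation(tf, k1):
    """BM25 term saturation without idf or length normalisation."""
    return tf * (k1 + 1.0) / (tf + k1)


# Compare a typical retrieval setting with a much higher, hypothetical one.
for k1 in (1.2, 20.0):
    row = [round(bm25_saturation(tf, k1), 2) for tf in (1, 5, 25)]
    print(k1, row)
```

The weight is 1 at tf = 1 and approaches k1 + 1 as tf grows, so a small k1 flattens the curve almost immediately, while a larger k1 lets repeated occurrences keep contributing before the weight levels off.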