
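A News Extraction Method Based on Information Entropy of Word Clustering
Cite this article: NIU Wei-nong, WU Lin, YU Shui-yuan. A News Extraction Method Based on Information Entropy of Word Clustering[J]. Introduction of Educational Technology, 2020, 19(1): 36-40.
Authors: NIU Wei-nong, WU Lin, YU Shui-yuan
Affiliation: Key Laboratory of Convergent Media and Intelligent Technology of Ministry of Education, Communication University of China, Beijing 100024, China
Funding: Communication University of China Young Science and Engineering Planning Project (3132018XNG1834)
Abstract: The rapid development of the Internet has brought convenience to the public but has also produced a large amount of redundant information. Using natural language processing techniques to extract articles on new topics, and thereby curb the spread of fake news within those topics, can provide effective support for public opinion management. This paper proposes a news extraction method based on the information entropy of word clusters and evaluates it on a corpus of news related to the “Belt and Road” initiative. Relevant reports are collected with a web crawler. The text is segmented with the Pkuseg tool and preprocessed, and Word2vec word vectors are then trained on it. Word frequency statistics are used to select historical high-frequency words, which are grouped by K-means clustering, and the resulting word clusters serve as the random variable for computing the information entropy of the current article. An article whose information entropy exceeds a set threshold is treated as a new-topic article that warrants close attention. The results show that, with the threshold set to 0.65, the accuracy of the news extraction results can reach 84%.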

Keywords: news extraction; new topic; word vector; clustering; information entropy
Received: 2019-08-28
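
The entropy test described in the abstract can be written with the standard Shannon definition. This is only a sketch of one plausible reading: the symbols K (number of word clusters), n_i (count of the article's words that fall in cluster i), p_i, H, and θ are notation introduced here, and the abstract does not state the logarithm base or whether the entropy is normalized.

```latex
p_i = \frac{n_i}{\sum_{j=1}^{K} n_j}, \qquad
H = -\sum_{i=1}^{K} p_i \log p_i \quad (\text{with } 0 \log 0 := 0), \qquad
\text{new-topic article} \iff H > \theta
```

Under this reading, an article whose words concentrate in one cluster has H close to 0, while an article whose words spread evenly over all K clusters has H = log K. The abstract reports its 84% accuracy at θ = 0.65.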

A News Extraction Method Based on Information Entropy of Word Clustering
NIU Wei-nong, WU Lin, YU Shui-yuan. A News Extraction Method Based on Information Entropy of Word Clustering[J]. Introduction of Educational Technology, 2020, 19(1): 36-40.
Authors: NIU Wei-nong, WU Lin, YU Shui-yuan
Institution: Key Laboratory of Convergent Media and Intelligent Technology of Ministry of Education, Communication University of China, Beijing 100024, China
Abstract: The rapid development of the Internet has brought convenience to the public while generating a large amount of redundant information. Using natural language processing techniques to extract new-topic articles can provide effective support for public opinion control. This paper proposes a news extraction method based on word clustering information entropy and conducts experiments on a news corpus related to the “One Belt, One Road” initiative. Relevant reports are obtained by web crawling. The corpus is segmented with the Pkuseg tool and then preprocessed, including removal of stop words and background words. Word2vec word vectors are then trained on the processed corpus. Word frequency statistics are used to select historical high-frequency words, which are grouped by k-means clustering. The resulting word clusters are used as the random variable for calculating the information entropy of the current article. If the information entropy of an article is higher than the set threshold, it is treated as a new-topic article that needs close attention. The results show that the accuracy of the news extraction results can reach 84% when the threshold is set to 0.65.
Keywords: news extraction; new topic; word vector; clustering; information entropy
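
The final scoring step of the abstract (word clusters as the random variable, entropy compared against a threshold) might look roughly like the Python sketch below. It is not the authors' implementation. The function names `cluster_entropy` and `is_new_topic`, the toy `word_to_cluster` mapping, the choice to skip words outside the clustered vocabulary, and the normalization by log K are all assumptions made for illustration. The segmentation, Word2vec training, and k-means steps are assumed to have already produced the word-to-cluster mapping.

```python
import math
from collections import Counter


def cluster_entropy(tokens, word_to_cluster, n_clusters):
    """Shannon entropy of an article's distribution over word clusters.

    Assumption: tokens missing from word_to_cluster are ignored, and the
    result is divided by log(n_clusters) so that it lies in [0, 1].
    """
    labels = [word_to_cluster[t] for t in tokens if t in word_to_cluster]
    if not labels or n_clusters < 2:
        return 0.0
    total = len(labels)
    counts = Counter(labels)
    # H = sum_i p_i * log(1 / p_i), with p_i = count_i / total
    h = sum((c / total) * math.log(total / c) for c in counts.values())
    return h / math.log(n_clusters)


def is_new_topic(tokens, word_to_cluster, n_clusters, threshold=0.65):
    """Flag an article whose cluster entropy exceeds the threshold."""
    return cluster_entropy(tokens, word_to_cluster, n_clusters) > threshold


# Hypothetical mapping, standing in for k-means labels of high-frequency words.
word_to_cluster = {
    "trade": 0, "tariff": 0,
    "port": 1, "railway": 1,
    "loan": 2, "bank": 2,
}

focused = ["trade", "tariff", "trade", "trade"]  # all words in one cluster
spread = ["trade", "port", "loan"]               # one word per cluster

focused_score = cluster_entropy(focused, word_to_cluster, 3)
spread_score = cluster_entropy(spread, word_to_cluster, 3)
```

In this toy setup the article that stays inside one cluster scores 0 and is not flagged, while the article spread evenly over all three clusters scores about 1 and is flagged. The default of 0.65 is the threshold value reported in the abstract, but whether it applies to a normalized entropy like this one is not stated there.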