
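Research on Scale Adaptation of Text Sentiment Analysis Algorithms in a Big Data Environment: Using Twitter as the Data Source
Cite this article: Yu Chuanming, Yuan Sai, Wang Feng, An Lu. Research on Scale Adaptation of Text Sentiment Analysis Algorithm in Big Data Environment: Using Twitter as Data Source[J]. Library and Information Service (图书情报工作), 2019, 63(4): 101-111.
Authors: Yu Chuanming  Yuan Sai  Wang Feng  An Lu
Affiliations: 1. School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073; 2. School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan 430073; 3. School of Information Management, Wuhan University, Wuhan 430072
Funding: This paper is one of the research outcomes of the National Natural Science Foundation of China General Program project "Research on Opinion Retrieval Based on Domain Knowledge Acquisition and Alignment in a Big Data Environment" (Grant No. 71373286) and the Ministry of Education Major Project of Philosophy and Social Sciences Research "Research on Countermeasures for Improving Counter-Terrorism Intelligence and Information Capabilities" (Grant No. 17JZD034).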
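Abstract: [Purpose/Significance] Taking text sentiment analysis in a big data environment as the specific task, this paper studies the scale adaptation problem, offering a reference for information science researchers who need to balance efficiency against cost when analyzing data in big data environments. [Method/Process] Using Stanford University's Sentiment140 dataset, and building on an analysis of traditional sentiment analysis algorithms, the paper proposes five text sentiment analysis algorithms for big data, tests how well each algorithm adapts to different environments and data scales, and empirically compares them in terms of accuracy, scalability, and efficiency. [Result/Conclusion] The experimental results show that the cluster built in this study has good runtime efficiency, correctness, and scalability. The Spark cluster is more efficient when processing massive text sentiment analysis data, and the larger the data scale, the more pronounced its efficiency advantage. In terms of resource utilization, the overall runtime efficiency of the cluster changes significantly as the number of nodes and cores increases; configuring five slave nodes, each with 4 cores and 4 GB of memory, completes the classification task efficiently while saving resource costs.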

Keywords: scale adaptation  big data  massive text  sentiment analysis  machine learning algorithm
Received: 2018-05-09

Research on Scale Adaptation of Text Sentiment Analysis Algorithm in Big Data Environment: Using Twitter as Data Source
Yu Chuanming,Yuan Sai,Wang Feng,An Lu.Research on Scale Adaptation of Text Sentiment Analysis Algorithm in Big Data Environment: Using Twitter as Data Source[J].Library and Information Service,2019,63(4):101-111.
Authors:Yu Chuanming  Yuan Sai  Wang Feng  An Lu
Institution:1. School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073; 2. School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan 430073; 3. School of Information Management, Wuhan University, Wuhan 430072
Abstract: [Purpose/significance] This paper studies the scale adaptation problem for the task of text sentiment analysis in a big data environment. It offers a reference for information science researchers who need to balance efficiency against cost when analyzing data in big data environments. [Method/process] We use Stanford University's Sentiment140 dataset. Building on an analysis of traditional sentiment analysis algorithms, we propose five text sentiment analysis algorithms for big data, test how well each algorithm adapts to different environments and data sizes, and conduct empirical comparisons in terms of accuracy, scalability, and efficiency. [Result/conclusion] The experimental results show that the cluster built in this paper has good operational efficiency, correctness, and scalability. The Spark cluster is more efficient at processing large-scale text sentiment analysis data, and its efficiency advantage becomes more pronounced as the data size grows. In terms of resource utilization, the overall operating efficiency of the cluster changes significantly as the number of nodes and cores increases. We find that a configuration of five slave nodes, each with 4 cores and 4 GB of memory, completes the classification task efficiently while saving resource costs.
Keywords:scale adaptation  big data  massive text  sentiment analysis  machine learning algorithm  
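The abstract does not say how scalability and efficiency were measured. As a minimal sketch, assuming the conventional definitions of speedup (baseline runtime divided by cluster runtime) and parallel efficiency (speedup divided by total cores), the recommended configuration of five nodes with 4 cores and 4 GB each could be described as below. The names `ClusterConfig`, `speedup`, and `parallel_efficiency` are hypothetical and do not come from the paper, and no runtimes from the paper are reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical description of a cluster's worker (slave) nodes."""
    workers: int
    cores_per_worker: int
    memory_gb_per_worker: int

    @property
    def total_cores(self) -> int:
        return self.workers * self.cores_per_worker

    @property
    def total_memory_gb(self) -> int:
        return self.workers * self.memory_gb_per_worker


def speedup(baseline_seconds: float, cluster_seconds: float) -> float:
    """Conventional speedup: baseline runtime divided by cluster runtime."""
    if baseline_seconds <= 0 or cluster_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / cluster_seconds


def parallel_efficiency(baseline_seconds: float,
                        cluster_seconds: float,
                        config: ClusterConfig) -> float:
    """Speedup per core; assumes the baseline ran on a single core."""
    return speedup(baseline_seconds, cluster_seconds) / config.total_cores


# The configuration named in the abstract: five nodes, 4 cores and 4 GB each.
recommended = ClusterConfig(workers=5, cores_per_worker=4, memory_gb_per_worker=4)
print(recommended.total_cores, recommended.total_memory_gb)
```

Measured runtimes for a single-core baseline and for the cluster would be passed to `parallel_efficiency` to compare configurations; a value that falls as nodes are added suggests the extra resources are not being used cost-effectively.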