
Research and Implementation of Textual Similarity in Distributed Environment (分布式环境下的文档相似度研究与实现)

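Cite this article:	Zhao Huaming. Research and Implementation of Textual Similarity in Distributed Environment[J]. 现代图书情报技术 (New Technology of Library and Information Service), 2011, 0(Z1)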

Author:	Zhao Huaming (赵华茗)

Affiliation:	National Science Library, Chinese Academy of Sciences

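Abstract:	Traditional similarity computation methods run into bottlenecks when processing massive amounts of information, notably limits on the scale of data they can handle and insufficient performance. Taking unstructured documents as its research object, this paper proposes a document similarity computation method based on the Hadoop distributed environment that combines the Hive data processing platform with the PostgreSQL relational database. It presents the key technical ideas, the concrete implementation steps, and an empirical study. The study shows that the Hive SQL language can effectively reduce the complexity of distributed data processing, but that its real-time performance still needs improvement.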
Keywords:	Hadoop; Hive; similarity; unstructured
Research and Implementation of Textual Similarity in Distributed Environment

Zhao Huaming. Research and Implementation of Textual Similarity in Distributed Environment[J]. New Technology of Library and Information Service, 2011, 0(Z1)

Authors:	Zhao Huaming

Affiliation:	Zhao Huaming (National Science Library, Chinese Academy of Sciences, Beijing 100190, China)

Abstract:	To address the performance problems and data set size limits that traditional similarity algorithms encounter when mining massive data, this paper takes unstructured textual data as its research subject and introduces a Hadoop-based distributed textual similarity algorithm that combines the Hive data mining platform with the PostgreSQL relational database. It describes the basic technical ideas, the implementation, and the empirical research in detail. The testing result shows that Hive SQL can effectively simplify the complexity of...

Keywords:	Hadoop Hive Similarity Unstructured
This article is indexed in CNKI and other databases.
