A Method of Geologic Words Identification Based on Statistics
Citation: WANG Hong, ZHU Xue-li, ZENG Tao, QIAO Dong-yu, GUO Jia-teng. A Method of Geologic Words Identification Based on Statistics[J]. Introduction of Educational Technology, 2020, 19(4): 211-218. DOI: 10.11907/rjdk.191648
Authors: WANG Hong, ZHU Xue-li, ZENG Tao, QIAO Dong-yu, GUO Jia-teng
Affiliation: 1. Henan Institute of Geological Survey; 2. Henan Key Laboratory for Metalogenetic Process of Metal Mineral Resource and Resource Utilization, Zhengzhou 450000, China; 3. School of Resources and Civil Engineering, Northeastern University, Shenyang 110000, China
Funding: National Natural Science Foundation of China (41671404); Fundamental Research Funds for the Central Universities (N170104019); China Geological Survey Intelligent Geological Survey Support Platform Construction Project (DD20160355)
Received: 2019-05-05

Abstract: Chinese word segmentation is an unavoidable first step in intelligent knowledge mining from geological big data. Statistical segmentation methods depend on the training corpus and adapt poorly across domains. Dictionary-based methods can use a domain dictionary directly, but they cannot recognize unlisted (out-of-vocabulary) words. To improve segmentation accuracy and unlisted-word recognition for geological text when the domain corpus is insufficient, a statistics-based method for recognizing Chinese geological words is proposed. The method builds a basic geological dictionary based on the idea of prime strings, which improves the adaptability of statistical segmentation to geological text. A repeated-string search produces a candidate set of geological words, and the candidates are then filtered using context adjacency and a position-based word-formation probability dictionary to yield the recognized geological words. Experimental results show that the method reaches an accuracy of 81.6% in recognizing geological terms, nearly 60% higher than a general statistical word segmentation method. The method can identify unlisted words in geological text while maintaining segmentation accuracy, and it can be applied to geological text segmentation.

Keywords: geologic text; Chinese word segmentation; prime string; repeated string; context adjacency analysis; position word probability
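
The abstract names three steps: repeated-string search, context adjacency filtering, and filtering by position-based word-formation probability. The following is a minimal Python sketch of how such a pipeline could look, not the authors' implementation. The function names, the thresholds `min_variety` and `min_score`, the neighbour-variety measure, and the geometric-mean scoring formula are all assumptions made for illustration; the prime-string dictionary construction is not reproduced here, and a small hand-written word list stands in for it.

```python
from collections import Counter, defaultdict


def repeated_strings(text, min_len=2, max_len=6, min_count=2):
    """Count all substrings of length min_len..max_len; keep those seen min_count+ times."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


def adjacency_variety(text, candidate):
    """Return (distinct left neighbours, distinct right neighbours) of a candidate."""
    left, right = set(), set()
    start = text.find(candidate)
    while start != -1:
        if start > 0:
            left.add(text[start - 1])
        end = start + len(candidate)
        if end < len(text):
            right.add(text[end])
        start = text.find(candidate, start + 1)
    return len(left), len(right)


def build_position_probs(dictionary):
    """Estimate, per character, how often it begins (B), sits inside (M), or ends (E) a word."""
    counts = defaultdict(lambda: {"B": 0, "M": 0, "E": 0})
    for word in dictionary:
        if len(word) < 2:
            continue
        counts[word[0]]["B"] += 1
        counts[word[-1]]["E"] += 1
        for ch in word[1:-1]:
            counts[ch]["M"] += 1
    probs = {}
    for ch, c in counts.items():
        total = c["B"] + c["M"] + c["E"]
        probs[ch] = {pos: n / total for pos, n in c.items()}
    return probs


def word_formation_score(candidate, probs):
    """Geometric mean of per-character position probabilities (assumed scoring rule)."""
    if len(candidate) < 2:
        return 0.0
    tags = ["B"] + ["M"] * (len(candidate) - 2) + ["E"]
    score = 1.0
    for ch, tag in zip(candidate, tags):
        # Characters missing from the dictionary get probability 0.0.
        score *= probs.get(ch, {}).get(tag, 0.0)
    return score ** (1.0 / len(candidate))


def recognize(text, dictionary, min_variety=2, min_score=0.3):
    """Candidates from repeated strings, filtered by adjacency and then by position score."""
    probs = build_position_probs(dictionary)
    result = []
    for cand in repeated_strings(text):
        left, right = adjacency_variety(text, cand)
        if left < min_variety or right < min_variety:
            continue  # fixed neighbours suggest a fragment of a longer word
        if word_formation_score(cand, probs) < min_score:
            continue  # characters rarely occupy these positions in known words
        result.append(cand)
    return sorted(result)


if __name__ == "__main__":
    sample = "在花岗岩中，与花岗岩的，见花岗岩及"
    base_words = ["花岗石", "玄武岩", "石灰岩", "岩石"]  # hypothetical base dictionary
    print(recognize(sample, base_words))
```

In this toy input, the fragments 花岗 and 岗岩 repeat as often as 花岗岩 but always have the same neighbour on one side, so the adjacency check discards them, while 花岗岩 survives even though it is absent from the base word list. Real corpora would need tuned thresholds and sentence-boundary handling, which this sketch omits.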