A Chinese Word Segmentation Method Based on Professional Term Extraction |
| |
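Cite this article: | Zheng Yang, Mo Jianwen. A Chinese word segmentation method based on professional term extraction [J]. Dazhong Keji (大众科技), 2012, 14(4): 20-23 |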
| |
Authors: | Zheng Yang (郑阳), Mo Jianwen (莫建文) |
| |
Affiliation: | Guilin University of Electronic Technology, Guilin, Guangxi 541004, China |
| |
Funding: | Guangxi Natural Science Foundation; Guangxi Science and Technology Development Project |
| |
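Abstract: | In scientific and technical literature, out-of-vocabulary words and other domain-specific terms vary widely in form and are hard to recognize during Chinese word segmentation, which reduces segmentation accuracy for articles in specialized fields. To address this, a Chinese word segmentation method based on professional term extraction is proposed. Working from a large domain-specific corpus, the method uses mutual information and statistical measures to extract out-of-vocabulary words and other professional terms, builds a professional term dictionary from them, combines that dictionary with a general-purpose word dictionary, and segments text with the maximum matching method. Experiments show that the method extracts the relevant professional terms fairly accurately and thereby improves segmentation precision, so it has practical value.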
|
Keywords: | professional terms; mutual information; out-of-vocabulary words; forward maximum matching; Chinese word segmentation |
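The abstract does not say how mutual information is computed or which thresholds are used, so the following is only a minimal sketch of one common reading: score each adjacent character pair by pointwise mutual information (PMI) and keep frequent, strongly associated pairs as term candidates. The function name `extract_term_candidates`, the parameters `min_count` and `min_pmi`, and the toy corpus are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def extract_term_candidates(corpus, min_count=2, min_pmi=1.0):
    """Score adjacent character pairs by PMI and keep the strong ones.

    corpus: iterable of strings (sentences from a domain corpus).
    min_count and min_pmi are illustrative thresholds, not values from the paper.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        # Count every adjacent two-character window in the sentence.
        pair_counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())

    candidates = {}
    for pair, count in pair_counts.items():
        if count < min_count:
            continue  # drop rare pairs; their PMI estimates are unreliable
        p_xy = count / total_pairs
        p_x = char_counts[pair[0]] / total_chars
        p_y = char_counts[pair[1]] / total_chars
        pmi = math.log2(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            candidates[pair] = pmi
    return candidates


toy_corpus = ["互信息用于分词", "互信息提取术语", "我们用互信息"]
print(sorted(extract_term_candidates(toy_corpus)))
# ['互信', '信息']
```

Candidates that overlap, such as 互信 and 信息 here, would still need to be merged into longer terms (互信息) before they go into a term dictionary; that step is left out of this sketch.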
A Chinese word segmentation method based on professional term extraction |
| |
Abstract: | In science and technology literature, unknown words such as domain-specific professional terms take many forms and are hard to identify, which lowers the accuracy of Chinese word segmentation. This paper presents a Chinese word segmentation method based on professional term extraction. Using a large corpus from a specific domain, the method applies mutual information and statistical techniques to extract unknown words such as professional terms, builds a professional term dictionary, combines it with a general word dictionary, and applies the forward maximum matching algorithm to segment Chinese text. Experiments show that this method extracts professional terms fairly accurately and improves segmentation accuracy, which gives it practical application value. |
| |
Keywords: | professional term; mutual information; out-of-vocabulary; forward maximum matching algorithm; Chinese word segmentation |
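The segmentation step named in the abstract and keywords, forward maximum matching over a general dictionary merged with an extracted term dictionary, can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `forward_max_match` and both toy dictionaries are hypothetical, and the record gives no details of the authors' dictionaries or matching window.

```python
def forward_max_match(text, dictionary):
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    # The longest dictionary entry bounds the matching window.
    max_word_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a word matches.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical dictionaries for illustration only.
general_dict = {"基于", "的", "中文", "分词", "方法"}
term_dict = {"互信息", "未登录词"}

print(forward_max_match("基于互信息的中文分词方法", general_dict | term_dict))
# ['基于', '互信息', '的', '中文', '分词', '方法']
```

With `general_dict` alone, the same sentence splits 互信息 into three single characters, which is the kind of error a domain term dictionary is meant to prevent.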
This article is indexed in CNKI, Wanfang Data, and other databases. |
|