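CHI Feature Selection Method Based on Category (基于类别的CHI特征选择方法)
Cite this article: Liang Wuqi, Li Bin, Xu Lei. CHI Feature Selection Method Based on Category [J]. Journal of Anhui Radio & TV University (安徽广播电视大学学报), 2015(3): 124-128.
Authors: Liang Wuqi (梁伍七), Li Bin (李斌), Xu Lei (许磊)
Affiliation: School of Information and Engineering, Anhui Radio & TV University, Hefei 230022
Abstract: In text categorization, chi-square (CHI) feature selection is one of the more effective feature selection methods. To compute the chi-square value of a word, the conventional approach first computes the word's chi-square value for each category and then combines these values, weighted by category probability, into a harmonic mean that serves as the word's chi-square value for the entire training set. This global approach ignores the correlation between words and categories. To address this problem, a category-based chi-square feature selection method is proposed. The category-based method selects feature words for each category separately. The number of feature words is computed from a preset threshold, the number of documents in the category, and the number of documents in the entire training set, and the feature spaces of different categories may contain the same feature words. Using the KNN classification method, the category-based method is compared with the global method, and the experimental results show that the category-based method can improve the overall performance of the classifier.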

Keywords: text categorization; chi-square; feature selection; feature word; KNN classification
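
For reference, the chi-square statistic as it is commonly defined in the text categorization literature can be written as follows. The notation is an assumption and is not taken from the paper: N is the number of training documents, A and B are the numbers of documents inside and outside category c that contain term t, and C and D are the numbers of documents inside and outside c that do not contain t. The second line shows the usual probability-weighted global score over m categories; the abstract describes this combination step as a harmonic mean, so the paper's exact form may differ.

```latex
\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}

\chi^2_{\mathrm{avg}}(t) = \sum_{i=1}^{m} P(c_i)\,\chi^2(t, c_i)
```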

CHI Feature Selection Method Based on Category
Abstract: In text categorization, chi-square feature selection is an effective feature selection method. When the chi-square value of a word is calculated, the chi-square value for each category is calculated first, and then a harmonic mean is calculated from the category probabilities and the chi-square values, which serves as the chi-square value of the word for the entire training set. This global approach ignores the correlation between words and categories. To address this problem, a chi-square feature selection method based on category is proposed, which chooses feature words for each category. The number of feature words is calculated from a preset threshold, the number of documents in the category, and the number of documents in the entire training set. The feature spaces of different categories may contain the same feature words. Using the K Nearest Neighbor (KNN) method, it is compared with the global feature selection approach. Experimental results show that the chi-square feature selection method based on category can improve the overall performance of the classifier.
Keywords: text categorization; chi-square; feature selection; feature word; KNN categorization
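
The contrast between the global and the category-based selection described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the global score uses a category-probability-weighted sum, and the per-category quota rule `ceil(threshold * n_c / N)` is an assumed stand-in, since the abstract only says the quota depends on a preset threshold, the category's document count, and the total document count.

```python
import math
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Chi-square of a 2x2 term/category table.

    a: docs in the category containing the term
    b: docs outside the category containing the term
    c: docs in the category without the term
    d: docs outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def chi_table(docs, labels):
    """Return ({category: {term: chi2}}, Counter of docs per category)."""
    n = len(docs)
    cat_count = Counter(labels)
    df = Counter()                 # document frequency of each term
    df_cat = defaultdict(Counter)  # document frequency within each category
    for words, label in zip(docs, labels):
        for w in set(words):
            df[w] += 1
            df_cat[label][w] += 1
    table = {}
    for cat, n_c in cat_count.items():
        table[cat] = {}
        for w, total in df.items():
            a = df_cat[cat][w]
            b = total - a
            c = n_c - a
            d = n - n_c - b
            table[cat][w] = chi_square(a, b, c, d)
    return table, cat_count


def select_global(table, cat_count, k):
    """Global selection: top-k terms by probability-weighted chi-square."""
    n = sum(cat_count.values())
    terms = next(iter(table.values())).keys()
    score = {w: sum(cat_count[c] / n * table[c][w] for c in table) for w in terms}
    return sorted(score, key=lambda w: (-score[w], w))[:k]


def select_per_category(table, cat_count, threshold):
    """Category-based selection: each category gets its own top terms.

    The quota rule below is an assumption made for illustration only.
    """
    n = sum(cat_count.values())
    selected = {}
    for cat, scores in table.items():
        quota = max(1, math.ceil(threshold * cat_count[cat] / n))
        selected[cat] = sorted(scores, key=lambda w: (-scores[w], w))[:quota]
    return selected


if __name__ == "__main__":
    docs = [["ball", "team", "win"], ["ball", "goal", "team"],
            ["goal", "team", "match"], ["vote", "party", "win"]]
    labels = ["sports", "sports", "sports", "politics"]
    table, cat_count = chi_table(docs, labels)
    print(select_global(table, cat_count, k=3))
    print(select_per_category(table, cat_count, threshold=4))
```

In this sketch the per-category result is a dictionary of term lists, so the same word can appear under several categories, which matches the abstract's remark that the feature spaces of different categories may overlap.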
This article is indexed in Wanfang Data (万方数据) and other databases.