

A comparison of feature selection methods for an evolving RSS feed corpus
Authors: Rudy Prabowo, Mike Thelwall
Institution: School of Computing and Information Technology, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1SB, UK
Abstract: Previous researchers have attempted to detect significant topics in news stories and blogs by applying word frequency-based methods to RSS feeds. In this paper, three statistical feature selection methods, χ², Mutual Information (MI) and Information Gain (I), are proposed as alternative approaches for ranking term significance in an evolving RSS feed corpus. The extent to which the three methods agree with each other on the degree of significance of a term on a given date is investigated, as is the assumption that larger values tend to indicate more significant terms. An experimental evaluation was carried out with 39 different levels of data reduction to evaluate the three methods for differing degrees of significance. The three methods showed a significant degree of disagreement for a number of terms assigned an extremely large value. Hence, the assumption that the larger a value, the more significant the term should be treated cautiously. Moreover, MI and I show significant disagreement. This suggests that MI ranks significant terms differently, as MI does not take the absence of a term into account, whereas I does. I, however, has a higher degree of term reduction than MI and χ², which can result in losing some significant terms. In summary, χ² seems to be the best method for determining term significance for RSS feeds, as it identifies both types of significant behavior. The χ² method, however, is far from perfect, as an extremely high value can be assigned to relatively insignificant terms.
Keywords: Feature selection, Chi-square, Mutual information, Information gain
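
The abstract does not give the formulas used, so the following is only a minimal sketch under an assumption: that the three scores follow their common text-categorization definitions over a 2x2 term/date contingency table. The cell names `a`, `b`, `c`, `d` and the function names are hypothetical, chosen for illustration. It shows why MI and I can disagree: MI here looks only at the cell where the term is present on the date, while I sums over all four cells, including those where the term is absent.

```python
import math

# Hypothetical 2x2 contingency counts for one term and one date:
#   a = items on the date containing the term
#   b = items on other dates containing the term
#   c = items on the date NOT containing the term
#   d = items on other dates NOT containing the term

def chi_square(a, b, c, d):
    """Chi-square statistic for term/date dependence (assumed standard form)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def mutual_information(a, b, c, d):
    """Pointwise MI of 'term present' and 'on date'; ignores the absence cells."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never occurs on the date
    return math.log2(a * n / ((a + c) * (a + b)))

def information_gain(a, b, c, d):
    """Expected MI over all four cells, so term absence also contributes."""
    n = a + b + c + d
    cells = (
        (a, a + b, a + c),  # term present, on date
        (b, a + b, b + d),  # term present, other dates
        (c, c + d, a + c),  # term absent, on date
        (d, c + d, b + d),  # term absent, other dates
    )
    total = 0.0
    for joint, row, col in cells:
        if joint:  # a zero cell contributes nothing (0 * log 0 := 0)
            total += (joint / n) * math.log2(joint * n / (row * col))
    return total

if __name__ == "__main__":
    # A term that is entirely absent on the date but common elsewhere.
    counts = (0, 20, 20, 0)
    print(chi_square(*counts), mutual_information(*counts), information_gain(*counts))
```

With the counts in the usage example, the term never appears on the date in question: the MI sketch returns negative infinity, whereas the I and χ² sketches both return large positive values because the absence itself is informative. This is one reading of the "both types of significant behavior" mentioned in the abstract, not a reproduction of the authors' experiments.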
This article is indexed in ScienceDirect and other databases.