首页 | 本学科首页   官方微博 | 高级检索  
     


Preferential text classification: learning algorithms and evaluation measures
Authors:Fabio Aiolli  Riccardo Cardin  Fabrizio Sebastiani  Alessandro Sperduti
Affiliation:1. Dipartimento di Matematica Pura e Applicata, Università di Padova, Via Trieste, 63-35121, Padova, Italy
2. Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Via Giuseppe Moruzzi, 1-56124, Pisa, Italy
Abstract:In many applicative contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well established classification problems, such as single and/or multi-label multiclass classification; state-of-the-art learning technology such as SVMs and kernel-based methods are used. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form “for document d i , category c′ is preferred to category c′′”; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.
Keywords:
本文献已被 SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号