Exploiting structural information for semi-structured document categorization
Authors: Andrej Bratko, Bogdan Filipič
Affiliation: 1. Klika, informacijske tehnologije d.o.o., Stegne 21c, SI-1000 Ljubljana, Slovenia; 2. Department of Intelligent Systems, Jozef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia
Abstract: This paper examines several different approaches to exploiting structural information in semi-structured document categorization. The methods under consideration are designed for categorization of documents consisting of a collection of fields, or arbitrary tree-structured documents that can be adequately modeled with such a flat structure. The approaches range from trivial modifications of text modeling to more elaborate schemes, specifically tailored to structured documents. We combine these methods with three different text classification algorithms and evaluate their performance on four standard datasets containing different types of semi-structured documents. The best results were obtained with stacking, an approach in which predictions based on different structural components are combined by a meta classifier. A further improvement of this method is achieved by including the flat text model in the final prediction.
Keywords: Text categorization; Semi-structured documents; Document structure; Stacked generalization; Support vector machines
This article is indexed in ScienceDirect and other databases.
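
The stacking approach summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the base classifiers or the meta classifier used here, so the tiny naive Bayes base model, the nearest-centroid meta classifier, the leave-one-out scheme, the toy data, and all names (`NaiveBayes`, `fit_stacking`, `predict`, `FIELDS`) are assumptions made for illustration. One base model is trained per field, plus one on the flat text (all fields concatenated), and their class probabilities form the input to the meta classifier.

```python
import math
from collections import Counter

class NaiveBayes:
    """Tiny multinomial naive Bayes with Laplace smoothing (illustrative base model)."""
    def fit(self, texts, labels, classes):
        self.classes, self.n = classes, len(labels)
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in classes}
        for text, y in zip(texts, labels):
            self.counts[y].update(text.lower().split())
        self.vocab = {w for c in classes for w in self.counts[c]}
        return self

    def predict_proba(self, text):
        v = len(self.vocab) + 1  # one extra slot for unseen words
        scores = []
        for c in self.classes:
            total = sum(self.counts[c].values())
            s = math.log((self.prior[c] + 1) / (self.n + len(self.classes)))
            for w in text.lower().split():
                s += math.log((self.counts[c][w] + 1) / (total + v))
            scores.append(s)
        m = max(scores)
        exp = [math.exp(s - m) for s in scores]
        return [e / sum(exp) for e in exp]

def components(doc, fields):
    # One text per structural field, plus the flat text of the whole document.
    parts = [doc.get(f, "") for f in fields]
    return parts + [" ".join(parts)]

def train_base(docs, labels, fields, classes):
    columns = zip(*(components(d, fields) for d in docs))
    return [NaiveBayes().fit(col, labels, classes) for col in columns]

def meta_features(models, doc, fields):
    # Concatenate the class probabilities predicted by every base model.
    feats = []
    for model, text in zip(models, components(doc, fields)):
        feats.extend(model.predict_proba(text))
    return feats

def fit_stacking(docs, labels, fields):
    classes = sorted(set(labels))
    # Leave-one-out: each document's meta features come from base models
    # that were trained without that document.
    meta_X = []
    for i in range(len(docs)):
        models = train_base(docs[:i] + docs[i + 1:], labels[:i] + labels[i + 1:],
                            fields, classes)
        meta_X.append(meta_features(models, docs[i], fields))
    # Meta classifier (assumed): nearest centroid in meta-feature space.
    centroids = {}
    for c in classes:
        rows = [x for x, y in zip(meta_X, labels) if y == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
    # Final base models are retrained on all documents.
    return train_base(docs, labels, fields, classes), centroids

def predict(stack, doc, fields):
    models, centroids = stack
    x = meta_features(models, doc, fields)
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

# Toy two-field documents (invented for this example).
FIELDS = ["title", "body"]
docs = [
    {"title": "football match", "body": "team won game"},
    {"title": "football final", "body": "team lost game"},
    {"title": "tennis match", "body": "player won match"},
    {"title": "tennis final", "body": "player lost match"},
    {"title": "laptop review", "body": "fast processor software"},
    {"title": "laptop update", "body": "new processor software"},
    {"title": "phone review", "body": "fast screen battery"},
    {"title": "phone update", "body": "new screen battery"},
]
labels = ["sports"] * 4 + ["tech"] * 4
stack = fit_stacking(docs, labels, FIELDS)
print(predict(stack, {"title": "football final", "body": "team won game"}, FIELDS))
```

Dropping the last element returned by `components` gives the field-only variant, which makes it easy to compare stacking with and without the flat text model on a real dataset.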