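Multi-level Web Text Classification (多层次web文本分类)
Cite this article: Ling Yun, Liu Jun, Wang Xun. Multi-level web text classification (多层次web文本分类)[J]. 情报学报 (Journal of the China Society for Scientific and Technical Information), 2005, 24(6): 684-689.
Authors: Ling Yun (凌云), Liu Jun (刘军), Wang Xun (王勋)
Affiliation (all three authors): School of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou 310035
Funding: Zhejiang Provincial Natural Science Foundation (No. M063149)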
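Abstract: Traditional text classification is mostly based on the vector space model and uses a flat category scheme, which ignores the hierarchical relationships among categories. Based on LSA theory, this paper proposes a multi-level web text classification method. To build the class models, the category hierarchy tree is processed level by level from the bottom up, and one class model is built for each group of categories that share the same parent node. To classify a document, the tree is traversed from the top down, and at each level the document is classified in LS space using the corresponding class model. This approach addresses the difficulty of performing singular value decomposition on the high-dimensional matrices that arise in the LSA model. It also captures the semantic relationships among terms in web text and takes into account how terms are presented within a web page. Experiments show that the multi-level web text classification method achieves better recall and precision than classification methods based on a flat category scheme.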

Keywords: text classification; web page cleaning; LSA; LS space
Revised manuscript received: December 27, 2004

Multi-hierarchial Classification of Web Text
Ling Yun, Liu Jun, Wang Xun. Multi-hierarchial Classification of Web Text[J]. Journal of the China Society for Scientific and Technical Information, 2005, 24(6): 684-689.
Authors: Ling Yun, Liu Jun, Wang Xun
Abstract: Traditional text classification methods are mostly based on the vector space model and use a flat category structure, ignoring the hierarchical relationships among categories. This paper proposes a multi-hierarchy web text classification method based on LSA theory. Working from the leaves to the root of the classification tree, the method builds one classifier for each group of nodes that share the same parent node. A new web text is then classified from the root to the leaves, using the corresponding classifier in LS space at each level. This addresses a weakness of the LSA model, namely that singular value decomposition is difficult to perform on a large sparse matrix. The method not only reflects the semantic relationships among terms in web text but also takes into account how terms are presented in the web page. Experiments show that the multi-hierarchy web text classification method achieves better recall and precision than methods based on a flat structure.
Keywords: text classification; page cleaning; LSA; LS space
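
The two-phase procedure described in the abstract (class models built bottom-up for each group of sibling categories, then top-down classification in LS space) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes raw term counts, one centroid per child category, and cosine similarity in the reduced space, and it omits page cleaning and any weighting by how terms are presented in the page. All function names, the category tree, and the toy corpus are hypothetical.

```python
import numpy as np

def leaves_under(tree, node):
    """Return all leaf categories below `node` (or `node` itself if it is a leaf)."""
    if node not in tree:
        return [node]
    return [leaf for child in tree[node] for leaf in leaves_under(tree, child)]

def term_vector(tokens, index):
    """Raw term-count vector over a fixed vocabulary (assumed weighting)."""
    v = np.zeros(len(index))
    for tok in tokens:
        if tok in index:
            v[index[tok]] += 1.0
    return v

def fit_node(tree, node, corpus, k):
    """Fit one LSA model that separates the children of `node`."""
    groups = {c: [d for leaf in leaves_under(tree, c) for d in corpus[leaf]]
              for c in tree[node]}
    vocab = sorted({t for docs in groups.values() for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    docs = [(c, d) for c, ds in groups.items() for d in ds]
    # Term-by-document matrix restricted to this node's documents only,
    # so each SVD is much smaller than one SVD over the whole collection.
    A = np.column_stack([term_vector(d, index) for _, d in docs])
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    rank = max(1, min(k, int(np.sum(s > 1e-10))))
    Uk = U[:, :rank]
    centroids = {}
    for c in tree[node]:
        cols = [j for j, (label, _) in enumerate(docs) if label == c]
        centroids[c] = (Uk.T @ A[:, cols]).mean(axis=1)  # child centroid in LS space
    return {"index": index, "Uk": Uk, "centroids": centroids}

def fit_hierarchy(tree, root, corpus, k=3):
    """Build models bottom-up: children are visited before their parent."""
    models = {}
    def visit(node):
        if node not in tree:
            return
        for child in tree[node]:
            visit(child)
        models[node] = fit_node(tree, node, corpus, k)
    visit(root)
    return models

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def classify(tree, root, models, tokens):
    """Descend from the root, choosing the most similar child at each level."""
    node = root
    while node in tree:
        m = models[node]
        q = m["Uk"].T @ term_vector(tokens, m["index"])  # project query into LS space
        node = max(tree[node], key=lambda c: cosine(q, m["centroids"][c]))
    return node

# Hypothetical two-level category tree and toy tokenized corpus.
tree = {"root": ["science", "sports"],
        "science": ["physics", "biology"],
        "sports": ["football", "tennis"]}
corpus = {
    "physics": [["quantum", "energy", "particle"], ["energy", "particle", "field"]],
    "biology": [["cell", "gene", "protein"], ["gene", "cell", "organism"]],
    "football": [["goal", "match", "team"], ["team", "goal", "league"]],
    "tennis": [["serve", "match", "racket"], ["racket", "serve", "set"]],
}
models = fit_hierarchy(tree, "root", corpus, k=3)
print(classify(tree, "root", models, ["gene", "protein"]))
```

Because each node's model is fitted only on the documents below that node, every decomposition involves a smaller matrix than a single flat LSA model over all categories would, which is the motivation the abstract gives for the hierarchical design.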
This article is indexed in CNKI, Wanfang Data, and other databases.