

Collection-based compound noun segmentation for Korean information retrieval
Authors:In-Su Kang  Seung-Hoon Na  Jong-Hyeok Lee
Institution:(1) Division of Electrical and Computer Engineering, Pohang University of Science and Technology (POSTECH), Advanced Information Technology Research Center (AITrc), PIRL 323, San 31, Hyoja-dong, Nam-gu, Pohang, 790-784, Republic of Korea
Abstract:Compound noun segmentation is a key first step in language processing for Korean. Thus far, most approaches have required some form of human supervision, such as pre-existing dictionaries, segmented compound nouns, or heuristic rules. As a result, they suffer from the unknown word problem, which unsupervised approaches can overcome. However, previous unsupervised methods normally do not consider all possible segmentation candidates, and/or rely on character-based segmentation clues such as bi-grams or all-length n-grams, so they are prone to settling on a locally optimal solution. To overcome this problem, this paper proposes an unsupervised segmentation algorithm that searches for the most likely segmentation among all possible segmentation candidates using a word-based segmentation context. To supply the word-based segmentation clues, a dictionary is automatically generated from a corpus. Experiments using three test collections show that the segmentation algorithm can be successfully applied to Korean information retrieval, improving on a dictionary-based longest-matching algorithm.
Contact Information:Jong-Hyeok Lee
Keywords:Compound noun segmentation  Unsupervised method  Korean information retrieval
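
The contrast drawn in the abstract, between searching all segmentation candidates and greedy dictionary-based longest matching, can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not give their scoring model or how the dictionary is built from the corpus, so the code below assumes a plain unigram word-probability score and a hand-made toy dictionary (`toy_counts`, with invented counts). The function names `segment` and `longest_match` are likewise hypothetical.

```python
import math

def segment(compound, counts):
    """Pick the highest-scoring split of `compound` into dictionary words.

    Assumption: the score is a sum of unigram log-probabilities; the paper's
    actual word-based context model is not specified in the abstract.
    """
    total = sum(counts.values())
    n = len(compound)
    # best[i] = (best log-probability of compound[:i], start of its last word)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            word = compound[start:end]
            if word in counts and best[start][0] > -math.inf:
                score = best[start][0] + math.log(counts[word] / total)
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return [compound]  # no full cover by dictionary words: leave unsplit
    # Walk the back-pointers from the end of the string to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(compound[start:end])
        end = start
    return words[::-1]

def longest_match(compound, counts):
    """Greedy left-to-right longest-matching baseline."""
    words, pos, n = [], 0, len(compound)
    while pos < n:
        for end in range(n, pos, -1):
            if compound[pos:end] in counts:
                break
        else:
            end = pos + 1  # unknown character: emit it on its own
        words.append(compound[pos:end])
        pos = end
    return words

# Toy dictionary with invented counts (an assumption, not data from the paper).
toy_counts = {"정보": 50, "검색": 40, "시스템": 30, "정보검색": 5}

exhaustive = segment("정보검색시스템", toy_counts)   # ['정보', '검색', '시스템']
greedy = longest_match("정보검색시스템", toy_counts)  # ['정보검색', '시스템']
```

With these toy counts the exhaustive search splits 정보검색시스템 ("information retrieval system") into three index terms, while the greedy baseline keeps 정보검색 as one unit because it always commits to the longest dictionary entry first. Dynamic programming keeps the search over all candidates quadratic in the compound's length rather than exponential.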
This article is indexed in SpringerLink and other databases.