On knowledge-poor methods for person name matching and lemmatization for highly inflectional languages
Authors: Jakub Piskorski, Karol Wieloch, Marcin Sydow
Institution: (1) Joint Research Centre of the European Commission, Via Fermi 2749, 21027 Ispra, Italy; (2) Poznań University of Economics, al. Niepodległości 10, 61-875 Poznań, Poland; (3) Web Mining Lab, Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland
Abstract: Web person search is one of the most common activities of Internet users. Recently, a vast amount of work has been reported on applying various NLP techniques to person name disambiguation in large web document collections, with the main focus on English and a few other major languages. This article reports on knowledge-poor methods for tackling person name matching and lemmatization in Polish, a highly inflectional language with a complex person name declension paradigm. These methods mainly apply well-established string distance metrics, some new variants thereof, automatically acquired simple suffix-based lemmatization patterns, and some combinations of these techniques. We also carried out some initial experiments on deploying techniques that utilize the context in which person names appear. Results of numerous experiments are presented. The evaluation, carried out on a data set extracted from a corpus of on-line news articles, revealed that achieving lemmatization accuracy greater than 90% seems to be difficult, whereas combining string distance metrics with suffix-based patterns results in 97.6–99% accuracy for the name matching task. Interestingly, no significant additional gain could be achieved by integrating some basic techniques that try to exploit the local context in which the names appear. Although our explorations focused on Polish, we believe that the work presented in this article constitutes practical guidelines for tackling the same problem in other highly inflectional languages with similar phenomena.
Contact Information: Marcin Sydow
Keywords: Person name matching; Highly inflectional languages; Lemmatization; String distance metrics
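
The combination of string distance metrics with suffix-based patterns mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the three hand-written suffix rules (the article describes automatically acquired patterns), and the choice of plain Levenshtein distance as the metric are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to a similarity score in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# Hypothetical, hand-written rules mapping inflected endings of Polish
# "-ski" surnames to the base form (illustrative only, far from complete).
SUFFIX_RULES = [("skiego", "ski"), ("skiemu", "ski"), ("skim", "ski")]


def apply_suffix_rules(token: str) -> str:
    """Rewrite the longest matching suffix; leave the token unchanged otherwise."""
    for suffix, replacement in sorted(SUFFIX_RULES, key=lambda r: -len(r[0])):
        if token.endswith(suffix):
            return token[: -len(suffix)] + replacement
    return token


def best_match(name: str, candidates: list[str]) -> str:
    """Normalize each token with the suffix rules, then pick the closest candidate."""
    normalized = " ".join(apply_suffix_rules(t) for t in name.lower().split())
    return max(candidates, key=lambda c: similarity(normalized, c.lower()))


if __name__ == "__main__":
    # "Jana Kowalskiego" is the genitive form of "Jan Kowalski".
    print(best_match("Jana Kowalskiego", ["Jan Kowalski", "Anna Nowak"]))
```

In this sketch the suffix rule recovers the surname's base form, while the string distance absorbs the remaining inflection of the first name that no rule covers.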
This article is indexed in SpringerLink and other databases.