On knowledge-poor methods for person name matching and lemmatization for highly inflectional languages |
| |
Authors: | Jakub Piskorski, Karol Wieloch, Marcin Sydow |
| |
Institution: | (1) Joint Research Centre of the European Commission, Via Fermi 2749, 21027 Ispra, Italy; (2) Poznań University of Economics, al. Niepodległości 10, 61-875 Poznań, Poland; (3) Web Mining Lab, Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland |
| |
Abstract: | Web person search is one of the most common activities of Internet users. Recently, a vast amount of work on applying various
NLP techniques for person name disambiguation in large web document collections has been reported, where the main focus was
on English and a few other major languages. This article reports on knowledge-poor methods for tackling person name matching
and lemmatization in Polish, a highly inflectional language with a complex person name declension paradigm. These methods apply
mainly well-established string distance metrics, some new variants thereof, automatically acquired simple suffix-based lemmatization
patterns and some combinations of the aforementioned techniques. Furthermore, we carried out some initial experiments
on deploying techniques that utilize the context in which person names appear. Results of numerous experiments are presented.
The evaluation carried out on a data set extracted from a corpus of on-line news articles revealed that achieving lemmatization
accuracy figures greater than 90% seems to be difficult, whereas combining string distance metrics with suffix-based patterns
results in 97.6–99% accuracy for the name matching task. Interestingly, no significant additional gain could be achieved through
integrating some basic techniques that try to exploit the local context in which the names appear. Although our explorations were
focused on Polish, we believe that the work presented in this article constitutes practical guidelines for tackling the same
problem for other highly inflectional languages with similar phenomena.
|
| |
Keywords: | Person name matching; Highly inflectional languages; Lemmatization; String distance metrics |
This article is indexed in SpringerLink and other databases. |
|