Character N-Gram Tokenization for European Language Text Retrieval

Authors: Paul McNamee, James Mayfield

Institution: Applied Physics Laboratory, Johns Hopkins University, 11100 Johns Hopkins Road, Laurel, MD 20723-6099, USA
Abstract: The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n = 4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.

Keywords:
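To make the abstract's central idea concrete, the sketch below generates overlapping character 4-grams from running text. This is a minimal illustration under stated assumptions, not the authors' implementation: the lowercasing and whitespace normalization are choices made here for clarity, and systems differ on whether grams may span word boundaries (this sketch allows it).

```python
def char_ngrams(text, n=4):
    # Lowercase and collapse runs of whitespace so grams are comparable
    # across documents; this normalization is an assumption, not part of
    # the paper's stated method.
    text = " ".join(text.lower().split())
    # Slide a window of length n over the string; grams may cross word
    # boundaries because the single joining space is part of the string.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("juggling"))
# ['jugg', 'uggl', 'ggli', 'glin', 'ling']
```

Indexing these grams in place of whole words is what makes the approach language-neutral: no stemmer, stopword list, or morphological analyzer is required, at the cost of more postings per document.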