Character n-gram application for automatic new topic identification
Authors: Burcu Caglar Gencosman, Huseyin C. Ozmutlu, Seda Ozmutlu
Affiliation:Uludag University, Industrial Engineering, Endustri Muh. Bolumu 3. kat Y306, Gorukle Kampusu, 16059 Bursa, Turkey
Abstract: The widespread availability of the Internet and the variety of Internet-based applications have resulted in a significant increase in the number of web pages. Determining the behavior of search engine users has become a critical step in enhancing search engine performance. User behavior can be determined by content-based or content-ignorant algorithms. Although many content-ignorant studies have been performed to automatically identify new topics, previous results have demonstrated that spelling errors can cause significant errors in topic shift estimates. In this study, we focused on minimizing the number of wrong estimates caused by spelling errors. We developed a new hybrid algorithm combining character n-gram and neural network methodologies, and compared the experimental results with results from previous studies. For the FAST and Excite datasets, the proposed algorithm improved topic shift estimates by 6.987% and 2.639%, respectively. Moreover, we analyzed the performance of the character n-gram method from several aspects, including a comparison with the Levenshtein edit-distance method. The experimental results demonstrated that the character n-gram method outperformed the Levenshtein edit-distance method in terms of topic identification.
Keywords:Content-ignorant algorithms   The character n-gram method   New topic identification   The Levenshtein edit-distance   Pre-processed spelling correction methods
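
The abstract contrasts character n-gram matching with Levenshtein edit distance as ways to keep a misspelled query from being read as a topic shift. The following is a minimal Python sketch of both measures, not the authors' hybrid algorithm: the neural network component is omitted, and the function names, the bigram size (n=2), the Dice coefficient as the overlap measure, and the sample queries are all assumptions made for illustration.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased query.

    Runs of whitespace are collapsed to a single space first.
    n=2 (bigrams) is an illustrative choice, not taken from the paper.
    """
    s = " ".join(text.lower().split())
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice_similarity(a, b, n=2):
    """Dice coefficient over character n-gram sets, in the range [0, 1].

    The Dice coefficient is an assumed overlap measure for this sketch.
    """
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0  # two empty queries are treated as identical
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical consecutive queries from one search session.
previous_query = "britney spears"
misspelled_query = "brittany spears"
unrelated_query = "cheap flights"

same_topic_score = dice_similarity(previous_query, misspelled_query)
new_topic_score = dice_similarity(previous_query, unrelated_query)
edit_distance = levenshtein(previous_query, misspelled_query)
```

In this sketch a misspelled repeat of the previous query shares most of its bigrams with that query, while an unrelated query shares almost none, so a similarity threshold (a hypothetical tuning parameter) could separate the two cases before any topic shift estimate is made.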
This article is indexed in ScienceDirect and other databases.