Lightweight natural language text compression |
| |
Authors: | Nieves R Brisaboa Antonio Fariña Gonzalo Navarro José R Paramá |
| |
Institution: | 1.Database Lab., Univ. da Coru?a, Facultade de Informática,A Coru?a,Spain;2.Center for Web Research, Dept. of Computer Science,Univ. de Chile, Blanco Encalada,Santiago,Chile |
| |
Abstract: | Variants of Huffman codes where words are taken as the source symbols are currently the most attractive choices to compress
natural language text databases. In particular, Tagged Huffman Code by Moura et al. offers fast direct searching on the compressed
text and random access capabilities, in exchange for producing around 11% larger compressed files. This work describes End-Tagged
Dense Code and (s, c)-Dense Code, two new semistatic statistical methods for compressing natural language texts. These techniques permit simpler
and faster encoding and obtain better compression ratios than Tagged Huffman Code, while maintaining its fast direct search
and random access capabilities. We show that Dense Codes improve Tagged Huffman Code compression ratio by about 10%, reaching
only 0.6% overhead over the optimal Huffman compression ratio. Being simpler, Dense Codes are generated 45% to 60% faster
than Huffman codes. This makes Dense Codes a very attractive alternative to Huffman code variants for various reasons: they
are simpler to program, faster to build, of almost optimal size, and as fast and easy to search as the best Huffman variants,
which are not so close to the optimal size.
|
| |
Keywords: | Text databases Natural language text compression Searching compressed text |
本文献已被 SpringerLink 等数据库收录! |
|