Compressor Contest
August 15, 2013
If you want to squish text, here’s a useful resource. Blogger and tech strategist Matt Mahoney hosts a contest that puts lossless data compression programs to the test. Using a particular text dump, the English version of Wikipedia from March 3, 2006, he examines the compressed size of the data’s first billion bytes. He explains the reason for the initiative:
“The goal of this benchmark is not to find the best overall compression program, but to encourage research in artificial intelligence and natural language processing (NLP). A fundamental problem in both NLP and text compression is modeling: the ability to distinguish between high probability strings like recognize speech and low probability strings like reckon eyes peach. . . .
“Compressors are ranked by the compressed size of enwik9 (10^9 bytes) plus the size of a zip archive containing the decompresser. Options are selected for maximum compression at the cost of speed and memory. Other data in the table does not affect rankings. This benchmark is for informational purposes only. There is no prize money for a top ranking.”
Still, bragging rights alone should make a top ranking worth the effort. See the write-up for all the technical details, including a detailed rundown of each compressor.
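For the curious, the scoring rule is simple arithmetic, and a rough sketch in Python shows how it could be computed: the compressed size of enwik9 plus the size of a zip archive containing the decompresser. The file names here are hypothetical placeholders, not the benchmark’s actual submission layout.

import os
import zipfile

# Hypothetical file names used purely for illustration.
COMPRESSED_OUTPUT = "enwik9.compressed"   # the compressor's output for enwik9
DECOMPRESSOR_FILES = ["decompresser.cpp"] # whatever is needed to decompress

def benchmark_score(compressed_output, decompressor_files, archive="decompresser.zip"):
    """Compressed size of enwik9 plus the size of a zip archive
    containing the decompresser, per the benchmark's ranking rule."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in decompressor_files:
            zf.write(path)
    return os.path.getsize(compressed_output) + os.path.getsize(archive)

if __name__ == "__main__":
    print(benchmark_score(COMPRESSED_OUTPUT, DECOMPRESSOR_FILES), "bytes")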
Cynthia Murrell, August 15, 2013