Compressor Contest
August 15, 2013
If you want to squish text, here’s a useful resource. Blogger and tech strategist Matt Mahoney hosts a contest that puts lossless data compression programs to the test. Using a particular text dump, the English version of Wikipedia from March 3, 2006, he examines the compressed size of the data’s first billion bytes. He explains the reason for the initiative:
“The goal of this benchmark is not to find the best overall compression program, but to encourage research in artificial intelligence and natural language processing (NLP). A fundamental problem in both NLP and text compression is modeling: the ability to distinguish between high probability strings like recognize speech and low probability strings like reckon eyes peach. . . .
“Compressors are ranked by the compressed size of enwik9 (10^9 bytes) plus the size of a zip archive containing the decompresser. Options are selected for maximum compression at the cost of speed and memory. Other data in the table does not affect rankings. This benchmark is for informational purposes only. There is no prize money for a top ranking.”
Still, bragging rights alone should make a top ranking worth the effort. See the write-up for all the technical details, including a detailed rundown of each compressor.
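For the curious, the scoring rule is simple arithmetic, and a rough sketch in Python shows how it could be computed: the compressed size of enwik9 plus the size of a zip archive containing the decompresser. The file names here are hypothetical placeholders, not the benchmark’s actual submission layout.

import os
import zipfile

# Hypothetical file names used purely for illustration.
COMPRESSED_OUTPUT = "enwik9.compressed"   # the compressor's output for enwik9
DECOMPRESSOR_FILES = ["decompresser.cpp"] # whatever is needed to decompress

def benchmark_score(compressed_output, decompressor_files, archive="decompresser.zip"):
    """Compressed size of enwik9 plus the size of a zip archive
    containing the decompresser, per the benchmark's ranking rule."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in decompressor_files:
            zf.write(path)
    return os.path.getsize(compressed_output) + os.path.getsize(archive)

if __name__ == "__main__":
    print(benchmark_score(COMPRESSED_OUTPUT, DECOMPRESSOR_FILES), "bytes")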
Cynthia Murrell, August 15, 2013