Human Knowledge Compression Contest
Frequently Asked Questions & Answers
Compress the 100MB file enwik8 to less than the current record of about 16MB


Frequently Asked Questions

What is this contest about?

The contest is about compressing human world knowledge as well as possible. A prize of initially 50'000€ is attached to the contest. If your compressor compresses the 100MB file enwik8 x% better than the current record, you'll receive x% of the prize. The contest is motivated by the fact that compression ratios can be regarded as intelligence measures. See http://prize.hutter1.net/ for details.
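
For concreteness, here is a minimal sketch of that payout rule in Python (the function and variable names are mine; the rule itself is just "x% improvement pays x% of the prize fund"):

    def payout(prize_fund_eur: float, record_bytes: int, new_bytes: int) -> float:
        """Return the award for beating the current record (0 if no improvement)."""
        improvement = (record_bytes - new_bytes) / record_bytes  # relative improvement
        return max(0.0, improvement) * prize_fund_eur

    # Example with round numbers: a 3% improvement on a 16MB record.
    print(payout(50_000, 16_000_000, 15_520_000))  # -> 1500.0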

What is (artificial) intelligence?

Intelligence has many faces, like creativity, solving problems, pattern recognition, classification, learning, induction, deduction, building analogies, optimization, surviving in an environment, language processing, knowledge, and many more. A formal definition incorporating all, or at least most, aspects of intelligence is difficult but not impossible. Informally, intelligence is an agent's ability to achieve goals in a wide range of environments. The key concepts for a formal definition are compression and utility maximization; the other aspects are emergent phenomena.

What does compression have to do with (artificial) intelligence?

One can prove that the better you can compress, the better you can predict; and being able to predict [the environment] well is key for being able to act well. Consider the sequence of 1000 digits "14159...[990 more digits]...01989". If it looks random to you, you can neither compress it nor can you predict the 1001st digit. If you realize that they are the first 1000 digits of pi, you can compress the sequence and predict the next digit. While the program computing the digits of pi is an example of a one-part self-extracting archive, the Minimum Description Length (MDL) principle is a two-part coding scheme akin to a (parameterized) decompressor plus a compressed archive. If M is a probabilistic model of the data x, then the data can be compressed (e.g. via arithmetic coding) to length log(1/P(x|M)). The decompressor must know M, hence has length L(M). One can show that the model M that minimizes the total length L(M)+log(1/P(x|M)) leads to the best predictions of future data. For instance, the quality of natural language models is typically judged by their perplexity, which is essentially a compression ratio. Finally, sequential decision theory tells you how to exploit such models M for optimal rational actions.
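
To make the two-part code length concrete, here is a toy Python sketch that charges L(M) + log2(1/P(x|M)) for an order-0 byte model; the way the model cost is counted (a fixed number of bits per free parameter) is an illustrative assumption, not part of the contest rules:

    import math

    def two_part_code_length(data: bytes, bits_per_parameter: float = 8.0) -> float:
        counts = {b: data.count(b) for b in set(data)}
        n = len(data)
        # log2(1/P(x|M)) for the maximum-likelihood order-0 model M
        data_bits = -sum(c * math.log2(c / n) for c in counts.values())
        # L(M): one frequency parameter per distinct symbol (a crude stand-in)
        model_bits = bits_per_parameter * len(counts)
        return model_bits + data_bits

    text = b"the better you can compress, the better you can predict"
    print(round(two_part_code_length(text)), "bits vs", 8 * len(text), "bits raw")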

The contest encourages developing special purpose compressors tailored towards enwik8, rather than general purpose compressors.

The top compressors may be somewhat tuned to enwik8, but the major progress will not come from coding e.g. thousands of grammatical and semantic rules to compress frequent phrases well. A compressor that learns these rules from the data itself will lead to better compression. Such a compressor will compress other text corpora well too.

Why lossless compression?

Doesn't a lossless compressor waste resources on coding random noise and (spelling) errors in enwik8, in contrast to lossy compressors like the human brain?   ---   Not really, for the following reasons:
  1. The test data is very clean. Misspelled words and grammatical errors are very rare. This is one reason why we chose this data set. See http://mattmahoney.net/dc/textdata.html for an analysis of the data.
  2. Even if the corpus contained a lot of noise, lossless compression would still be the right way to go. One can show that among the shortest codes for a noisy data corpus, there is a two-part code of length l(A)+l(B), where A contains all "useful" information and B contains all noise. The theory is called "algorithmic statistics"; in practice, two-part MDL is used: a probabilistic model M plus the log-likelihood of the data under the model, CodeLength(x) = L(M) - log P(x|M). Or briefly: noise does not at all harm the strong relation between compression and understanding/intelligence/predictability/etc.
  3. It is not always clear what counts as error/noise and what as useful data. Consider e.g. keyboard-layout, hence country-related, typos like y versus z. There is probably a correlation between the type of Wikipedia entry and the type of typo. A compressor finding such a correlation will compress better. Figuring out this correlation can lead to useful insight. Simply correcting the errors would miss this piece of human "culture". Hence it is not clear how to set up a fair lossy compression contest.
  4. A compressor can always correct the errors or remove the noise internally: the decompressor first reconstructs the error-corrected enwik8, and then a table of errors is used to reintroduce them. In this way any lossy compressor can be converted to a lossless compressor (see the sketch after this list).
  5. All compressors have to deal with the same noise and errors. The extra information needed to encode the exact text is small, and is the same for all contestants. In this sense lossless compression is not harder than lossy compression.
  6. Human brains are probably lossy compressors, but this does not invalidate the strong relation between lossless compression and AI. The competition is not about imitating or emulating a human brain. It is about creating rational intelligences. Suitably storing what today looks like noise and errors does no harm and allows future reconsideration.
  7. Lossless compression implies AI, but lossy compression doesn't.
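
As a minimal illustration of point 4, the following Python sketch wraps a hypothetical lossy "cleaning" step into a lossless scheme by storing the cleaned text together with a table of edits that restores the exact original; zlib, pickle, and difflib merely stand in for a real (de)compressor:

    import zlib, pickle, difflib

    def lossless_pack(original: bytes, cleaned: bytes) -> bytes:
        sm = difflib.SequenceMatcher(a=cleaned, b=original, autojunk=False)
        # Keep only the edits needed to turn `cleaned` back into `original`.
        edits = [(tag, i1, i2, original[j1:j2])
                 for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]
        return zlib.compress(pickle.dumps((cleaned, edits)))

    def lossless_unpack(archive: bytes) -> bytes:
        cleaned, edits = pickle.loads(zlib.decompress(archive))
        out, pos = bytearray(), 0
        for tag, i1, i2, repl in edits:
            out += cleaned[pos:i1] + repl      # copy untouched part, apply edit
            pos = i2
        return bytes(out + cleaned[pos:])

    original = b"Teh quick brown fox jumps over the lazy dog."
    cleaned  = b"The quick brown fox jumps over the lazy dog."   # "error-corrected"
    assert lossless_unpack(lossless_pack(original, cleaned)) == original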

Why don't you allow using some fixed default background knowledge database?

Because Wikipedia contains enough human knowledge in the text itself: if advantageous, some advanced statistical and/or other smart but relatively short algorithms could extract and create such a (probably more structured) "database of knowledge" from Wikipedia itself. If this turns out to be impossible, it would indicate that enwik8 is not a representative sample of human knowledge, and enwik9 or other sources should be used.

Why is "understanding" of the text or "intelligence" needed to achieve maximal compression?

Consider the following little excerpt from a Wikipedia page, where 12 words have been replaced by the placeholders [1]-[12].
In [1], HTML and XML documents, the logical constructs known as character data and [2] [3] consist of sequences of characters, in which each [4] can manifest directly (representing itself), or can be [5] by a series of characters called a [6] reference, of which there are [7] types: a numeric character reference and a character entity [8]. This article lists the [9] entity [10] that are valid in [11] and [12] documents.
If you do not understand any English, there is little hope that you can fill in the missing words. You may assign a high probability that the missing words equal some of the other words, and a small probability to all other strings of letters. Working on a string pattern matching level you may conjecture [11]=HTML and [12]=XML, since they precede the word "documents", and similarly [9]=character preceding "entity". If you do understand English you would probably further easily guess that [4]=character, [5]=represented, and [7]=two, where [4] requires understanding that characters are "in" sequences, [5] needs understanding of conjugation, and [7] needs understanding the concept of counting. To guess [8]=reference probably needs some understanding of the contents of the article, which then easily implies [10]=references. As a Markup Language expert, you "know" that [2]=attribute, [3]=values, and [6]=character, and you may conjecture [1]=SGML. So clearly, the more you understand, the more you can delete from the text without loss. If a program reaches the reconstruction capabilities of a human, it must be assigned equal understanding. (Sure, many will continue to define intelligence as what a machine can't do, but that will make them ultimately non-intelligent). (Y cn ccmplsh lt wtht ndrstndng nd ntllgnc, bt mr wth).
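
As a toy illustration of the "string pattern matching level" mentioned above, the following Python sketch guesses a blank from the word that most often precedes the same following word elsewhere in the text; the passage used is an abbreviated version of the excerpt with the reconstructions from this answer filled in:

    from collections import Counter
    import re

    def guess_blank(text, following_word):
        """Return the word that most often precedes `following_word`, or None."""
        words = re.findall(r"[A-Za-z]+", text.lower())
        preceding = Counter(w1 for w1, w2 in zip(words, words[1:])
                            if w2 == following_word.lower())
        return preceding.most_common(1)[0][0] if preceding else None

    passage = ("In SGML, HTML and XML documents, character data and attribute "
               "values consist of sequences of characters. This article lists "
               "the character entity references that are valid in HTML and XML "
               "documents.")
    print(guess_blank(passage, "documents"))  # -> 'xml', matching conjecture [12]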

Why do you focus on text?

Visual (and audio and tactile) knowledge precedes linguistic knowledge and seems crucial to ground symbolic knowledge.

While the first part is true, the second is questionable. We decided not to include photo, video, and audio material in the file for the following reasons: Natural text, in particular novels, contains ample descriptions of 3-dimensional scenes (the white house right of the beautiful tree under which ...). It is possible to extract the meaning of nouns and (directional) prepositions and all other words from a sufficiently large corpus of text in an unknown language. Meaning is codified by relating words to other words (house-artificial, house-residence, house-right-tree), so a (spatial) representation of the world can be established without ever having seen a single picture. This is akin to being able to "visualize" abstract mathematics, given enough exposure and skills. So it is plausible that any knowledge that can be demonstrated over a text-only channel, as in the Turing test, can also be learned over a text-only channel, as evidenced e.g. by the blind and deaf writer Helen Keller. Given that higher (conscious) cognition is essentially symbolic, and a symbolic representation and understanding can be extracted from text alone, it is plausible that textual information is sufficient for the purpose of this contest. Inclusion of photo, video, and audio material would require the contestants to deal with and improve upon not only state-of-the-art textual but also many kinds of non-textual compressors, and to deal with terabytes of video. Most of the compression would be about modeling physics rather than cognition. This would put an enormous, unnecessary extra burden upon the contestants, and, given the arguments above, without real benefit.

What is the ultimate compression of enwik8?

The Kolmogorov complexity K(x) of a string x is defined as the length of the shortest program (self-extracting archive) computing x. K(x) itself cannot be computed, only approximated from above, namely by finding better and better compressions; but even if we reach K(x), we will never know whether we have done so. For a text string like enwik8, Shannon's estimate of about 1 bit per character suggests that enwik8 should be compressible down to about 12MB.
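
The back-of-the-envelope arithmetic behind the 12MB figure, assuming Shannon's estimate of roughly 1 bit of information per character (his range is about 0.6-1.3):

    chars = 10**8                 # enwik8 is 10^8 bytes, i.e. ~10^8 characters
    bits_per_char = 1.0           # Shannon's estimate of English text entropy
    compressed_bytes = chars * bits_per_char / 8
    print(compressed_bytes / 2**20)  # ~11.9 "MB" (or 12.5MB if MB means 10^6 bytes)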

Can you prove the claims you make in the answers to the FAQ above?

Most of the assertions above can be formalized and proven, but some mathematical background is necessary. Good science books to start with are M. Li & P. M. B. Vitanyi, Kolmogorov Complexity, 1997, and M. Hutter, Universal Artificial Intelligence, 2004.

The PAQ8 compressors are already so good that it will be difficult to beat them.

Yes, it will not be easy (nobody said it would be), but there is likely a lot of room for improvement. PAQ8 models text only. There has been lots of research in language modelling (mostly for speech recognition) at the syntactic and semantic levels, but these are usually offline models on preprocessed text where words are mapped to tokens, and messy details like capitalization, punctuation, formatting, and rare words are removed. So far, nobody has figured out how to integrate these two approaches, but when that happens we will have a very powerful text compressor. Also, from Shannon's estimate that human text contains about 1 bit of information per character, enwik8 should be compressible down to about 12MB.

There are lots of non-human language pieces in the file.

Enwik8 is about 75% clean text and 25% "artificial" data of various forms (tables, hypertext links, XML, etc). As long as there is some natural language text (and enwik8 contains 75%), compressing it is a hard problem in AI, regardless of whatever else is present in the file. The "artificial" data is also human-generated and belongs to the human knowledge base, so it is not unreasonable to ask to model that too. Ideas of filtering out the "artificial" data or using other, apparently cleaner text sources did not prove superior.

Including the decompressor size encourages obfuscation

Doesn't adding the decompressor length to the length of the archive encourage developing unreadable short decompressors, rather than smart compressors based on understanding the text corpus enwik8?

The typical size of current decompressors is less than 100KB. So obscuring them by making them shorter will give you at most a 0.1% = 100KB/100MB advantage (not enough to be eligible for the prize). On the other hand, for fairness it is necessary to include the size of the decompressor. Take a compressor like PAQ8H that contains 800KB of tokens: clearly it can achieve better compression than one starting from scratch. If you're not convinced by this argument, consider an extreme "decompressor" of size 100MB that simply outputs enwik8 byte by byte (from a zero-byte archive), thus achieving a compressed size of 0.

Why do you use the 100MB enwik8 and not 1GB or all of Wikipedia?

Indeed, there are good arguments for the 1GB corpus. You can find them at http://mattmahoney.net/dc/rationale.html. Nevertheless we decided to start initially with the 100MB file enwik8 rather than the 1GB file enwik9 for the following reasons:
  1. Although enwik9 is not a hardship in the long run, it is a hardship now. The contest will have the highest participation when the prize is initiated, so the file size should be appropriate for moderate-cost home computers.
  2. 100MB already contains a lot of human knowledge.
  3. Typical decompressors are significantly less than 1% of 100MB, so there is no problem here.
  4. Enwik8 could still be downloaded over an analog modem.
  5. It's easy to upgrade to enwik9 later.
  6. Many standard compressors crash on enwik9 or start thrashing.
  7. It allows non-streaming offline (de)compressors that need the whole file in RAM in a straightforward way. Enwik9 would need to be processed block-wise unless you have a couple of GB of RAM available.
  8. Maybe the most compelling reason: We want to encourage smart compressors. Smart compressors can easily take ridiculous time even on small corpora. If we went for enwik9, we would lose some of the smarter ideas for (de)compressors.

Why are you limiting validation to less than 10 hours on systems with less than 1GB RAM?

Some of the best minds in the world are limited by the computers they can afford. Typical PCs and laptops bought in 2006 are equipped with 1GB RAM or less, let alone older systems or those in developing countries. One of our most successful contestants wrote us: "I still don't have access to a PC with 1 Gb or more, have found only 512 Mb computer, but it's about 7 or 8 km away from my home (~20 minutes on the bicycle I use), thus I usually test on a 5 Mb stub of enwik8, with -5, using only ~210 Mb". Upgrading to 2GB for about 100€ sounds cheap (for many but not all) compared to the involved prize money and invested effort. But let's assume 1000 potential participants (sure, fewer will succeed), and, say, half of them first need to upgrade their machines; then overall 500×100€ = 50'000€ is invested in hardware, i.e. the prize supports the memory chip industry, and overall the participants suffer a loss, which is definitely not the aim of the prize. So 1GB RAM for the whole system (800MB working RAM) was chosen as a widely-affordable system. Also the overall ratio of 1/10 : 1 : 10 GB of enwik8 : RAM : HD seems balanced.
So far no aspirant has had problems with the time constraint. The future, smarter compressors we are aiming at will probably need a lot more time to "meditate" about enwik8 in order to compress it well. For this and other reasons it is allowed to send us the already compressed archive instead of the compressor. Only the decompressor needs reasonable speed. Since decompression is much easier than compression, the time bound is hopefully not too severe. If it is, let us know. Instead of a hard time bound one could think of penalizing slow decompressors. In theory one should penalize a decompressor by one bit for every factor of two it is slower than its competitor (keyword: Levin complexity). In practice one would need a much stronger penalty.
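
A sketch of such a Levin-style penalty in Python; the bits_per_doubling knob is my addition, to model the stronger penalty that would be needed in practice:

    import math

    def penalized_size_bits(size_bytes: int, time_s: float, ref_time_s: float,
                            bits_per_doubling: float = 1.0) -> float:
        """Compressed size in bits, plus a penalty per doubling of decompression time."""
        penalty = bits_per_doubling * max(0.0, math.log2(time_s / ref_time_s))
        return 8 * size_bytes + penalty

    # A decompressor 8x slower than the reference pays only 3 extra bits at the
    # theoretical rate, which is why a stronger practical penalty is suggested.
    print(penalized_size_bits(16_000_000, 8.0, 1.0) - 8 * 16_000_000)  # -> 3.0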

The total prize is not exactly 50'000€

Right! The prize formula is a compromise between many considerations (simplicity, fairness, risk, and others). Since it is principally impossible to know what the ultimate compression of enwik8 will be, a prize formula leading to an exact total payout of 50'000€ is impossible. So no formula allows one to tell in advance what total amount will be awarded. Theoretically you can win more than 50'000€. For instance, if you halve the current record, then halve your own record, and then halve it again, you would receive 3×25'000€. Realistically (based on past experience) we expect only a 3% improvement per year for the next 10 years, but you may be the one having a breakthrough of 30% improvement tomorrow. In theory, the current formula also allows optimizing your profit by sending e.g. three 3% improvements in monthly intervals, rather than one 9% improvement. You would gain a scanty 136€, at the risk that someone sneaks in. (An alternative prize formula Z×ln(L/S) would prevent this gain.) The total payout will be roughly 50'000€ if enwik8 can be compressed down to 7MB (in 3% steps), about the lower bound of Shannon's estimate of 0.6 bit per character.
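
For illustration, here is a small Python sketch of the two payout rules discussed here, assuming the linear rule "x% improvement over the current record pays x% of the fund Z" and the alternative Z×ln(L/S); it checks the repeated-halving example and the split-invariance of the logarithmic rule:

    import math

    Z = 50_000  # prize fund in EUR

    # Halving the record pays Z*(L - L/2)/L = Z/2 = 25'000 EUR each time, so three
    # successive halvings pay 3 x 25'000 EUR in total, as stated above.
    L = 16_000_000
    print(sum(Z * (L / 2**i - L / 2**(i + 1)) / (L / 2**i) for i in range(3)))  # 75000.0

    # The logarithmic rule Z*ln(L/S) is split-proof: two staggered submissions pay
    # exactly the same as one combined submission, so splitting gains nothing.
    L, M, S = 16_000_000, 15_000_000, 14_000_000
    print(math.isclose(Z * math.log(L / M) + Z * math.log(M / S),
                       Z * math.log(L / S)))  # True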

Is 100MB 100×2^20 byte or 10^8 byte?

Most computer scientists use the prefix K(ilo) for 2^10, M(ega) for 2^20, and G(iga) for 2^30, while the International System of Units (SI, 7th ed., 1998, p.103) defines k(ilo), M(ega), G(iga) as 10^3, 10^6, 10^9, respectively, and explicitly forbids misusing them for powers of 2. So the answer depends on whether you are a geek or a pedant. A good solution is to regard K/M/G as standing for "roughly" 10^3 or 2^10, etc., and to avoid using them if more precision is needed, as is done on the contest page. Enwik8 is 10^8 bytes. See MAA Online, "Counting by Twos" (1999), for other solutions and references. In any case this is tangential to the contest.

Why do you require Windows or Linux executables?

There is a myriad of options we could allow for the contest. The related Large Text Compression Benchmark for instance is very flexible in this respect. This is convenient for contributors, but makes a fair comparison more difficult and fraud easier. Ideal would be one platform (e.g. Turing's Universal Machine). In real life, Windows and Linux executables cover most needs. Most programs are written in C(++), which neatly compiles to self-contained reasonably-sized executables.

How can I produce self-contained or smaller decompressors?

Is Artificial General Intelligence possible?

There have been many arguments for and against the possibility of creating Artificial General Intelligence. If AGI is not possible, what's the relevance of this contest?

The primary argument that AGI is possible is the computational hypothesis (supported by physical evidence) that the universe, including organic matter and human brains, is computable. Estimates of the complexity of a human brain look promising too, although they are more controversial. The second argument is the steady progress (hardware and software) in the last century in solving increasingly complex AI problems. Third, all arguments against AGI so far (at least those worth refuting) have been refuted; most are not even plausible in the first place. The primary reason for their perseverance seems to be fear of change and lack of imagination. Scientific arguments include no-free-lunch theorems, alleged experimental evidence against Ockham's razor, the Lucas/Penrose Gödel argument, and some others. Philosophical arguments that a machine cannot have free will, imagination, emotions, a soul, cannot be conscious, creative, etc. have been put forward. Actually, one can find claims that 'a machine will never be able to do X' for literally hundreds of X. Further, there are all kinds of religious arguments. Finally, even if the great AGI goal is denied, the close relation between compression, prediction, and intelligence is undeniable, and hence this contest will in any case advance the field of text understanding and AI.

Is Ockham's razor and hence compression sufficient for AI?

Ockham's razor principle is well supported theoretically and experimentally, and there is no other similarly general and powerful principle which could replace or augment it. Ockham's razor has proven invaluable for understanding our world. Indeed, it seems to be not only a necessary but also a sufficient founding principle of science. Until other necessary or sufficient principles are found, it is prudent to accept Ockham's razor as the foundation of inductive reasoning. So far, all attempts to discredit the universal role of Ockham's razor have failed. Such arguments include no-free-lunch theorems and some questionable experimental evidence.

I have other questions or am not satisfied with the answer

Matt Mahoney's page on the rationale of the contest may help you. It explains why compression is equivalent to AI, why text compression is AI-Hard, what's the problem with other evaluation measures, how much compression can be achieved, why Wikipedia, why only English, why only text and not a variety of data types, why include the decompressor size, why not rank by speed and memory, what has been achieved so far, and others. If all this doesn't help, read or post your questions at the H-Prize newsgroup.
