Compressing The Library Of Babel!
Presented by David Turner
!!Con 2018
The Library Of Babel
by Jorge Luis Borges
The universe (which others call the Library) is
composed of an indefinite, perhaps infinite number of
hexagonal galleries....
*
- Breakfast of Champions, by Kurt Vonnegut
- The Instructions, by Adam Levin
- House of Leaves, by Mark Z Danielewsky
- The Last Samurai, by Helen DeWitt
All possible books*
- 410 pages
- 40 lines per page
- 80 characters
- 22 letters (+ period, comma, and space)
101834097 books
You're gonna need a bigger disk
gzip?
- Typical compression ratio: 2:1
- Compression ratio here: 1:1
Compression
- On 'aaaaaaaaaaaaaaaaaaaaa....': 1000:1
- On 'tphr nfjlojalxptb pqxznvntzcsluyjrgqc....': 1:1.03
Pigeonhole Principle
If you have 101 pigeons, and 100 holes, two pigeons
will share a hole
Kolmogorov Complexity
or "descriptive complexity"
def generate(book, remaining):
if remaining == 0:
print "".join(book)
else:
for char in "":
book.append(char)
generate(book, remaining - 1)
book.pop()
generate([], 410 * 40 * 80)
Kolmogorov Complexity
- Size of compressed data +
- Size of decompressor...
- ... in some programming language
But wait!
Some pages are missing!
Representing missing data
The pigeonhole principle strikes again
Missing pages
[Complete contents of book]
[Page number of missing page]
Dense
Page bitmap
Compressed page bitmap
Just list what remains
The Real Library of Babel
(the good parts version)
Some compression ratios for Wikipedia
from "Large Text Compression Benchmark", by Matt Mahoney
- Gzip: 3.1:1
- Bzip2: 4:1
- phda9: 8.5:1
Marcus Hutter's Hypothesis
- Wikipedia makes sense
- To compress is to predict
- Prediction requires intelligence
- Compression is intelligence
The Hutter Prize
- Compress 100mb of Wikipedia better
- Earn a fraction of 50,000€
- Make compression smarter?
In truth, the Library includes all verbal structures, all variations permitted by the twenty-five orthographical symbols, but not a single example of absolute nonsense. It is useless to observe that the best volume of the many hexagons under my administration is entitled The Combed Thunderclap and another The Plaster Cramp and another Axaxaxas mlö. These phrases, at first glance incoherent, can no doubt be justified in a cryptographical or allegorical manner; such a justification is verbal and, ex hypothesi, already figures in the Library.