Compressing The Library Of Babel!

Presented by David Turner

!!Con 2018

The Library Of Babel

by Jorge Luis Borges

The universe (which others call the Library) is composed of an indefinite, perhaps infinite number of hexagonal galleries....

All possible books*

*

  • Breakfast of Champions, by Kurt Vonnegut
  • The Instructions, by Adam Levin
  • House of Leaves, by Mark Z Danielewsky
  • The Last Samurai, by Helen DeWitt

All possible books*

  • 410 pages
  • 40 lines per page
  • 80 characters
  • 22 letters (+ period, comma, and space)

101834097 books

You're gonna need a bigger disk

gzip?

  • Typical compression ratio: 2:1
  • Compression ratio here: 1:1

How gzip works

Compression

  • On 'aaaaaaaaaaaaaaaaaaaaa....': 1000:1
  • On 'tphr nfjlojalxptb pqxznvntzcsluyjrgqc....': 1:1.03

Pigeonhole Principle

If you have 101 pigeons, and 100 holes, two pigeons will share a hole

Kolmogorov Complexity

or "descriptive complexity"


def generate(book, remaining):
    if remaining == 0:
         print "".join(book)
    else:
        for char in "":
            book.append(char)
            generate(book, remaining - 1)
            book.pop()

generate([], 410 * 40 * 80)
                

Kolmogorov Complexity

  • Size of compressed data +
  • Size of decompressor...
  • ... in some programming language

But wait!

Some pages are missing!

Representing missing data

The pigeonhole principle strikes again

Missing pages

[Complete contents of book]

[Page number of missing page]

Dense

Page bitmap

Compressed page bitmap

Just list what remains

The Real Library of Babel

(the good parts version)

Wikipedia

Compressing Wikipedia

Some compression ratios for Wikipedia

from "Large Text Compression Benchmark", by Matt Mahoney
  • Gzip: 3.1:1
  • Bzip2: 4:1
  • phda9: 8.5:1

Marcus Hutter's Hypothesis

  • Wikipedia makes sense
  • To compress is to predict
  • Prediction requires intelligence
  • Compression is intelligence

The Hutter Prize

  • Compress 100mb of Wikipedia better
  • Earn a fraction of 50,000€
  • Make compression smarter?

Borges's Objection

In truth, the Library includes all verbal structures, all variations permitted by the twenty-five orthographical symbols, but not a single example of absolute nonsense. It is useless to observe that the best volume of the many hexagons under my administration is entitled The Combed Thunderclap and another The Plaster Cramp and another Axaxaxas mlö. These phrases, at first glance incoherent, can no doubt be justified in a cryptographical or allegorical manner; such a justification is verbal and, ex hypothesi, already figures in the Library.