Fun with Google n-grams data (part 1)

As part of the Google Books digitization project, Google created a dataset of n-gram frequencies. The Google Books Ngram Viewer is their presentation of this data, but Google also provides the raw data for anyone to download.

Recently, I’ve been playing with some of the data. In the past, I’ve played with randomly generating text with a Markov chain. You have some source text corpus, and you generate random text by picking the next word according to the probability distribution of some source text, conditioned on the previous $k$ words.

One of the available n-gram data sets contains the million most frequently English words, and I thought it’d be fun to try to run some kind of Markov-chain algorithm using this as the source corpus.

Since there’s a lot of data (the 2-gram dataset is 80 GB uncompressed, and the 3-gram dataset is 280 GB uncompressed), we want to have a compact datastructure to hold it. The hope is that with enough magic, the needed data can fit into 16 GB of RAM without swapping. Tries seem like a good fit for the problem, since all of the prefixes are shared between n-grams. In a trie, each node stores a map of keys to child nodes, but the keys for each node are unstored: the key for a node is given by the path from the root. Since we want to do some things with probabilities, each node has a frequency count, as well as a count of the sum of the frequencies of all child nodes.

But there’s a lot of data, so maybe we need to think about the way to implement this in a space-efficient way? I don’t care about portability to non-x64 systems (for this program), so we can do some bit-twiddling: the x64 address space only uses the lower 48 bits of the pointer to address memory, while the upper 16 bits are either all $1$ or all $0$.

This means, for instance, that you can have an implementation of (sufficiently small) vectors where the bookkeeping information is stored in the array pointer itself. To save memory, the node children are saved in a tagged-pointer data structure which is a pointer (tagged with information about array size) to an array of tagged pointers (tagged with the character used as a key for each node). The array is kept sorted, so we can do lookups fairly quickly with a binary search, and the whole thing takes only 8 bytes for the pointer to the array, plus 8 bytes times the array size (rounded up to the closest power of 2). Rounding to the closest power of 2 ends up wasting very little space, since the distribution of the number of children seems to follow a power law anyways.

The next problem is that, having compacted the size of all of the data we need to store, we’re left with having to do a ton of small allocations of memory blocks that are nearly pointer-sized. Not only is this inefficient in terms of speed, it’s also space-inefficient: the memory allocator has to keep track of what it’s allocated, which takes a minimum of 8 bytes. This is fairly easily solved by performing allocations out of memory pools of fixed size.

In the end, we can build a trie for the 2-gram dataset in about 4 GB of memory, and we can almost load the 3-gram dataset. But much better results are possible. The code I have now isn’t a great solution, since even after applying a whole number of too-clever hacks, it still doesn’t work, because the basic data structure isn’t a great fit. Having to keep pointers to different blocks of memory isn’t a good idea, since not only are the pointers really large (8 bytes), but they also preclude easy serialization, so large datasets can’t just be mmap’d. Not only can you not take advantage of the kernel’s paging system, you also have to reload the source data every time.

More importantly, the approach of trying to make the trie smaller is really focusing on the wrong question: “how do you use bit-twiddling hacks to shrink the implementation of this datastructure?” rather than “what kind of data structure uses the (information-theoretic) properties of the data to compress it, in such a way that we can perform computation on compressed data?”. Answering the first question instead of the second leads to unreadable code that’s too clever by half, but doesn’t really get you where you need to go. Answering the second question leads to really interesting things like succinct data structures and is generally the better way to go.

Indeed, it should be possible to do much better: this paper, for instance, claims to have a datastructure that can compress the entire terabyte-sized n-gram dataset ($n = 1,2,3,4,5$) into a 16GB datastructure.

(Addendum: “part 1” indicates hopefully more posts on the topic, but I don’t have anything written yet. Also, as it’s not directly-KDE-related, just (hopefully) interesting, if you’re reading this on planetKDE and don’t want to see it, complain loudly and I’ll leave future posts out of syndication…)