Bitpacking and Compression of Sparse Datasets
Originally posted 2016-10-18
Obligatory disclaimer: all opinions are mine and not of my employer
I’m working on a neural network to play Go. To train it, I’m using a dataset of 100,000 game records (which, when played through from start to finish, results in about 30,000,000 game positions). Each position is represented as a 19x19x28 array of floats, which is then handed to the GPU for neural network computation. It’s slow and wasteful to reprocess my dataset each time I train my neural network, so I precompute each position’s representation ahead of time and write it to disk.
Unfortunately, I discovered while scaling up my data pipeline that my processed dataset wouldn’t fit on my SSD: the raw representation of the dataset is 3e7 positions * 19 * 19 * 28 floats * 4 bytes per float = 1.2 TB. I decided that to fix this problem, I’d just compress the data. My data actually consists entirely of 1s and 0s, but I’m using 32-bit floats in my arrays because TensorFlow / my GPU expect floats as input. So, in theory, I should be able to represent my data as bits instead of floats, resulting in a 32x compression. Additionally, many of the 19x19 planes are one-hot encodings with a max value of 8 - i.e., transforming [0, 1, 3, 11] into [[0,0,0,0,0,0,0,0], [1,0,0,0,0,0,0,0], [0,0,1,0,0,0,0,0], [0,0,0,0,0,0,0,1], so I expected another ~8x compression.
The easiest compression algorithm to call from Python was gzip, so I started there. Gzip with default parameters gave me a 120x compression ratio. By my reasoning, gzip nabbed the 32x for the float32->bit conversion and another 4x compression on top of that. That sounded about right, and I was so enamoured by the huge compression ratio that, at first, I didn’t mind the 16 hours it took to preprocess the full dataset.
The problem with taking 16 hours to preprocess the data is that I have to rerun it every time I want to change my position representation – an important step in improving my AI. Making a change and then waiting a whole day to find out the results is frustrating and really slows down progress. This seems like a fairly parallelizable problem, so I figured the problem was Python using just one of my six cores – it’s all the Global Interpreter Lock’s fault, right? But blaming a lack of parallelism is an easy way to miss easier performance wins.
Optimizing the data pipeline
Recently, I decided to investigate ways to make this preprocessing faster. I started by taking a small slice of the complete dataset, which yielded 2GB of uncompressed data (17MB compressed). By profiling the data processing code, I found that processing took 19 seconds (11%), casting the numpy array to bytes took 2 seconds (1%), and compressing/writing the bytes to disk took 152 seconds (88%). That means that I should focus on optimizing the compress/write part of the code. If I were to optimize this part completely, I could speed up my preprocessing by a factor of ~8x.
The easiest thing to try first was swapping out gzip for another compression library. I wanted something that prioritized speed over compression, so I started with Google’s snappy library. Snappy compressed 9 times faster than gzip, but only achieved 10x compression, relative to gzip’s 120x compression. While I was impressed by snappy’s speed, I didn’t want my full preprocessed dataset to take up 100GB. I’m sure snappy works well for the web / RPC traffic that Google designed it for, but it wasn’t right for this task.
Compression algorithms have to balance speed vs compression, so I
started looking for a something in between gzip and snappy. I then
discovered that gzip offered multiple compression levels, and that
Python’s gzip wrapper defaulted to maximum compression. That explained
why my original gzip implementation was so slow. Switching to
compresslevel=6 compressed 4x faster than
compresslevel=9, and the compression ratio only dropped
from 120x to 80x. Not bad. If I had stopped here, I’d be pretty happy
with this tradeoff.
|compression||time to process (s)||time to convert to bytes (s)||time to compress and write (s)||total time (s)||output size (bytes)|
Manually compressing the data
But wait, there’s more! I had been assuming that because gzip was a compression algorithm, it was supposed to be able to figure out ridiculously obvious things like “my data consists entirely of 32-bit representations of 0.0 and 1.0”. Apparently, this is not the case.
Converting my float32’s to uint8’s is a 4x compression. Bitpacking each value into one bit gives a 32x compression. What happens when I run gzip after bitpacking my data?
It turns out that gzipping after bitpacking yields a 1000x compression. Even on its highest compression settings, gzip was leaving a 8x compression on the table when applied to the raw data. It turns out that if you know the structure of your own data, you can very easily do much, much better than a generic compression algorithm. – on both speed and compression.
I investigated all possible combinations of bitpacking and compression algorithms, yielding the following table. (Half bitpack refers to converting float32 to uint8.)
|bitpack||compression||time to process (s)||time to convert to bytes (s)||time to compress and write (s)||total time (s)||output size (bytes)|
In this table, we see that the code to process the positions takes about 19 seconds. Interestingly enough, creating a bytearray from a numpy array of float32s (~2 seconds) is actually slower than casting that numpy array to uint8, then creating a bytearray which is 4x smaller (~1 second). Compression times varied widely, but all compression algorithms got much faster when they had less raw data to work through.
The clear winner was fully packing bits, followed by gzip (compression level 6). This yields a 6x smaller file, 28x faster than my original gzip implementation. The overall runtime dropped from 174 seconds to 25 seconds - a 7x speedup. Compression and writing is now so fast that there’s no point in further optimizing it. Instead, my data processing code is now the slow part. I’ll probably optimize this in the future; it is the same code that needs to run every time I want to evaluate a game position using my neural network.
- There are a lot of compression algorithms available. The gzip family is readily available, and you can tune the balance between compression and speed.
- If you know the structure of your data, you can easily do a better and faster job of compressing than a generic compression algorithm.
All supporting code can be found on a git branch.