The first major version of the scalable timeseries database I work on, Gnocchi, was a released a few months ago. In this first iteration, it took a rather naive approach to data storage. We had little ideas about if and how our distributed back-ends were going to be heavily used, so we stuck to the code of the first proof-of-concept written a couple of years ago.
Recently we got more feedbacks from our users, ran a few benchmarks. That gave us enough feedback to start investigating improving our storage strategy.
Up to Gnocchi 1.3, all data for a single metric are stored in a single gigantic file per aggregation method (min, max, average…). This means that the file can grow to several megabytes in size, which makes it slow to manipulate.For the next version of Gnocchi, our first work has been to rework that storage and split the data into smaller parts.
The diagram above shows how data are organized inside Gnocchi. Until version 1.3, there would have been only one file for each aggregation methods.
In the upcoming 2.0 version, Gnocchi will split all these data into smaller parts, where each data split is stored in a file/object. This allows us to manipulate smaller pieces of data and to increase the parallelism of the CRUDoperations on the back-end – leading to large speed improvement.
In order to split timeseries into several chunks, Gnocchi defines a maximum number of N points to keep per chunk, to limit their maximum size. It then defines a hash function that produces a non-unique key for any timestamp. It makes it easy to find in which chunk any timestamp should be stored or retrieved.
Up to Gnocchi 1.3, the data stored for each metric is simply serialized using msgpack, a fast and small serialization format. However, this format does not provide any compression. That means that storing datapoints needs 8 bytes for a timestamp (64 bits timestamp with nanosecond precision) and 8 bytes for a value (64 bits double-precision floating-point), plus some overhead (extra information and msgpack itself).
After looking around on how to compress all these measures, I stumbled upon a paper from some Facebook engineers about Gorilla, their in-memory timeseries database, entitled "Gorilla: A Fast, Scalable, In-Memory Time Series Database". For reference, part of this encoding is also used by InfluxDB in its new storage engine.
The first technique I implemented is easy enough and is inspired from delta-of-delta encoding. Instead of storing each timestamp for each data point, and since all the data points are aggregated on a regular interval, we transpose points to be the time difference divided by the interval. For example, the suite of timestamps
timestamps =[41230, 41235, 41240, 41250, 41255] is encoded into
timestamps =[41230, 1, 1, 2, 1], interval = 5. This allows regular compression algorithms to reduce the size of the integer list using run-length encoding.
To actually compress the values, I tried two different algorithms:
- LZ4, a fast compression/decompression algorithm
- The XOR based compression scheme described in the Gorilla paper mentioned above – that I had to implement myself. For reference, it also exists a Go implementation in go-tsz.
I then benchmarked these solutions:
The XOR algorithm implemented in Python is pretty slow, compared to LZ4. Truth is that python-lz4 is fully implemented in C, which makes it fast. I've profiled my XOR implementation in Python, to discover that one operation took 20% of the time:
count_lead_and_trail_zeroes, which is in charge of counting the number of leading and trailing zeroes in a binary number.
I tried 2 Python implementations of the same algorithm (and submitted them to my friend and Python developer Victor Stinner by the way).
The first version, using string search with
.index() is 10× faster than the second one that only does integer computation. Ah, Python… As Victor explained, each Python operation is slow and there's a lot in the second version, whereas
.index() is implemented in C and really well optimized and only needs 2 Python operations.
Finally, I ended up optimizing that code by leveraging cffi to use directly
flsll(). That decreased the run-time of
count_lead_and_trail_zeroes by 45 %, making the entire XOR compression code speed increased by a small 7%. This is not enough to catch up with LZ4 speed. At this stage, the only solution to achieve a high-speed would probably to go with a full C implementation.
Considering the compression ratio of the different algorithms, they are pretty much identical. The worst case scenario (random values) for LZ4 compress down to 9 bytes per data point, whereas XOR can go down to 7.38 bytes per datapoint. In general XOR encoding beats LZ4 by 15%, except for cases where all values are 0 or 1. However, LZ4 is faster than XOR by a factor of 4×-70×depending on cases.
That means that we'll use LZ4 for data compression in Gnocchi 2.0. It's possible that we could achieve as fast compression/decompression algorithm, but I don't think it's worth the effort right now – it'd represent a lot of code to write and to maintain.