Over a million developers have joined DZone.

Optimal Compression Rate

Squeeze out more performance by using the best compression methods for the task.

· Performance Zone

Download Forrester’s “Vendor Landscape, Application Performance Management” report that examines the evolving role of APM as a key driver of customer satisfaction and business success, brought to you in partnership with BMC.

In the new blittable format we use in RavenDB, we make heavy use of compression to reduce the size of documents.

Compression has an unfortunate problem, though. It doesn't work for all inputs. The proof of that is a very neat one:

  • We have a set of compression / decompression functions: compress(plainText) –> compressedText and decompress(compressedText) –> plainText.
  • Those are lossless functions, that is, for any input decompress( compress(X) ) == X
  • Let us assume that for any input, size(plainText) > size(compressedText)

But that causes a problem. Let us assume that our plain text size is N and that the compression algorithm reduces that size by just one bit, so the size of the compressedText is N-1.

We'll compress all possible permutations of N bits using this algorithm. Given that the compression results in at least N-1  bits, there must now be two different values of the plainText that result in the same compressedText. That breaks the ability to decompress them successfully. Because there isn't a one to one mapping between them. Common compression algorithms rely on the source data to either have repetitions (LZ derivatives) or are based on shared dictionaries that match a particular set of data.

In practice, this has real world implications when you are designing a data format. For example, the blittable format compresses strings using two different algorithms. For large strings, we use LZ4, because it has much higher compression rate and doesn't require any special knowledge of the data. For small strings, we use Smaz variants, which is a shared dictionary of common terms. Because the dictionary is small, we can't put a lot of data into it, so we concentrated on common Latin character sequences.

That means that if you are trying to compress a Unicode string like:

רוח צפונית

You are going to use up more bytes than the original plain text. This is easy to experiment with using Smaz variant because it is very simple. But it also happens using LZ4 for certain inputs.

That causes a problem for the blittable format because we want to compress the data, but for certain inputs, that means that we are going to get more data.

We solved that by doing conditional compression. We designate a buffer that is smaller than the plain text that we are compressing (this reflect the maximum amount of compression that is valuable for us) and compresses to that buffer. If the compression routine was unable to compress to that buffer (because it needed more space), we fail the compression, and just store the plain text.

Now we have an optimal compression rate, this is going to always be equal to or (hopefully usually) smaller than the original text.

See Forrester’s Report, “Vendor Landscape, Application Performance Management” to identify the right vendor to help IT deliver better service at a lower cost, brought to you in partnership with BMC.

compression,document store,string handling

The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}