DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
Building Scalable Real-Time Apps with AstraDB and Vaadin
Register Now

Trending

  • Structured Logging
  • What Is mTLS? How To Implement It With Istio
  • Web Development Checklist
  • Seven Steps To Deploy Kedro Pipelines on Amazon EMR

Trending

  • Structured Logging
  • What Is mTLS? How To Implement It With Istio
  • Web Development Checklist
  • Seven Steps To Deploy Kedro Pipelines on Amazon EMR
  1. DZone
  2. Data Engineering
  3. AI/ML
  4. Optimal Compression Rate

Optimal Compression Rate

Squeeze out more performance by using the best compression methods for the task.

Oren Eini user avatar by
Oren Eini
·
Mar. 01, 16 · Analysis
Like (1)
Save
Tweet
Share
2.34K Views

Join the DZone community and get the full member experience.

Join For Free

In the new blittable format we use in RavenDB, we make heavy use of compression to reduce the size of documents.

Compression has an unfortunate problem, though. It doesn't work for all inputs. The proof of that is a very neat one:

  • We have a set of compression / decompression functions: compress(plainText) –> compressedText and decompress(compressedText) –> plainText.
  • Those are lossless functions, that is, for any input decompress( compress(X) ) == X
  • Let us assume that for any input, size(plainText) > size(compressedText)

But that causes a problem. Let us assume that our plain text size is N and that the compression algorithm reduces that size by just one bit, so the size of the compressedText is N-1.

We'll compress all possible permutations of N bits using this algorithm. Given that the compression results in at least N-1  bits, there must now be two different values of the plainText that result in the same compressedText. That breaks the ability to decompress them successfully. Because there isn't a one to one mapping between them. Common compression algorithms rely on the source data to either have repetitions (LZ derivatives) or are based on shared dictionaries that match a particular set of data.

In practice, this has real world implications when you are designing a data format. For example, the blittable format compresses strings using two different algorithms. For large strings, we use LZ4, because it has much higher compression rate and doesn't require any special knowledge of the data. For small strings, we use Smaz variants, which is a shared dictionary of common terms. Because the dictionary is small, we can't put a lot of data into it, so we concentrated on common Latin character sequences.

That means that if you are trying to compress a Unicode string like:

רוח צפונית

You are going to use up more bytes than the original plain text. This is easy to experiment with using Smaz variant because it is very simple. But it also happens using LZ4 for certain inputs.

That causes a problem for the blittable format because we want to compress the data, but for certain inputs, that means that we are going to get more data.

We solved that by doing conditional compression. We designate a buffer that is smaller than the plain text that we are compressing (this reflect the maximum amount of compression that is valuable for us) and compresses to that buffer. If the compression routine was unable to compress to that buffer (because it needed more space), we fail the compression, and just store the plain text.

Now we have an optimal compression rate, this is going to always be equal to or (hopefully usually) smaller than the original text.

Data (computing) Plain text Relational database Strings Algorithm Dictionary (software) Buffer (application) LZ4 (compression algorithm)

Opinions expressed by DZone contributors are their own.

Trending

  • Structured Logging
  • What Is mTLS? How To Implement It With Istio
  • Web Development Checklist
  • Seven Steps To Deploy Kedro Pipelines on Amazon EMR

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com

Let's be friends: