How to Compress Data by 90 Percent


Do your files take up substantial server storage? Here is how you can compress your data by 90 percent with data compression algorithms.

Data Compression Algorithms

There are two main types of data compression:

  1. Lossless compression (fully reversible)

  2. Lossy compression (a minor part of the data is lost, so complete restoration is not possible)

The first type of compression is employed when it is important that the data restored from the compressed form is free of distortion, as in our project. This kind of compression removes none of the original data; compression is achieved solely by representing the data in a less space-consuming form.
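To see that reversibility in practice, here is a minimal sketch using Python's standard-library zlib module (its DEFLATE format combines LZ77 matching with Huffman coding); the choice of zlib is purely illustrative and is not the approach used in our project:

import zlib

original = b"lossless compression restores the data bit for bit " * 20
compressed = zlib.compress(original)     # DEFLATE: LZ77 matching + Huffman coding
restored = zlib.decompress(compressed)

assert restored == original              # fully reversible: nothing was lost
print(len(original), "->", len(compressed), "bytes")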

Let us consider some of the most common lossless (reversible) compression algorithms:

  1. Huffman technique — This entails replacing a fixed-length code for symbols with variable-length codes, depending on how frequently each symbol occurs in the text. The codes with minimum length are assigned to the most common symbols. When the symbol probabilities are negative powers of two (1/2^n), this technique reaches the theoretical entropy limit of compression efficiency for techniques of this type.

  2. Shannon–Fano technique — This is a prefix, variable-length coding algorithm, and it represents the compression techniques based on symbol probabilities. Like the Huffman algorithm, this technique exploits message redundancy, i.e., a non-uniform distribution of frequencies among the symbols of its (primary) alphabet: it replaces the codes of more frequent symbols with short binary sequences and the codes of rarer symbols with longer binary sequences. Unlike Huffman coding, however, the Shannon–Fano technique does not always produce optimal codeword lengths, so it is generally somewhat worse than the Huffman code.

  3. Run-length encoding (RLE) — This refers to replacing runs of repeated symbols with the symbol code and the number of repetitions. Though easy to understand and very simple, this compression technique is efficient only when the data actually contains long runs (a minimal sketch appears after this list).

  4. Lempel–Ziv techniques — What all compression algorithms of this group (LZ77, LZ78, and LZW) have in common is the idea of searching for repeated text fragments in the data and replacing each repetition with a reference to the first or a previous occurrence of that fragment.
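As mentioned in the list above, run-length encoding is the simplest of these techniques. A minimal sketch in Python (the function names are illustrative, not taken from any particular library):

def rle_encode(text):
    # Replace each run of a repeated symbol with a (symbol, run length) pair.
    if not text:
        return []
    runs = []
    current, count = text[0], 1
    for ch in text[1:]:
        if ch == current:
            count += 1
        else:
            runs.append((current, count))
            current, count = ch, 1
    runs.append((current, count))
    return runs

def rle_decode(runs):
    # Reverse the encoding: expand every (symbol, run length) pair.
    return "".join(ch * count for ch, count in runs)

print(rle_encode("aaaabbbcca"))   # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
assert rle_decode(rle_encode("aaaabbbcca")) == "aaaabbbcca"

On data without long runs, the encoded list of pairs can easily be larger than the input, which is exactly the inefficiency noted above.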

The Huffman coding problem is equivalent to the problem of constructing the associated code tree.

Huffman Algorithm

The algorithm for constructing optimal unequal-length codes proposed by Huffman is one of the most important achievements of information theory, from both theoretical and applied perspectives.

Let us consider the set of messages X = {1, …, M} with message probabilities {p1, …, pM}. Without loss of generality, we assume the messages are ordered by decreasing probability, i.e., p1 ≥ p2 ≥ … ≥ pM. Our objective is to construct an optimal code, that is, a code with the lowest possible average codeword length. Such a code is not necessarily unique for the given probabilities; on the contrary, a whole family of optimal codes may exist. Let us establish some properties shared by all codes in this family. Those properties will point us to a simple way of constructing an optimal code.
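As a toy illustration of this objective (the probabilities and codeword lengths below are made-up numbers, not taken from the project described here), the average length of a code is the probability-weighted sum of its codeword lengths:

# Hypothetical message probabilities and the codeword lengths of one prefix code.
probs   = [0.4, 0.3, 0.2, 0.1]
lengths = [1, 2, 3, 3]

# Average codeword length: the sum of pi * li over all messages.
avg_length = sum(p * l for p, l in zip(probs, lengths))
print(avg_length)   # 1.9 bits per message

An optimal code is one that minimizes this sum over all prefix codes for the same probabilities.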

Let the binary code C = {c1, …, cM} with codeword lengths {l1, …, lM} be optimal for the set of messages in question.

  1. If pi < pj, then li ≥ lj.

  2. At least two codewords have the maximum length lM = maxm lm.

  3. Two of the codewords of maximum length lM = maxm lm differ only in their last symbol.

  4. If the code C′ is optimal for the reduced message set X′, obtained from X by merging its two least probable messages into one, then the code C, obtained from C′ by splitting the codeword of the merged message into two codewords that are one bit longer, is optimal for X.
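For illustration, take the toy probabilities {0.4, 0.3, 0.2, 0.1} used in the sketch above. Merging the two least probable messages (0.2 and 0.1) gives the reduced set X′ with probabilities {0.4, 0.3, 0.3}, whose optimal codeword lengths are {1, 2, 2}. Splitting the merged codeword back into two codewords one bit longer yields the lengths {1, 2, 3, 3} for X, which property 4 tells us are again optimal; their average length is 0.4·1 + 0.3·2 + 0.2·3 + 0.1·3 = 1.9 bits.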

Input: Alphabet size M, the probabilities of letters

Output: Binary tree of the Huffman code

Initialization: The number of unprocessed nodes is M0 = M

While M0>1 do

  1. Find two nodes with the lowest probabilities in the queue of unprocessed nodes.

  2. Remove those nodes from the queue of unprocessed ones.

  3. Generate a new node with the two just-selected nodes as children. The weight of the new node is set equal to the sum of its children's weights.

  4. Add the newly generated node to the queue. Link the new node with edges to the just-removed nodes.

  5. M0 ← M0 - 1.

  6. If there is more than one node in the queue, repeat steps 1–5.
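A compact sketch of this procedure in Python, using the standard-library heapq module as the queue of unprocessed nodes (the function name and the tuple representation of tree nodes are illustrative choices, not part of the original algorithm statement):

import heapq
import itertools

def huffman_code(probabilities):
    # probabilities: dict mapping each symbol to its probability (or weight).
    tiebreak = itertools.count()          # keeps heapq from ever comparing tree nodes
    heap = [(p, next(tiebreak), symbol) for symbol, p in probabilities.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Steps 1-2: take the two nodes with the lowest weights out of the queue.
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        # Steps 3-4: create a parent whose weight is the sum of its children's
        # weights and put it back into the queue.
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))

    # Read the codewords off the finished tree: 0 for the left edge, 1 for the right.
    codes = {}
    def walk(node, prefix=""):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"   # degenerate single-symbol alphabet
    walk(heap[0][2])
    return codes

print(huffman_code({"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}))
# Codeword lengths come out as a: 1, b: 2, c: 3, d: 3 (the exact 0/1 labels may vary).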
