Compressing Your Big Data: Tips and Tricks
Lossless compression for the win.
The growth of big data has created a demand for ever-increasing processing power and efficient storage. DigitalGlobe’s databases, for example, expand by roughly 100 TB a day and cost an estimated $500K a month to store.
Compressing big data can help address these demands by reducing the amount of storage and bandwidth required for data sets. Compression can also remove irrelevant or redundant data, making analysis and processing easier and faster.
Tips and Considerations for Big Data Compression
To get the most value from your data, you need to make the best use of your storage and processing resources while keeping costs down.
Consider Adding a Coprocessor
When you compress data, you must use computing resources and time that could be used for analytics or processing. If your resources are tied up in compression, your productivity will drop until the compression is complete. To avoid this loss of power and time, consider adding coprocessors to your system.
Field-Programmable Gate Arrays (FPGAs) are microchips that can be custom configured as additional processors for your machines. You can use FPGAs to accelerate your hardware and share computation responsibilities with your primary Central Processing Units (CPUs).
You can dedicate FPGA processing power to compressing data and queue compression jobs to these chips. By queueing jobs, you eliminate the need to wait for resources to become available for compression. Since your primary CPUs are no longer being monopolized by data compression, you can continue analysis and processing without waiting.
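Programming an FPGA is beyond the scope of a short snippet, but the queueing pattern itself can be sketched in software. The example below is a minimal stand-in, assuming a small thread pool plays the role of the dedicated compression hardware; `compress_chunk`, `queue_compression_jobs`, and the sample chunks are hypothetical names and data, not part of any FPGA toolkit.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunk(chunk: bytes) -> bytes:
    """Losslessly compress one chunk of data."""
    return zlib.compress(chunk, level=6)


def queue_compression_jobs(chunks, workers=2):
    """Submit every chunk to a dedicated pool and return right away.

    The pool stands in for the coprocessor: jobs wait in its queue
    instead of blocking the caller.
    """
    pool = ThreadPoolExecutor(max_workers=workers)
    futures = [pool.submit(compress_chunk, chunk) for chunk in chunks]
    return pool, futures


# Hypothetical input: four 1 MB chunks of sample data.
chunks = [bytes([65 + i]) * 1_000_000 for i in range(4)]
pool, futures = queue_compression_jobs(chunks)

# The main thread is free to continue other work while jobs are queued.
analysis_result = sum(len(chunk) for chunk in chunks)

# Collect the compressed chunks once they are needed.
compressed = [future.result() for future in futures]
pool.shutdown()

print(f"analyzed {analysis_result} bytes, compressed to {sum(map(len, compressed))} bytes")
```

The key design choice is that submitting a job returns immediately, so analysis code never waits on compression unless it explicitly asks for a result.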
Weigh Your Compression Types
When compressing your data, you can choose between lossless or lossy methods. Lossless compression preserves all data by replacing duplicated data with variables or references to the first instance of the data in a file.
Lossy compression eliminates data, creating a rough approximation of what the data originally was. Lossless compression is typically used for databases, text documents, and other discrete data. Lossy compression is typically used for images, audio, or video.
While lossless compression ensures that all data can be retrieved at decompression, it also takes up more storage space than lossy compression. To accommodate this difference, consider using both methods, choosing between them according to your data type.
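The difference between the two approaches can be illustrated with a small sketch. It assumes a hypothetical list of sensor readings and uses rounding to one decimal place as a simple stand-in for lossy compression; real lossy codecs for images, audio, or video are far more sophisticated.

```python
import struct
import zlib

# Hypothetical sensor readings between 20.0 and 30.0.
readings = [20.0 + 0.001 * i for i in range(10_000)]

# Lossless: keep every value as a 64-bit float, then compress.
raw = struct.pack(f"<{len(readings)}d", *readings)
lossless = zlib.compress(raw)
restored = list(struct.unpack(f"<{len(readings)}d", zlib.decompress(lossless)))

# Lossy: round to one decimal place, store as 16-bit integers, then compress.
quantized = [round(value * 10) for value in readings]
lossy = zlib.compress(struct.pack(f"<{len(quantized)}h", *quantized))
approx = [q / 10 for q in struct.unpack(f"<{len(quantized)}h", zlib.decompress(lossy))]

max_error = max(abs(a - b) for a, b in zip(readings, approx))
print(f"lossless: {len(lossless)} bytes, exact round trip: {restored == readings}")
print(f"lossy:    {len(lossy)} bytes, largest error: {max_error:.3f}")
```

The lossless version returns exactly the original values, while the lossy version is much smaller but only accurate to the chosen precision.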
For example, you can compress video or image sets for machine learning that don’t require high resolution. If you are using a content management system for your sets, there are often features included for automatically compressing video size or optimizing images.
Select Your Codec Carefully
Codec is short for compressor/decompressor. It refers to software, hardware, or a combination of the two that applies compression and decompression algorithms to data.
The type of codec that you are able to use depends on the data and file type you are trying to compress. It also depends on whether you need your compressed file to be splittable. Splittable files can be processed in parallel by different processors. The following codecs can be useful for compressing big data:
- gzip: provides lossless compression that is not splittable. It is often used for HTTP compression. Gzip compression ratio is around 2.7x-3x. Compression speed is around 100 MB/s and decompression speed is around 440 MB/s.
- Snappy: provides lossless compression that is not splittable. It is integrated into Hadoop Common and often used for database compression. Snappy compression ratio is around 2x. Compression speed is around 580 MB/s and decompression speed is around 2020 MB/s.
- LZ4: provides lossless compression that is not splittable unless combined with the 4MC library. It is used for general-purpose analysis. LZ4 compression ratio is around 2.5x. Compression speed is around 800 MB/s and decompression speed is around 4220 MB/s.
- Zstd: provides lossless compression that is splittable. It is not data type-specific and is designed for real-time compression. Zstd compression ratio is around 2.8x. Compression speed is around 530 MB/s and decompression speed is around 1360 MB/s.
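To see what these figures mean in practice, the following back-of-the-envelope sketch applies them to the 100 TB daily growth mentioned at the start of this article. It assumes decimal units (1 TB = 1,000,000 MB), a single core running at the quoted speed, and the lower bound of the gzip ratio range; `estimate` is a hypothetical helper, and real results will vary with your data and hardware.

```python
# Figures taken from the list above:
# codec -> (compression ratio, compression MB/s, decompression MB/s)
CODECS = {
    "gzip": (2.7, 100, 440),  # lower bound of the 2.7x-3x range
    "Snappy": (2.0, 580, 2020),
    "LZ4": (2.5, 800, 4220),
    "Zstd": (2.8, 530, 1360),
}


def estimate(input_tb, ratio, compress_mb_s):
    """Return (stored size in TB, single-core compression time in hours)."""
    input_mb = input_tb * 1_000_000  # assumes decimal units
    stored_tb = input_tb / ratio
    core_hours = input_mb / compress_mb_s / 3600
    return stored_tb, core_hours


DAILY_TB = 100  # daily growth figure quoted earlier in the article

for name, (ratio, compress_speed, _decompress_speed) in CODECS.items():
    stored_tb, core_hours = estimate(DAILY_TB, ratio, compress_speed)
    print(f"{name:>6}: {stored_tb:5.1f} TB stored, {core_hours:6.1f} core-hours to compress")
```

A comparison like this makes the tradeoff visible: a codec with a higher ratio saves storage, while a faster codec ties up far fewer processing hours.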
Optimize JSON Performance
Unfortunately, working with JSON files in big data tools, like Hadoop, can be slow because JSON has no enforced schema and is not strongly typed. To solve this issue, you can improve performance by converting JSON data to the Parquet or Avro format.
Parquet is a column-based format that is compressible and splittable. Parquet stores data in binary files with metadata. This structure enables tools, like Spark, to determine column names, data types, compression, and encodings without parsing the whole file. Parquet writes its metadata after the data, which enables fast, one-pass writing.
Parquet is most useful if you only need to access specific fields. Unfortunately, this format cannot be created from streaming data. It is often used for read-heavy workloads and complex analysis.
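The benefit of a column-based layout can be sketched without any Parquet tooling. The example below is only an illustration of the idea, not Parquet's actual file format: it assumes hypothetical records, stores each field as its own compressed block with a small metadata dictionary, and then reads back a single field. In practice you would use a Parquet library rather than hand-rolled code like `to_columnar` and `read_column`.

```python
import json
import zlib

# Hypothetical JSON-style records.
rows = [
    {"id": i, "country": "US" if i % 2 else "DE", "amount": i * 1.5}
    for i in range(1000)
]


def to_columnar(rows):
    """Store each field as its own compressed block, plus metadata."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    blocks = {
        name: zlib.compress(json.dumps(values).encode())
        for name, values in columns.items()
    }
    # Metadata describes the columns so a reader need not parse the data.
    metadata = {
        "num_rows": len(rows),
        "columns": {name: type(values[0]).__name__ for name, values in columns.items()},
    }
    return blocks, metadata


def read_column(blocks, name):
    """Decompress and parse only the requested column."""
    return json.loads(zlib.decompress(blocks[name]))


blocks, metadata = to_columnar(rows)
amounts = read_column(blocks, "amount")  # "id" and "country" are never touched

print(metadata)
print(f"total amount: {sum(amounts)}")
```

Because each field lives in its own block, a query that needs only `amount` skips the work of decompressing and parsing every other field.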
Avro is a row-based format that is splittable and compressible. Avro files have schema stored in a JSON format and data stored in a binary format. This structure reduces file size and maximizes efficiency.
Avro can be created from streaming data and is most useful if all fields in a dataset need to be accessed. It is typically used for write-heavy workloads because new rows can be added simply and quickly.
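The row-based idea can be sketched in a similar way. The example below is a simplified illustration, not the real Avro encoding, which uses its own variable-length binary encoding, data blocks, and sync markers. It assumes a hypothetical `Reading` schema, writes that schema once as JSON, and then appends fixed-size binary rows one at a time, as a streaming writer would.

```python
import io
import json
import struct

# Hypothetical schema, stored once as JSON at the start of the file.
SCHEMA = {
    "name": "Reading",
    "fields": [
        {"name": "sensor_id", "type": "int"},
        {"name": "value", "type": "double"},
    ],
}
ROW = struct.Struct("<id")  # one row: 32-bit int followed by 64-bit float


def write_header(buf, schema):
    """Write the length-prefixed JSON schema."""
    header = json.dumps(schema).encode()
    buf.write(struct.pack("<I", len(header)))
    buf.write(header)


def append_row(buf, sensor_id, value):
    """Appending a row is a single small binary write."""
    buf.write(ROW.pack(sensor_id, value))


def read_all(data):
    """Read the schema, then decode every row using its field names."""
    (header_len,) = struct.unpack_from("<I", data, 0)
    schema = json.loads(data[4:4 + header_len])
    names = [field["name"] for field in schema["fields"]]
    rows = [dict(zip(names, values)) for values in ROW.iter_unpack(data[4 + header_len:])]
    return schema, rows


buf = io.BytesIO()
write_header(buf, SCHEMA)
for sensor_id, value in enumerate([0.5, 1.25, 2.0]):  # rows arrive one by one
    append_row(buf, sensor_id, value)

schema, rows = read_all(buf.getvalue())
print(rows)
```

Field names appear only once, in the schema, rather than in every record as they would in JSON, and new rows are added without rewriting anything already stored.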
Compression of big data is becoming key to controlling costs and maintaining productivity for many businesses. Thankfully, new technologies and algorithms are being researched and created to address this need.
Hopefully, the compression methods and optimization strategies covered here can help you manage your data until better options become available. By taking advantage of the tools currently available, you should be able to gain a competitive edge while reducing your costs.