Optimized File Formats – Reduce Overall System Latency
Decrease system latency in a few simple steps.
Since optimized columnar file formats brought SQL query capabilities to the Big Data ecosystem, organizations have been able to quickly retrain their existing data warehouse and database developers in Big Data technology and migrate their analytics applications to on-premise Hadoop clusters or inexpensive object storage in the cloud.
When columnar file formats were first proposed in the early 2010s, the goal was to enable faster query execution engines on top of the Hadoop file system. Columnar formats were explicitly designed to deliver much better query performance than the conventional row-based file formats used in traditional databases and data warehouses, especially when a query reads only a subset of a table's columns.
The open-source developer community and contributing organizations have continuously improved the encoding and compression of columnar file formats. Beyond query speed, data files are significantly smaller in columnar format. All major programming languages, including Java, Python, C, C++, and Go, are supported by these open-source columnar file format projects.
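The compression benefit comes from the layout itself: storing all values of a column together puts similar values side by side, which generic compressors exploit well. The following is a minimal standard-library sketch of that effect, using a made-up telemetry table and gzip as a stand-in for the dedicated encodings (dictionary, run-length) that real columnar formats such as Parquet or ORC apply per column.

```python
import gzip
import json

# Hypothetical sensor table: many rows, few distinct values per column,
# which is typical of factory telemetry.
rows = [{"machine": f"m{i % 4}", "status": "OK", "temp": 20 + i % 3}
        for i in range(10_000)]

# Row-oriented layout: one complete record after another (like CSV or
# JSON lines), so unrelated values are interleaved.
row_bytes = "\n".join(json.dumps(r) for r in rows).encode()

# Column-oriented layout: all values of each column stored together,
# so long runs of similar values sit side by side and compress well.
columns = {key: [r[key] for r in rows] for key in rows[0]}
col_bytes = json.dumps(columns).encode()

row_gz = len(gzip.compress(row_bytes))
col_gz = len(gzip.compress(col_bytes))
print(f"row layout: {row_gz} bytes compressed")
print(f"column layout: {col_gz} bytes compressed")
```

On repetitive data like this, the column layout compresses noticeably smaller than the row layout; real columnar formats go further with type-aware encodings before general-purpose compression.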
Thanks to recent enhancements, columnar file formats can now be used in many places beyond Hadoop clusters and object-storage-based cloud data lakes. In the industrial manufacturing world in particular, columnar file formats offer several benefits: reduced system latency, lower system cost, and improved data security.
The diagrams below show the data flow from the edge (the production floor at a factory) to the cloud.
In Data Flow (1), data is streamed directly to the cloud without much data preparation. Raw data is consolidated in the cloud, where ETL tools process the raw data streams and structure the data for further operational analytics needs. In this reference architecture, overall system latency is mostly the sum of three values: the time to transmit data from the cloudlet to the cloud data preparation layer, the time to prepare raw data in the cloud in a large batch, and the time to replicate prepared data into a data lake.
In Data Flow (2), data is processed at the cloudlet (popularly known as a "cloud in a box") in small batches in near real time and converted into a columnar file format. Since the data is already prepared, it can be inserted directly into the data lake. Because significant time is saved in data preparation, the data is available for further consumption in close to real time.
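The cloudlet-side loop in Data Flow (2) can be sketched roughly as follows. This is an illustrative standard-library example, not the integration used in the test: the batch size, file names, and the toy cleanup step (type casting, dropping malformed rows) are all assumptions, and in practice each prepared batch would be written as a columnar file (e.g. Parquet) rather than CSV.

```python
import csv
import io
import pathlib
import tempfile

# Stand-in for the raw stream arriving from the production floor.
raw = io.StringIO("ts,temp\n" + "\n".join(f"{i},{20 + i % 3}"
                                          for i in range(2500)))
out_dir = pathlib.Path(tempfile.mkdtemp())
BATCH_SIZE = 1000  # illustrative micro-batch size


def flush(batch, batch_no):
    # Write one prepared micro-batch; a real cloudlet would emit a
    # columnar file here and ship it straight to the data lake.
    path = out_dir / f"batch_{batch_no:04d}.csv"
    path.write_text("\n".join(f"{r['ts']},{r['temp']}" for r in batch))


batch, batch_no = [], 0
for row in csv.DictReader(raw):
    # "Preparation" step: cast types and drop malformed rows.
    try:
        batch.append({"ts": int(row["ts"]), "temp": float(row["temp"])})
    except (KeyError, ValueError):
        continue
    if len(batch) == BATCH_SIZE:
        flush(batch, batch_no)
        batch, batch_no = [], batch_no + 1
if batch:  # flush the final partial batch
    flush(batch, batch_no)

print(sorted(p.name for p in out_dir.iterdir()))
# → ['batch_0000.csv', 'batch_0001.csv', 'batch_0002.csv']
```

Because each batch is flushed as soon as it is full, prepared data becomes available downstream within one batch interval instead of waiting for a large cloud-side ETL run.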
I built two integrations using the data flow patterns described above, and the results clearly show the benefits of using columnar formats at the edge or cloudlet layer. In this test, a 1 GB CSV file was transferred to the cloud data lake, and the data cleanup complexity was low. Though it was a limited test, it shows compelling benefits of Data Flow (2) over Data Flow (1).
|  | Data Flow (1) | Data Flow (2) |
| --- | --- | --- |
| Time to prepare data at cloudlet [a] | Not applicable | 0.87 mins |
| Time to transfer data from cloudlet to cloud [b] | 20.63 mins | 5.29 mins |
| Time to prepare data at cloud [c] | 1.305 mins | Not applicable |
| Overall system latency [a + b + c] | 21.935 mins | 6.16 mins |
| Storage space at cloud | 1.256 GB | 0.25 GB |
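As a quick sanity check, the overall-latency rows of the table above are just the sum of the applicable steps for each flow (a missing step counts as zero):

```python
# Latency components in minutes, taken from the table above.
flow1 = {"cloudlet_prep": 0.0, "transfer": 20.63, "cloud_prep": 1.305}
flow2 = {"cloudlet_prep": 0.87, "transfer": 5.29, "cloud_prep": 0.0}

latency1 = sum(flow1.values())  # overall latency, Data Flow (1)
latency2 = sum(flow2.values())  # overall latency, Data Flow (2)
print(round(latency1, 3), "mins vs", round(latency2, 2), "mins")
# → 21.935 mins vs 6.16 mins
```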
Key Benefits:
Speed: Less data preparation and smaller file sizes improve the overall speed of data replication. Most importantly, the ability to scale is multiplied without additional cost.
Cost of data transfer: Smaller file sizes (almost four times smaller than the conventional row-based format) reduce the cost and time of transferring data from on-premise systems to the cloud data lake over the internet.
Cost of storage: Since data preparation is almost entirely removed from the cloud (as in Data Flow (2)) and data file sizes are reduced, the cost of cloud storage drops significantly.
Data security: With columnar formats (Data Flow (2)), encryption is available from the data source all the way to the data lake, ensuring data security.
Columnar data formats can be leveraged as described above to create cost-effective, close-to-real-time operational analytics solutions for the industrial world.
Opinions expressed by DZone contributors are their own.