Avro at Data Sourcing Layer and Columnar Format for High Performance

If you are starting to set up a data lake architecture, this article will be helpful to you. Get insight into why multiple layers need to be built in a data lake.

In my current project, while laying down the data lake architecture, we chose Avro-format tables as the first layer for data consumption and querying. As of now, Avro offers the most complete support for schema evolution among commonly used formats, so it made sense to recommend a format that can change as the source changes its schema. Below is a short summary of the level of schema evolution Avro supports.

  • Avro supports addition and deletion of columns at any position, provided that added fields declare a default value.

  • Avro supports renaming columns through aliases.

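  • Avro does not support arbitrary data type changes. Only the limited promotions defined in the Avro specification (for example, int to long, or float to double) are resolved automatically.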
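
As a rough illustration of the points above, the following Python sketch mimics how a reader schema is resolved against data written with an older schema. It is a simplified stand-in written with plain dictionaries, not the Avro library itself; the `Customer` schemas, the field names, and the `resolve` helper are all hypothetical, and real Avro resolution also covers nested types and unions.

```python
# Writer schema: what the source produced earlier (hypothetical example).
WRITER_SCHEMA = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "cust_name", "type": "string"},
        {"name": "fax", "type": ["null", "string"], "default": None},
    ],
}

# Reader schema: what the table expects after the source changed.
READER_SCHEMA = {
    "type": "record",
    "name": "Customer",
    "fields": [
        # Column added at the first position; it needs a default so that
        # older records, which lack it, can still be read.
        {"name": "region", "type": "string", "default": "UNKNOWN"},
        {"name": "id", "type": "long"},
        # Column renamed; the old name is kept as an alias.
        {"name": "name", "type": "string", "aliases": ["cust_name"]},
        # "fax" was deleted: it is simply absent from the reader schema.
    ],
}


def resolve(record, writer_schema, reader_schema):
    """Toy reader/writer resolution for flat records (assumption: no nesting)."""
    written = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        # Try the reader's field name first, then any aliases.
        candidates = [field["name"], *field.get("aliases", [])]
        source = next((name for name in candidates if name in written), None)
        if source is not None:
            resolved[field["name"]] = record[source]
        elif "default" in field:
            resolved[field["name"]] = field["default"]
        else:
            raise ValueError(
                f"field {field['name']!r} is missing and has no default"
            )
    return resolved


if __name__ == "__main__":
    old_record = {"id": 1, "cust_name": "Asha", "fax": None}
    print(resolve(old_record, WRITER_SCHEMA, READER_SCHEMA))
```

The sketch shows why defaults and aliases matter: without the default on `region`, or the alias on `name`, the older record could not be mapped onto the new schema.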

But to get the advantage of Avro schema evolution, we need to take care of a few points while creating our Avro table schema (AVSC). Here is a good write-up on rules to follow to take advantage of Avro schema evolution.

Another advantage of Avro is that it is platform-independent and carries the schema of the data in the file itself. Data can therefore be exported to other platforms and applications wherever an Avro deserialization utility is available. Avro's support for schema on read also makes it easy for different users (internal and external) to consume.

Another good reason to use Avro at the sourcing layer is that it supports complex data structures such as maps and arrays in addition to the basic types, so it can handle ingestion ranging from plain text to complex XML.

Compressed Avro achieves roughly 60-80% compression, so queries on Avro tables read less data and run faster than queries on plain text tables.

Avro is best suited for de-normalized data, data whose schema changes continuously, and data with a longer retention policy.

After the data sourcing layer, we recommended creating an ORC layer in the data lake for high-performing queries. But many questions came up: Why an ORC layer? Why not implement ORC at the data sourcing layer? Why do we need an Avro layer at all?

Here are my answers on choosing data storage formats.

  • Why we need an Avro layer: If the data sourcing layer of the data lake cannot support schema evolution, accommodating source schema changes in the data lake becomes very complex and time-consuming. It often ends in creating multiple versions of tables, and handling and tracking such changes is not an easy task.

  • Query performance: When it comes to query performance on big data (i.e., hundreds of GB), Avro performs better than plain text tables, but columnar formats like ORC or Parquet perform much better than Avro in such cases.

  • Why ORC is not at the data sourcing layer: Creating ORC tables, which are columnar, requires the incoming data to be available in plain text format with a defined schema. So, ORC at the data sourcing layer is not possible.

  • Why ORC: Hortonworks recommends ORC for high-performing queries, and it has other advantages as well, such as support for schema on read and compression. ORC also gives faster joins than row-format tables such as Avro, plain text, or JSON. ORC files carry data statistics and column indexes, too.

  • Schema evolution at the next layer: If Avro handles schema evolution at the data sourcing layer, how do we propagate the same changes to the next (ORC) layer? This is a more useful question than asking which format supports what. ORC, like any other format, supports schema evolution in the form of adding new columns at the end of the schema. If a column needs to be deleted, either leave its value as NULL or export the existing data to a new table that leaves out the column to be deleted. In short, as long as the data is sourced as required at the first layer, we can build logic to propagate the changes to the next layer easily.

  • Querying a subset of columns: Queries normally touch only a few columns. ORC is the best-suited format (if running a Hortonworks distribution) when only a subset of the columns has to be queried, because only those columns, not all of the data, are loaded into memory for processing.

  • ORC as schema on read: Like Avro, ORC supports schema on read, and ORC data files contain the data schema along with data statistics.
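
The propagation logic described under the schema evolution point in the list above could look roughly like the sketch below. It assumes the column lists of both layers are available as (name, type) pairs; the table name `lake.customer_orc`, the column names, and the helper functions are hypothetical. The generated statement uses Hive's `ALTER TABLE ... ADD COLUMNS` form, which appends columns at the end of the schema. Renames are not tracked here; a renamed column would show up as one removed and one added column.

```python
def plan_orc_layer_change(orc_columns, avro_columns):
    """Compare the ORC layer's columns with the current Avro (source) columns.

    Both arguments are lists of (name, type) pairs. Returns a tuple of:
      - the new ORC column list (existing order kept, new columns appended),
      - the columns to add,
      - the names of columns that no longer exist at the source.
    """
    orc_names = [name for name, _ in orc_columns]
    avro_names = {name for name, _ in avro_columns}
    added = [(name, typ) for name, typ in avro_columns if name not in orc_names]
    removed = [name for name in orc_names if name not in avro_names]
    return orc_columns + added, added, removed


def add_columns_ddl(table, added):
    """Build the statement that appends new columns, or None if nothing changed."""
    if not added:
        return None
    cols = ", ".join(f"`{name}` {typ}" for name, typ in added)
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


def project_row(avro_row, new_columns):
    """Reorder a source row to the ORC column order.

    Columns deleted at the source are kept in the ORC layer and filled with
    None (NULL), as the text suggests.
    """
    return [avro_row.get(name) for name, _ in new_columns]


if __name__ == "__main__":
    orc = [("id", "bigint"), ("name", "string"), ("fax", "string")]
    avro = [("region", "string"), ("id", "bigint"), ("name", "string")]
    new_columns, added, removed = plan_orc_layer_change(orc, avro)
    print(add_columns_ddl("lake.customer_orc", added))
    print(project_row({"region": "EU", "id": 1, "name": "Asha"}, new_columns))
```

Keeping the deleted column as NULL avoids rewriting existing ORC data; exporting to a new table without that column remains the alternative when the column must disappear entirely.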

If you are starting to set up a data lake architecture, I hope this article has given you insight into why multiple layers need to be built in the data lake.
