Apache Avro to ORC Using Apache Gobblin
Data conversion from Apache Avro to Apache ORC using Apache Gobblin to optimize storage and query performance in big data infrastructures.
Apache Avro and Apache ORC
Apache Avro is an efficient row-based binary file format for serializing data in transit or at rest. It uses a schema to define the structure of the data being serialized, and that schema is stored alongside the data in Avro’s data file. Because data structures in the big data space change frequently, Avro was designed to support schema evolution: new fields can be added to a data structure without recompiling all the code that uses it.
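To make the schema-evolution point concrete, here is a minimal Python sketch that uses only the standard library. It mimics one rule of Avro's schema resolution: when the reader's schema contains a field that older data lacks, the field's default value is used. The schema, its field names, and the `resolve_record` helper are hypothetical, and a real pipeline would rely on an Avro library rather than plain dictionaries.

```python
import json

# Reader schema (version 2) in Avro's JSON notation. The schema and its field
# names are made up for illustration; "country" was added after version 1.
READER_SCHEMA = json.loads("""
{"type": "record", "name": "User", "fields": [
  {"name": "id", "type": "long"},
  {"name": "name", "type": "string"},
  {"name": "country", "type": "string", "default": "unknown"}
]}
""")


def resolve_record(record, reader_schema):
    """Project a decoded record onto the reader schema, filling in defaults."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return resolved


# A record written with the older schema, which had no "country" field.
old_record = {"id": 1, "name": "Ada"}
print(resolve_record(old_record, READER_SCHEMA))
```

The old record stays readable because the new field declares a default, which is why adding fields with defaults is the usual way to evolve an Avro schema.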
Apache ORC, in contrast, is a column-based storage format (used primarily with the Hadoop Distributed File System) that is optimized for storing and querying large datasets with heavy compression. ORC is designed to improve the performance of query engines like Hive and Pig by providing a more efficient way to store and query data. For analytical workloads on large datasets, ORC typically outperforms Avro in both query speed and storage footprint because it keeps data in a compact, columnar layout. ORC also supports predicate pushdown, which filters data at the storage layer and reduces the amount of data that must be read from disk and processed.
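The following Python sketch illustrates the two ideas in this paragraph: storing each column contiguously and using per-stripe minimum and maximum statistics to skip data that cannot match a filter. It is a toy model of the concept, not ORC's actual file layout; the sample rows and the `to_stripes` and `scan_greater_than` helpers are hypothetical.

```python
# Row-oriented records, as a row format such as Avro would store them.
ROWS = [
    {"id": 1, "country": "US", "amount": 10},
    {"id": 2, "country": "DE", "amount": 20},
    {"id": 3, "country": "US", "amount": 15},
    {"id": 4, "country": "IN", "amount": 25},
    {"id": 5, "country": "DE", "amount": 90},
    {"id": 6, "country": "IN", "amount": 120},
]


def to_stripes(rows, stripe_size):
    """Group rows into stripes, storing each column contiguously with min/max stats."""
    stripes = []
    for start in range(0, len(rows), stripe_size):
        chunk = rows[start:start + stripe_size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stats = {name: (min(values), max(values)) for name, values in columns.items()}
        stripes.append({"columns": columns, "stats": stats})
    return stripes


def scan_greater_than(stripes, column, threshold):
    """Return the matching values and the number of stripes that had to be read."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        _, column_max = stripe["stats"][column]
        if column_max <= threshold:
            continue  # the statistics prove that no value in this stripe can match
        stripes_read += 1
        matches.extend(value for value in stripe["columns"][column] if value > threshold)
    return matches, stripes_read


stripes = to_stripes(ROWS, stripe_size=2)
print(scan_greater_than(stripes, "amount", 50))
```

With this sample data, only one of the three stripes is read for the filter `amount > 50`, and only the `amount` column of that stripe is touched.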
In short, Avro is a general-purpose data serialization format that is well suited for transmitting data, whereas ORC is a specialized storage format optimized for storing, querying, and processing large datasets.
Therefore, ORC may be a better choice than Avro if you work with large datasets and need efficient queries and data processing. However, as in most data infrastructures, much of your data lake is probably still stored in Avro. That is common because Avro was released in 2009 and established itself in the big data ecosystem early, whereas ORC arrived later, in 2013.
Challenges in Converting Data From Apache Avro to Apache ORC
Converting data from Avro to ORC is particularly challenging, and you might face issues like:
Schema conversion: Avro and ORC each have their own schema model. Converting a schema from Avro to ORC is time-consuming, more so if the schema is complex. Unfortunately, that is often the case with big data datasets.
Data type differences: As with schema models, Avro and ORC support distinct data types that do not map one-to-one. The required type casting further complicates the schema conversion.
Performance: Transforming data from Avro to ORC is often resource-intensive for large datasets. It can take a very long time if the job is not carefully designed and heavily optimized.
Loss of data: Even if appropriately coded, data loss is possible during data conversion, primarily because of failures in intermediate tasks or incompatibility between Avro and ORC fields.
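To illustrate the schema and data type challenges listed above, here is a rough Python sketch that translates an Avro schema (in its JSON form) into a Hive-style type string of the kind used for ORC tables. The mapping follows the correspondence described in the Hive Avro SerDe documentation as best understood here, so treat it as an approximation. It ignores logical types and references to named types, and it is not how Gobblin implements the conversion. The `avro_to_hive` helper and the sample schema are hypothetical.

```python
# Primitive Avro types and the Hive types they are assumed to correspond to.
PRIMITIVES = {
    "boolean": "boolean",
    "int": "int",
    "long": "bigint",
    "float": "float",
    "double": "double",
    "bytes": "binary",
    "string": "string",
}


def avro_to_hive(schema):
    """Translate a parsed Avro schema (str, list, or dict) into a Hive type string."""
    if isinstance(schema, str):
        return PRIMITIVES[schema]
    if isinstance(schema, list):  # an Avro union
        branches = [branch for branch in schema if branch != "null"]
        if len(branches) == 1:
            return avro_to_hive(branches[0])  # ["null", T] is just a nullable T
        return "uniontype<" + ",".join(avro_to_hive(b) for b in branches) + ">"
    kind = schema["type"]
    if kind == "record":
        fields = ",".join(
            f"{field['name']}:{avro_to_hive(field['type'])}" for field in schema["fields"]
        )
        return f"struct<{fields}>"
    if kind == "array":
        return f"array<{avro_to_hive(schema['items'])}>"
    if kind == "map":
        return f"map<string,{avro_to_hive(schema['values'])}>"  # Avro map keys are strings
    if kind == "enum":
        return "string"
    if kind == "fixed":
        return "binary"
    return avro_to_hive(kind)  # e.g. {"type": "string"}


EVENT_SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "tags", "type": {"type": "array", "items": "string"}},
        {"name": "note", "type": ["null", "string"]},
        {"name": "geo", "type": {"type": "record", "name": "Geo",
                                 "fields": [{"name": "lat", "type": "double"}]}},
    ],
}
print(avro_to_hive(EVENT_SCHEMA))
```

Even this small example shows why the mapping is not one-to-one: enums and fixed-length values lose their specific types, and nullable unions collapse into plain nullable columns.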
Using Apache Gobblin To Convert Apache Avro Data to Apache ORC
You can use Apache Gobblin to overcome the challenges of Avro to ORC data conversion.
Apache Gobblin is an open-source data integration framework that simplifies the process of extracting, transforming, and loading large datasets into a target system. While we will discuss Gobblin’s usage in the context of Avro to ORC data conversion, Gobblin has a host of other capabilities and provides a range of built-in connectors and converters that can be used to transfer data between different formats and systems, including Avro and ORC.
Apache Gobblin effectively addresses the challenges in converting Avro data to ORC. It converts the schema, maps data types, and has specialized capabilities like flattening nested columns if needed. Gobblin also supports fault tolerance, data validation, and data quality checks, thereby ensuring the integrity of the data being converted. Furthermore, it is highly configurable and customizable to address the specific requirements of specialized scenarios. This article covers the out-of-the-box configurations and capabilities that Gobblin provides.
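As an illustration of what flattening a nested column means, the Python sketch below turns a nested record into top-level columns by joining the path to each leaf field. The `flatten` helper is hypothetical, and the `__` separator is an assumption made for this example; check the column names Gobblin actually generates in your own output tables.

```python
def flatten(record, parent="", sep="__"):
    """Flatten nested dictionaries into a single level of column names."""
    flat = {}
    for key, value in record.items():
        # Build the column name from the path to this field.
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))  # descend into the nested record
        else:
            flat[column] = value
    return flat


nested = {"id": 1, "address": {"city": "Oslo", "geo": {"lat": 59.9}}}
print(flatten(nested))
```

Flattened columns are often easier to query and filter in SQL engines, while the nested form preserves the original structure of the Avro records.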
To use Gobblin for converting Avro data to ORC:
- Start by registering Avro data with Apache Hive (partitioned or snapshot).
- Download the Gobblin binaries from here.
- Modify the configuration detailed below this section to configure your Gobblin job.
- Launch and run Gobblin in standalone mode using the documentation here.
For the third step above (configuring the Gobblin job), refer to this example job and modify it per your requirements:
# Source Avro hive tables to convert
hive.dataset.whitelist=avro_db_name.*

# Configurations that instruct Gobblin to discover and convert the data (Do Not Change)
source.class=org.apache.gobblin.data.management.conversion.hive.source.HiveAvroToOrcSource
writer.builder.class=org.apache.gobblin.data.management.conversion.hive.writer.HiveQueryWriterBuilder
converter.classes=org.apache.gobblin.data.management.conversion.hive.converter.HiveAvroToFlattenedOrcConverter,org.apache.gobblin.data.management.conversion.hive.converter.HiveAvroToNestedOrcConverter
data.publisher.type=org.apache.gobblin.data.management.conversion.hive.publisher.HiveConvertPublisher
hive.dataset.finder.class=org.apache.gobblin.data.management.conversion.hive.dataset.ConvertibleHiveDatasetFinder

# Destination format and location
hive.conversion.avro.destinationFormats=flattenedOrc
hive.conversion.avro.flattenedOrc.destination.dataPath=/output_orc/

# Destination Hive table name (optionally with a postfix _orc)
hive.conversion.avro.flattenedOrc.destination.tableName=$TABLE_orc

# Destination Hive database name
hive.conversion.avro.flattenedOrc.destination.dbName=$DB

# Enable or disable schema evolution
hive.conversion.avro.flattenedOrc.evolution.enabled=true

# No host and port required. Hive starts an embedded hiveserver2 (Do Not Change)
hiveserver.connection.string=jdbc:hive2://

# Maximum lookback
hive.source.maximum.lookbackDays=3

## Gobblin standard properties ##
task.maxretries=1
taskexecutor.threadpool.size=75
workunit.retry.enabled=true

# Gobblin framework locations
mr.job.root.dir=/app/gobblin/working
state.store.dir=/app/gobblin/state_store
writer.staging.dir=/app/gobblin/writer_staging
writer.output.dir=/app/gobblin/writer_output

# Gobblin mode
launcher.type=LOCAL
classpath=lib/*
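Before launching a job, it can help to sanity-check the edited configuration. The Python sketch below parses a job file written as one `key=value` pair per line and reports destination keys that are missing for the chosen format. The parser is deliberately simplified (it ignores line continuations and the other separators that Java properties files allow), and the choice of which keys to treat as required is an assumption made for this example, not a rule taken from Gobblin. `SAMPLE_JOB`, `parse_properties`, and `missing_destination_keys` are hypothetical names.

```python
# A shortened copy of the example job, one property per line.
SAMPLE_JOB = """\
# Source Avro hive tables to convert
hive.dataset.whitelist=avro_db_name.*
hive.conversion.avro.destinationFormats=flattenedOrc
hive.conversion.avro.flattenedOrc.destination.dataPath=/output_orc/
hive.conversion.avro.flattenedOrc.destination.tableName=$TABLE_orc
hive.conversion.avro.flattenedOrc.destination.dbName=$DB
hiveserver.connection.string=jdbc:hive2://
"""

# Assumption for this sketch: these three destination settings must be present.
REQUIRED_SUFFIXES = ("destination.dataPath", "destination.tableName", "destination.dbName")


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")  # split on the first "=" only
        props[key.strip()] = value.strip()
    return props


def missing_destination_keys(props):
    """List the destination keys that are absent for the configured format."""
    fmt = props.get("hive.conversion.avro.destinationFormats")
    if fmt is None:
        return ["hive.conversion.avro.destinationFormats"]
    prefix = f"hive.conversion.avro.{fmt}."
    return [prefix + suffix for suffix in REQUIRED_SUFFIXES if prefix + suffix not in props]


props = parse_properties(SAMPLE_JOB)
print(missing_destination_keys(props))
```

A check like this catches a common mistake: switching the destination format to `nestedOrc` without renaming the format-specific `flattenedOrc` keys.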
To understand the conversion process better, let us break this down:
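hive.dataset.whitelist = avro_db_name.avro_table_name
The Avro data registered as a Hive dataset, specified in the database.table format. You can use regex here.
source.class = org.apache.gobblin.data.management.conversion.hive.source.HiveAvroToOrcSource
The internal Java class that Gobblin uses to initiate and run the conversion.
Do not change.
writer.builder.class = org.apache.gobblin.data.management.conversion.hive.writer.HiveQueryWriterBuilder
The internal Java class that Gobblin uses to create Hive DDL and DML queries to convert data from Avro to ORC.
Do not change.
converter.classes = org.apache.gobblin.data.management.conversion.hive.converter.HiveAvroToFlattenedOrcConverter,org.apache.gobblin.data.management.conversion.hive.converter.HiveAvroToNestedOrcConverter
The internal Java classes that Gobblin uses to augment Hive DDL and DML queries with schema conversion to nested or flattened ORC, depending on requirements.
Do not change.
data.publisher.type = org.apache.gobblin.data.management.conversion.hive.publisher.HiveConvertPublisher
The internal Java class that Gobblin uses to publish data from the intermediate location to the final location after the data has been converted successfully.
Do not change.
hive.dataset.finder.class = org.apache.gobblin.data.management.conversion.hive.dataset.ConvertibleHiveDatasetFinder
The internal Java class that Gobblin uses to find all partitions and the needed metadata about the Avro dataset to convert it to ORC.
Do not change.
hive.conversion.avro.destinationFormats = flattenedOrc
Whether to keep nested Avro records as-is or flatten them when converting to ORC. The alternate config option is nestedOrc.
hive.conversion.avro.flattenedOrc.destination.dataPath = /output_orc/
Output location to write ORC data to.
hive.conversion.avro.flattenedOrc.destination.tableName = $TABLE_orc
Output table name for the ORC data. The $TABLE macro carries forward the Avro table name; you can prefix or postfix it with any string, such as _orc in the sample config.
hive.conversion.avro.flattenedOrc.destination.dbName = $DB
Output database name for the ORC data. The $DB macro carries forward the Avro database name; you can prefix or postfix it with any string.
hive.conversion.avro.flattenedOrc.evolution.enabled = true
Whether or not to evolve the destination ORC schema if the source schema has evolved with new or updated column definitions.
hiveserver.connection.string = jdbc:hive2://
Gobblin uses an embedded Hive engine for running its internally generated queries.
Do not change.
hive.source.maximum.lookbackDays = 3
If Gobblin finds multiple partitions in the dataset, this config limits how far back, in days, it goes when picking up past partitions for conversion.
task.maxretries = 1, taskexecutor.threadpool.size = 75, workunit.retry.enabled = true
Standard Gobblin configs that govern the maximum number of retries, the thread pool size, and whether failed tasks are retried, respectively.
mr.job.root.dir, state.store.dir, writer.staging.dir, writer.output.dir
Standard Gobblin configs for the locations where Gobblin keeps intermediate working data, the state between recurring runs, staged writer data, and writer output, respectively.
launcher.type = LOCAL
Gobblin launcher type that governs how Gobblin runs. It includes Local (as in the example), MR, Yarn, and Cluster modes.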
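Two of the settings above are easier to reason about with a small example: the $DB and $TABLE macros in the destination names, and the lookback limit. The Python sketch below shows how the macros could expand and how a lookback window measured in days could select partitions. Both helpers are hypothetical. The boundary handling in `within_lookback`, and the reading of hive.source.maximum.lookbackDays as a window in days, are assumptions and do not describe Gobblin's exact behavior.

```python
from datetime import date, timedelta


def resolve_name(template, db, table):
    """Expand the $DB and $TABLE macros in a destination name template."""
    return template.replace("$DB", db).replace("$TABLE", table)


def within_lookback(partition_date, today, lookback_days):
    """Return True if the partition falls inside the lookback window (assumed inclusive)."""
    return partition_date >= today - timedelta(days=lookback_days)


# Hypothetical source table sales.events with daily partitions.
print(resolve_name("$TABLE_orc", db="sales", table="events"))  # destination table name
print(resolve_name("$DB", db="sales", table="events"))         # destination database name

today = date(2024, 1, 10)
partitions = [date(2024, 1, 5), date(2024, 1, 8), date(2024, 1, 10)]
print([p.isoformat() for p in partitions if within_lookback(p, today, lookback_days=3)])
```

A lookback window keeps recurring runs cheap, because each run only considers recent partitions instead of rescanning the whole table history.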
As the explanation above indicates, Apache Gobblin handles schema interconversion and compatibility through Apache Hive, which simplifies the conversion process by relying on Hive's SerDes behind the scenes. Apache Gobblin also has special provisions to evolve the destination schema across recurring executions on partitioned data, supports flattening of nested Avro data if desired, and retries failed tasks and stages data before publishing to preserve data integrity.
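The stage-then-publish idea mentioned above can be sketched in a few lines of Python. The example writes output to a staging path, retries a failed attempt after discarding any partial file, and only then publishes with a rename, so readers never see incomplete data. It illustrates the general pattern and is not Gobblin's publisher code; `run_with_staging`, `flaky_writer`, and the file names are hypothetical.

```python
import os
import tempfile


def run_with_staging(write_output, staging_path, final_path, max_retries=1):
    """Write to a staging path, retrying on failure, then publish with a rename."""
    for attempt in range(max_retries + 1):
        try:
            write_output(staging_path)
            break
        except RuntimeError:
            if os.path.exists(staging_path):
                os.remove(staging_path)  # discard partial output from the failed attempt
            if attempt == max_retries:
                raise  # retries exhausted: nothing is published
    # A rename within one filesystem is atomic on POSIX systems.
    os.replace(staging_path, final_path)


attempts = []


def flaky_writer(path):
    """Simulated conversion task that fails on its first attempt."""
    attempts.append(path)
    with open(path, "w") as handle:
        handle.write("partial" if len(attempts) == 1 else "complete")
    if len(attempts) == 1:
        raise RuntimeError("simulated task failure")


with tempfile.TemporaryDirectory() as workdir:
    staging = os.path.join(workdir, "part-0.staging")
    final = os.path.join(workdir, "part-0.orc")
    run_with_staging(flaky_writer, staging, final, max_retries=1)
    with open(final) as handle:
        published = handle.read()
    staging_left = os.path.exists(staging)

print(published, len(attempts), staging_left)
```

Because the final path only ever receives a fully written file, a failed run leaves the previously published data untouched.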