
Apache Avro to ORC Using Apache Gobblin

Data conversion from Apache Avro to Apache ORC using Apache Gobblin to optimize storage and query performance in big data infrastructures.

By Abhishek Tiwari · Jan. 10, 23 · Opinion

Apache Avro and Apache ORC 

Apache Avro and Apache ORC (Optimized Row Columnar) are top-level projects under the Apache Software Foundation. Fundamentally, they are data serialization formats with different strengths. 

Apache Avro is an efficient row-based binary file format for serializing data during transfer or at rest. It uses a schema to define the structure of the data being serialized, and that schema is stored as part of the Avro data file itself. As is frequently needed in the big data space, Avro was designed to support schema evolution: new fields can be added to the data structure without recompiling the code that consumes it.
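
For illustration, here is a minimal Avro schema; the record and field names are hypothetical. Adding the optional referrer field with a default value is a backward-compatible evolution: readers on the old schema keep working, and readers on the new schema get the default for old records.

{
  "type": "record",
  "name": "ClickEvent",
  "namespace": "com.example.events",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "url", "type": "string"},
    {"name": "event_time", "type": "long"},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}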

Apache ORC, on the other hand, is a column-based storage format (primarily for the Hadoop Distributed File System) that is optimized for storing and querying large datasets with high levels of compression. ORC is designed to improve the performance of query engines like Hive and Pig by providing a more efficient way to store and query data. ORC outperforms Avro in both processing speed and storage footprint, especially for large datasets, because it keeps data in a compact, columnar layout. ORC also supports predicate pushdown, which filters data at the storage layer, reducing the amount of data that needs to be read from disk and processed.
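
As a sketch (table and column names are hypothetical), this is how the same events could be declared as a compressed ORC table in Hive; a query filtering on user_id can then skip stripes whose min/max statistics rule out a match, which is predicate pushdown in action:

CREATE TABLE click_events_orc (
  user_id    BIGINT,
  url        STRING,
  event_time BIGINT
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Only stripes whose min/max range can contain 42 are read from disk.
SELECT url, event_time FROM click_events_orc WHERE user_id = 42;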

In short, Avro is a general-purpose data serialization format that is well-suited for transmitting data, whereas ORC is a specialized data storage format optimized for storing, querying, and processing large datasets.

Therefore, ORC may be a better choice than Avro if you are working with large datasets and need efficient queries and data processing. However, like most data infrastructures, your data lake might still hold data predominantly in Avro format. That is often the case because Avro was released in 2009 and cemented its footing in big data from its early days, whereas ORC arrived much later, in 2013.

Challenges in Converting Data From Apache Avro to Apache ORC

Converting data from Avro to ORC is particularly challenging, and you might face issues like:

  1. Schema conversion: Avro and ORC each have their own schema model. Converting a schema from Avro to ORC is time-consuming, more so if the schema is complex. Unfortunately, that is often the case with big data datasets.

  2. Data type differences: As with the schema models, Avro and ORC support distinct data types that do not map one-to-one (see the type-mapping sketch after this list). This typecasting further complicates the schema conversion.

  3. Performance: Transforming data from Avro to ORC is often resource-intensive for large datasets. It can take excruciatingly long if the job is not carefully crafted and heavily optimized.

  4. Loss of data: Even with appropriately written code, data loss is possible during conversion, primarily because of failures in intermediate tasks or incompatibilities between Avro and ORC fields.
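
To make the type-mapping point concrete, the sketch below shows how common Avro types typically land in Hive/ORC. This follows the standard Hive AvroSerDe conventions; verify the details against your Hive version.

Avro type            -> Hive/ORC type
--------------------------------------------------------
int                  -> INT
long                 -> BIGINT
float                -> FLOAT
double               -> DOUBLE
boolean              -> BOOLEAN
bytes                -> BINARY
string               -> STRING
enum                 -> STRING (the enum constraint is lost)
fixed                -> BINARY
record               -> STRUCT
array                -> ARRAY
map                  -> MAP (Avro map keys are always strings)
union of [null, T]   -> nullable T
other unions         -> UNIONTYPE (limited query support)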

Using Apache Gobblin To Convert Apache Avro Data to Apache ORC

To overcome the challenges of Avro to ORC data conversion, Apache Gobblin can be put to use.

Apache Gobblin is an open-source data integration framework that simplifies the process of extracting, transforming, and loading large datasets into a target system. While we will discuss Gobblin’s usage in the context of Avro to ORC data conversion, Gobblin has a host of other capabilities and provides a range of built-in connectors and converters that can be used to transfer data between different formats and systems, including Avro and ORC.

Apache Gobblin effectively addresses the challenges in converting Avro data to ORC. It converts the schema, maps data types, and has specialized capabilities like flattening a nested column if needed. Gobblin also supports fault tolerance, data validation, and data quality checks, thereby ensuring the integrity of the data being converted. Furthermore, it is highly configurable and customizable to address the specific requirements of any specialized scenario. We will cover the out-of-the-box configurations and capabilities provided by Gobblin in this article.

To use Gobblin for converting Avro data to ORC:

  1. Start by registering the Avro data with Apache Hive (partitioned or snapshot); a registration sketch follows this list.
  2. Download the Gobblin binaries from the Apache Gobblin downloads page.
  3. Modify the configuration detailed below this section to configure your Gobblin job.
  4. Launch and run Gobblin in standalone mode by following the Gobblin standalone deployment documentation.
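
For Step 1, here is a minimal sketch of registering an existing Avro dataset as an external Hive table. The database, table, paths, and schema URL are hypothetical; on recent Hive versions, STORED AS AVRO is a shorter equivalent of the explicit SerDe clauses.

CREATE EXTERNAL TABLE avro_db_name.click_events
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/avro/click_events'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/click_events.avsc');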

For Step 3 above, refer to this example job and modify it per your requirements: 

# Source Avro hive tables to convert
hive.dataset.whitelist=avro_db_name.*

# Configurations that instruct Gobblin to discover and convert the data (Do Not Change)
source.class=org.apache.gobblin.data.management.conversion.hive.source.HiveAvroToOrcSource
writer.builder.class=org.apache.gobblin.data.management.conversion.hive.writer.HiveQueryWriterBuilder
converter.classes=org.apache.gobblin.data.management.conversion.hive.converter.HiveAvroToFlattenedOrcConverter,org.apache.gobblin.data.management.conversion.hive.converter.HiveAvroToNestedOrcConverter
data.publisher.type=org.apache.gobblin.data.management.conversion.hive.publisher.HiveConvertPublisher
hive.dataset.finder.class=org.apache.gobblin.data.management.conversion.hive.dataset.ConvertibleHiveDatasetFinder

# Destination format and location
hive.conversion.avro.destinationFormats=flattenedOrc
hive.conversion.avro.flattenedOrc.destination.dataPath=/output_orc/

# Destination Hive table name (optionally with a postfix _orc)
hive.conversion.avro.flattenedOrc.destination.tableName=$TABLE_orc

# Destination Hive database name
hive.conversion.avro.flattenedOrc.destination.dbName=$DB

# Enable or disable schema evolution
hive.conversion.avro.flattenedOrc.evolution.enabled=true

# No host and port required. Hive starts an embedded hiveserver2 (Do Not Change)
hiveserver.connection.string=jdbc:hive2://

# Maximum lookback
hive.source.maximum.lookbackDays=3

## Gobblin standard properties ##
task.maxretries=1
taskexecutor.threadpool.size=75
workunit.retry.enabled=true

# Gobblin framework locations
mr.job.root.dir=/app/gobblin/working
state.store.dir=/app/gobblin/state_store
writer.staging.dir=/app/gobblin/writer_staging
writer.output.dir=/app/gobblin/writer_output

# Gobblin mode
launcher.type=LOCAL
classpath=lib/*


To understand the conversion process better, let us break this down: 

  • hive.dataset.whitelist = avro_db_name.avro_table_name
    The Avro data, registered as a Hive dataset, is specified in database.table format. You can use a regex here; the sample config whitelists every table in avro_db_name.
  • source.class = org.apache.gobblin.data.management.conversion.hive.source.HiveAvroToOrcSource 
    The internal Java class that Gobblin uses to initiate and run the conversion.
    Do not change.
  • writer.builder.class=org.apache.gobblin.data.management.conversion.hive.writer.HiveQueryWriterBuilder
    The internal Java class that Gobblin uses to create Hive DDL and DML queries to convert data from Avro to ORC.
    Do not change.  
  • converter.classes=org.apache.gobblin.data.management.conversion.hive.converter.HiveAvroToFlattenedOrcConverter,org.apache.gobblin.data.management.conversion.hive.converter.HiveAvroToNestedOrcConverter
    The internal Java classes that Gobblin uses to augment the Hive DDL and DML queries with schema conversion to a flattened or nested layout, depending on requirements.
    Do not change.
  • data.publisher.type=org.apache.gobblin.data.management.conversion.hive.publisher.HiveConvertPublisher
    The internal Java class that Gobblin uses to publish data from the intermediate location to the final location after successful conversion.
    Do not change.
  • hive.dataset.finder.class=org.apache.gobblin.data.management.conversion.hive.dataset.ConvertibleHiveDatasetFinder
    The internal Java class that Gobblin uses to find all partitions and the needed metadata about the Avro dataset to convert it to ORC.
    Do not change. 
  • hive.conversion.avro.destinationFormats=flattenedOrc
    Whether to convert nested Avro records as-is or flatten them when converting to ORC. The alternate config option is nestedOrc; see the nested variant sketch after this list.
  • hive.conversion.avro.flattenedOrc.destination.dataPath=/output_orc/
    Output location to write the ORC data to.
  • hive.conversion.avro.flattenedOrc.destination.tableName=$TABLE_orc
    Output table name for the ORC data. The $TABLE macro carries forward the Avro table name; you can prefix or postfix it with any string (for example, _orc in the sample config).
  • hive.conversion.avro.flattenedOrc.destination.dbName=$DB
    Output database name for the ORC data. The $DB macro carries forward the Avro database name; you can prefix or postfix it with any string.
  • hive.conversion.avro.flattenedOrc.evolution.enabled=true
    Whether or not to evolve destination ORC schema if the source schema has evolved with new or updated column definitions.
  • hiveserver.connection.string=jdbc:hive2://
    Gobblin uses an embedded Hive engine for running its internally generated queries.
    Do not change.
  • hive.source.maximum.lookbackDays=3
    If Gobblin finds multiple partitions in the dataset, this config limits how many days' worth of past partitions it picks up for conversion.
  • task.maxretries=1 
    taskexecutor.threadpool.size=75 
    workunit.retry.enabled=true
    Standard Gobblin configs that govern the maximum number of task retries, the thread pool size, and whether failed work units should be retried, respectively.
  • mr.job.root.dir=/app/gobblin/working 
    state.store.dir=/app/gobblin/state_store 
    writer.staging.dir=/app/gobblin/writer_staging 
    writer.output.dir=/app/gobblin/writer_output
    Standard Gobblin locations for the job's working directory, the state kept between recurring runs, staged data, and writer output, respectively.
  • launcher.type=LOCAL 
    classpath=lib/*
    The Gobblin launcher type, which governs how Gobblin runs. Options include LOCAL (as in the example), MR, YARN, and cluster modes.
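
If you prefer to keep the nested structure instead of flattening, the destination section of the config would use the nestedOrc format. This is a sketch; the property names mirror the flattenedOrc ones, so verify them against your Gobblin version's documentation.

hive.conversion.avro.destinationFormats=nestedOrc
hive.conversion.avro.nestedOrc.destination.dataPath=/output_orc_nested/
hive.conversion.avro.nestedOrc.destination.tableName=$TABLE_orc
hive.conversion.avro.nestedOrc.destination.dbName=$DB
hive.conversion.avro.nestedOrc.evolution.enabled=true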

As the explanation above indicates, Apache Gobblin handles schema interconversion and compatibility through Apache Hive, simplifying the conversion process via Hive SerDes behind the scenes. Gobblin also has special provisions to evolve the schema on the destination across recurring executions over partitioned data, supports flattening nested Avro data if desired, and provides retries and staging of data before publishing to preserve data integrity.
