Modernizing Hadoop Ingest: Beyond Flume and Sqoop

A look at the easy-to-use StreamSets Data Collector for creating modern data flow ingestion pipelines for Hadoop.

By Dash Desai · Nov. 16, 18 · Tutorial

Apache Flume and Apache Sqoop were long the tools of choice for ingesting data into Hadoop, but their development has slowed and more usable tools are now available. Flume's configuration files and Sqoop's command line pale beside modern tools for defining data flow pipelines. And while Flume and Sqoop are good at smoothing out impedance mismatches and handling database dumps, they are complex to deploy and maintain, require custom coding and scripting, and are not enterprise-friendly.

In this post, we will look at the easy-to-use StreamSets Data Collector for creating modern data flow ingestion pipelines for Hadoop. Data Collector not only addresses "configuration hell" and eliminates manual scripting, but also provides automated ways to handle "data drift" (where the schema and/or semantics of incoming data change abruptly) without disrupting production workloads.

Let's consider this use case: purchase orders and transactions (stored across multiple tables) in an RDBMS need to be ingested into Hadoop for batch processing and further analysis.
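
To make the example concrete, imagine one of the source tables looks roughly like the sketch below. The actual schema isn't shown here, so all table, column, and type names are hypothetical.

    -- Hypothetical MySQL source table; names and types are illustrative only.
    CREATE TABLE orders (
      order_id    INT AUTO_INCREMENT PRIMARY KEY,
      customer_id INT NOT NULL,
      product     VARCHAR(100),
      amount      DECIMAL(10,2),
      order_ts    TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );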

Setup

Note: You can just as easily create a similar setup for Hortonworks, MapR FS, and Microsoft Azure, writing data to ADLS.

Pipeline Overview

Data Collector's JDBC Multitable Consumer origin is ideal for reading and ingesting multiple tables from one or more schemas in the same database, using multiple threads to process tables and partitions in parallel. In this example, the origin is configured to read multiple tables from MySQL on Amazon RDS.
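
Conceptually, each origin thread performs incremental reads per table along the lines of the sketch below. The real queries are generated by Data Collector from the configured offset column (assumed here to be order_id); this is only meant to illustrate the idea.

    -- Simplified view of what the origin does per table: read rows past the
    -- last saved offset, in offset order. Data Collector tracks the offset.
    SELECT *
    FROM orders
    WHERE order_id > :last_offset
    ORDER BY order_id;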

The next component in the pipeline is the Hive Metadata processor, which generates metadata and record header attributes that the Hadoop FS destination can use to process data. The Hive Metadata processor provides two output streams: one for writing data to Hadoop FS and one for writing metadata to the Hive Metastore. Note that the Hive Metadata, Hadoop FS, and Hive Metastore stages are all configured to connect to CDH.
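
In terms of what ends up in Hive, the net effect of the metadata stream is roughly equivalent to DDL like the sketch below. The storage format, partition column, and location shown are assumptions for illustration; the actual values depend on how the Hive Metadata and Hive Metastore stages are configured.

    -- Roughly the kind of DDL the Hive Metastore destination would apply for
    -- the orders table (format, partitioning, and location are assumptions).
    CREATE EXTERNAL TABLE IF NOT EXISTS orders (
      order_id    INT,
      customer_id INT,
      product     STRING,
      amount      DECIMAL(10,2)
    )
    PARTITIONED BY (dt STRING)
    STORED AS AVRO
    LOCATION '/user/hive/warehouse/orders';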

That's it! Using this pipeline, you're able to ingest data into Hadoop without having to write any custom code or scripts.

But wait, there's more!

In Data Collector, the pipeline setup illustrated above is also capable of handling scenarios where the schema changes at the origin. For example, a column is added to one or more tables in MySQL. In that case, the Hive Metadata processor detects this columnar "drift" and generates metadata records that describe the necessary changes, passing them down to the Hive Metastore destination. When the Hive Metastore destination receives a metadata record, it compares the proposed changes with the latest Hive metadata, and creates or updates Hive tables as needed. All of this works seamlessly, without any manual intervention or disruption to production workloads.

Note: The changes are reflected immediately in Hive, but Impala requires its metadata cache to be refreshed before the changes take effect. This metadata cache refresh can easily be automated by extending the pipeline as shown below.

The additional steps include configuring Hadoop FS to produce events every time it closes a file. These events can then be used to trigger a Hive Query executor task that invalidates and refreshes Impala's metadata cache to reflect the changes in the Hive metastore. For more details, refer to the documentation.
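
In SQL terms, the Hive Query executor would run Impala statements along these lines whenever a file-closed event arrives (the table name is hypothetical):

    INVALIDATE METADATA orders;  -- pick up new tables or columns from the metastore
    REFRESH orders;              -- pick up new data files in an existing table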

Pipeline Execution: Ingesting Into Hadoop

The initial ingestion shows purchase order transactions being read from MySQL and stored in Hadoop FS. Note that the "Output" record count is higher than the "Input" count; that's because metadata records are being written to the Hive Metastore in addition to the orders being written to HDFS.

Query orders in Hive:
Query orders in Impala:
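
The queries themselves are simple; a representative example (column names are hypothetical, since the orders schema isn't spelled out here) would be:

    -- Representative query; runs unchanged in both Hive and Impala.
    SELECT order_id, customer_id, product, amount
    FROM orders
    LIMIT 10;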

And this is the pipeline view after two new columns (including 'longitude') are added to the orders table and a new record is inserted.

Notice that Input shows one record (the new order) inserted, while Output shows two, including one metadata record for the schema change routed to the Hive Metastore.
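
For reference, the schema change on the MySQL side would be something like the sketch below; only the 'longitude' column is named above, so the second column name and the types are assumptions.

    ALTER TABLE orders
      ADD COLUMN latitude  DECIMAL(9,6),   -- assumed name for the second new column
      ADD COLUMN longitude DECIMAL(9,6);

    -- A new record inserted after the schema change (values are illustrative).
    INSERT INTO orders (customer_id, product, amount, latitude, longitude)
    VALUES (42, 'widget', 19.99, 37.7749, -122.4194);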

Query orders in Hive to confirm schema change:
Query orders in Impala to confirm schema change:
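
A quick way to confirm the change in either engine (again with hypothetical column names):

    -- Confirm the new columns appear in the table definition (Hive and Impala)...
    DESCRIBE orders;

    -- ...and that the newly inserted record carries values for them.
    SELECT order_id, latitude, longitude
    FROM orders
    ORDER BY order_id DESC
    LIMIT 1;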

Summary

As illustrated, Data Collector makes it straightforward to create data ingestion pipelines that gracefully handle "data drift" without writing any custom code or scripts.

Resources

StreamSets Data Collector is open source, under the Apache v2 license. To download for free and start developing your dataflow pipelines, visit the Download page.


Published at DZone with permission of Dash Desai, DZone MVB.