Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Bringing Relational Data Into Data Lakes

DZone's Guide to

Bringing Relational Data Into Data Lakes

Data lakes are piles of data with various kinds (and lack of) structure. How can we bring relational data into these for clients? Here, step through a tutorial that shows you how to make it happen.

· Big Data Zone
Free Resource

Free O'Reilly eBook: Learn how to architect always-on apps that scale. Brought to you by Mesosphere DC/OS–the premier platform for containers and big data.

Data lakes comprise of unstructured, semi-structured and structured data. A common practice for a lot of companies is to bring in data from relational data stores so that various consumers can use it. In this blog post, we will describe how to easily and quickly build a data pipeline with Cask Hydrator. We will take a look at getting data from a database source and storing it in time partitioned files, and then enable SQL access on top of the data.

Cask Hydrator, a 100% open source extension of the Cask Data Application Platform (CDAP), allows building data pipelines through a simple drag and drop user interface. Here is an earlier blog post introducing Hydrator.

Let’s consider a scenario where an e-commerce company is storing registration information in a MySQL database. The company wishes to take a daily dump of registration data and store it in the Hadoop file system in a way that can easily be addressed by time and explored in an ad-hoc manner using SQL queries.

Let’s see how to build a data pipeline to ingest this data. Hydrator Studio view is a visual editor for creating a logical data pipeline.

image01

The Source

The first step in building the data pipeline is configuring the source. Hydrator supports Database sources out-of-the-box using the Database plugin.

Loading JDBC drivers

The Database Source plugin uses JDBC drivers to talk to the database and bring the data in. For this purpose, you need to deploy the appropriate JDBC driver into Hydrator. This particular use case deals with loading the data from a MySQL database, and the steps below highlight loading the database driver.

Step 1: Download the appropriate version of the JDBC driver JAR. We will use MySQL driver v5.1.38.
Step 2: Load the JDBC driver using CDAP REST APIs.

curl -X POST <hostname>:10000/v3/namespaces/default/artifacts/mysql-connector-java \
  -H "Artifact-Plugins: [ { 'name': 'mysql', 'type': 'jdbc', 'className': 'com.mysql.jdbc.Driver' } ]" \
  -H "Artifact-Version: 5.1.38" \
  -H "Artifact-Extends: system:cdap-etl-batch[3.4.0, 4.0.0]/system:cdap-data-pipeline[3.4.1, 4.0.0]" \
  --data-binary @mysql-connector-java-5.1.38.jar

Configuring the Source

To configure the source choose the Database source and provide the required parameters:

  • The username and password to connect to the database.
  • The JDBC Plugin name to be used for the database connection. That is, the name we used in Step 2, namely “mysql”.
  • The JDBC connection string
  • The SELECT query used for importing the data

You can optionally set the number of splits to 1 to fetch all the data in a single map task.

image02

Configuring the Destination

The next step in building this pipeline is to configure the sink or destination for the imported data. Hydrator provides capabilities to store data in time-partitioned directories via a built-in CDAP Dataset called Time-partitioned File Set.  Once the data is stored in the fileset, CDAP automatically adds a partition which can be queried using Hive.

In this use case, we will configure a Time-partitioned File Set that stores data in Avro format by using TPFSAvro as the sink.

Connecting the Source and the Destination

Now let’s connect the source and the sink. The schema configured in the source is automatically copied to the sink and will be used.

image04

And Finally, Running the Data Pipeline

We have now built the logical pipeline for bringing data in from the database. The next step is to publish the pipeline, which converts the logical pipeline into a physical execution plan using workflows of either MapReduce or Spark jobs.

image05

Simply click on Run to execute the pipeline once. A pipeline run launches a workflow that runs a MapReduce program that will read the data from the database source and write to the Time-partitioned File Set sink. The pipeline can also be scheduled to run periodically on a time base.

Exploring the Data in Hadoop

The data that is fetched can be explored using Hive queries from the Explore page of CDAP. Users can give a Hive QL query to explore.

image03image00

We demonstrated how Hydrator can be used to easily build data ingest pipelines with a few clicks. Learn more by downloading CDAP, which includes Hydrator, and try out your own use cases.

Easily deploy & scale your data pipelines in clicks. Run Spark, Kafka, Cassandra + more on shared infrastructure and blow away your data silos. Learn how with Mesosphere DC/OS.

Topics:
relational data ,data lakes

Published at DZone with permission of Shankar Selvam, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}