Running Data Preparations on Your Data Lake With Talend and Apache Beam

Learn about reading data directly from an HDFS file system and exporting a full data set out or writing it back to a different location on your HDFS.

You may have seen that the first stable version of Apache Beam (v2.0) was recently released. Apache Beam is a unified programming model designed for batch and streaming data processing. It’s powerful and portable, which is why we’ve been actively contributing to the project since the very beginning. Recently, we’ve integrated Apache Beam into Talend Data Preparation. François Lacas wrote a great blog post describing the new features in the release, but I want to take you down a level and show you how Apache Beam actually works in our Data Preparation product.

(Psst: If you are interested in learning more about the Apache Beam project, check out this link.)

Apache Beam 101

Apache Beam, at its core, provides a layer of abstraction between your integration patterns and the actual runtime environment. This abstraction layer lets users code a data integration process using Beam’s SDKs; when it is time to run the process, they pick what is called a Beam Runner for whatever processing architecture and runtime they need or want to use. This can be Spark, Google Cloud Dataflow, Flink, or whatever you will want to process data on in the future. Beam also works for both batch and streaming workloads. So, depending on what you are connecting to, Beam will know what type of runners you can use. The community is building the different runners for the different platforms, making this a true abstraction from integration code to runtime environments.
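To make the split between pipeline definition and runner concrete, here is a minimal toy sketch in plain Python. It is not the Beam SDK: ToyPipeline, EagerRunner, and LazyRunner are hypothetical names invented for this illustration. It only shows the idea that one pipeline definition can be handed to different execution strategies and still produce the same result.

```python
from typing import Callable, Iterable, List

# Hypothetical toy model: none of these names come from the Beam SDKs.
Transform = Callable[[Iterable[str]], Iterable[str]]


class ToyPipeline:
    """Records transforms in order without executing anything."""

    def __init__(self) -> None:
        self.transforms: List[Transform] = []

    def apply(self, transform: Transform) -> "ToyPipeline":
        self.transforms.append(transform)
        return self


class EagerRunner:
    """Materializes a full list after every step (batch-like execution)."""

    def run(self, pipeline: ToyPipeline, source: Iterable[str]) -> List[str]:
        data = list(source)
        for transform in pipeline.transforms:
            data = list(transform(data))
        return data


class LazyRunner:
    """Chains generators so rows flow through one at a time (stream-like)."""

    def run(self, pipeline: ToyPipeline, source: Iterable[str]) -> List[str]:
        stream: Iterable[str] = iter(source)
        for transform in pipeline.transforms:
            stream = transform(stream)
        return list(stream)


def strip_whitespace(rows: Iterable[str]) -> Iterable[str]:
    return (row.strip() for row in rows)


def drop_empty(rows: Iterable[str]) -> Iterable[str]:
    return (row for row in rows if row)


def upper_case(rows: Iterable[str]) -> Iterable[str]:
    return (row.upper() for row in rows)


# The pipeline is defined once, independently of how it will be executed.
pipeline = ToyPipeline().apply(strip_whitespace).apply(drop_empty).apply(upper_case)
rows = ["  paris ", "", "berlin", "   "]

# The execution strategy is chosen only at run time.
eager_result = EagerRunner().run(pipeline, rows)
lazy_result = LazyRunner().run(pipeline, rows)
print(eager_result, lazy_result)
```

In real Beam the runner is a distributed engine such as Spark or Flink rather than a local loop, but the shape is the same: the transforms are described once, and the choice of engine is deferred.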

Beam Me Up to Better Data Prep

The video below shows two major capabilities of Talend Data Preparation, the second of which uses Beam:

  1. Reading data directly from an HDFS file system.
  2. Exporting a full data set out or writing it back to a different location on your HDFS.  

Reading the Parquet file from HDFS uses Talend’s new Component Catalog Framework and SDKs. The fun part about the example in the video is the Parquet-formatted file: it is a multi-part file with metadata files and data files. Talend Data Preparation can now read the metadata and pull column header names into the tool, and it takes a sampling from all of the file parts to give the user a quality sample of the entire data set. You’ll see us run through this in the video below.
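The article does not say how the sample is drawn, so the following is only one plausible way to sample across all parts of a multi-part file: take rows round-robin from each part until the sample is full. The function name sample_across_parts and the in-memory lists standing in for Parquet part files are assumptions made for this sketch.

```python
from itertools import chain, islice, zip_longest
from typing import Dict, Iterable, List, Sequence

Row = Dict[str, str]

# Sentinel marking "this part has run out of rows".
_MISSING = object()


def sample_across_parts(parts: Sequence[Iterable[Row]], sample_size: int) -> List[Row]:
    """Take rows round-robin from every part until sample_size rows are collected."""
    # zip_longest yields one row from each part per round, padding exhausted parts.
    interleaved = chain.from_iterable(zip_longest(*parts, fillvalue=_MISSING))
    present = (row for row in interleaved if row is not _MISSING)
    # islice stops reading as soon as the sample is full.
    return list(islice(present, sample_size))


# Stand-ins for the data files of a multi-part data set (hypothetical rows).
part_0 = [{"city": "Paris"}, {"city": "Lyon"}, {"city": "Nice"}]
part_1 = [{"city": "Berlin"}]
part_2 = [{"city": "Rome"}, {"city": "Milan"}]

sample = sample_across_parts([part_0, part_1, part_2], sample_size=4)
print([row["city"] for row in sample])
```

Drawing from every part, rather than reading only the first file, keeps the sample from being skewed toward whatever happened to be written first.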

The second part of the video, where we export the data back to HDFS, is the really cool part. In the background, the Talend Data Preparation Server builds a complete end-to-end Spark processing job and submits that Spark job to your cluster using Beam and a Spark Runner, as described above. Once you select Export in the Talend Data Preparation tool and choose the full data set to HDFS, it exports three things: the connection information from the import process (Component Catalog), the preparation steps (“the recipe”) of changes needed for the data preparation, and the target location on the Spark cluster. The Talend Data Preparation Server sends all this information to a Flowrunner, which converts it into Apache Beam code. The Apache Beam code is then submitted to a Spark Job Server, which is a Beam Spark runner that has been configured to connect to your Spark cluster using the proper security and access rights set up by your IT administration team.
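As a rough illustration of the export described above, the sketch below models the three pieces of information (source connection, recipe, target) and compiles the recipe into a single row-level function. This is not Talend's actual payload format or the Flowrunner's logic: the ExportRequest class, the action names "trim" and "uppercase", and the HDFS paths are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]
Step = Callable[[Row], Row]


@dataclass
class ExportRequest:
    """Hypothetical shape of an export: where to read, what to do, where to write."""
    source: Dict[str, str]
    recipe: List[Dict[str, str]]
    target: Dict[str, str]


def _trim(column: str) -> Step:
    return lambda row: {**row, column: row[column].strip()}


def _uppercase(column: str) -> Step:
    return lambda row: {**row, column: row[column].upper()}


# Hypothetical registry mapping recipe action names to step factories.
ACTIONS: Dict[str, Callable[[str], Step]] = {"trim": _trim, "uppercase": _uppercase}


def compile_recipe(recipe: List[Dict[str, str]]) -> Step:
    """Turn the declarative recipe into one function applied to each row."""
    steps = [ACTIONS[step["action"]](step["column"]) for step in recipe]

    def apply_all(row: Row) -> Row:
        for step in steps:
            row = step(row)
        return row

    return apply_all


request = ExportRequest(
    source={"format": "parquet", "path": "hdfs:///data/in/customers"},  # made-up path
    recipe=[
        {"action": "trim", "column": "name"},
        {"action": "uppercase", "column": "name"},
    ],
    target={"format": "csv", "path": "hdfs:///data/out/customers"},  # made-up path
)

apply_recipe = compile_recipe(request.recipe)
cleaned = apply_recipe({"name": "  ada lovelace ", "city": "London"})
print(cleaned)
```

Keeping the recipe declarative is what makes the conversion step possible: the same list of actions can be replayed on a small sample in the tool or translated into a distributed job for the full data set.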

The Spark Job Server (the Beam Runner) submits the job, as native Spark code, to the cluster’s Resource Manager, and the job runs in the cluster. The Spark Job Server monitors the status of the job on the Resource Manager and reports back the completion status once the job has finished. The Talend Data Preparation Server then shows the completion status to the user in the export history dialog of the preparation it ran.
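The monitoring step above amounts to polling a job until it reaches a final state. Here is a minimal sketch of such a loop, assuming a hypothetical fetch_status callable and made-up state names; it does not reflect the actual Spark Job Server or Resource Manager APIs.

```python
import time
from typing import Callable

# Hypothetical state names chosen for this sketch.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "KILLED"}


def wait_for_completion(
    fetch_status: Callable[[], str],
    poll_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status until a terminal state is seen or max_polls is reached."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    return "TIMED_OUT"


# Simulated status sequence standing in for a real cluster.
fake_statuses = iter(["ACCEPTED", "RUNNING", "RUNNING", "SUCCEEDED"])
final_status = wait_for_completion(
    lambda: next(fake_statuses),
    poll_seconds=0,
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(final_status)
```

Passing sleep in as a parameter keeps the loop easy to exercise without waiting; a production monitor would also want to handle transient connection errors.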

By using Apache Beam as the processing back end for Data Preparations, we will be able to let customers pick which engine they want to run their preparations on. Today, it is Apache Spark; tomorrow it might be Flink or Apex. The beauty of Apache Beam is choice! No matter the processing technology, Talend Data Preparation will enable you to process and cleanse data using modern data tools.

Published at DZone with permission of Mark Balkenende, DZone MVB. See the original article here.
