Building an ETL Pipeline With Airflow and ECS
Get your data flowing.
Join the DZone community and get the full member experience.Join For Free
Each day, enterprise-level companies collect, store and process different types of data from multiple sources. Whether it’s a payroll system, sales records, or inventory system, this torrent of data has to be attended to.
And if you process data from multiple sources that you want to squeeze into a centralized database, you need to:
And Load data into the target database.
That’s what the ETL systems are about.
The Challenge of ETL Pipelines
Usually, one ETL tool is used to cover all three of these steps, i.e., extracting, cleaning, and loading data. ETLs have become an essential aspect that ensures the usability of data for analytics, reporting, or ML systems. However, the nature of ETL and data has evolved significantly over the last few years. That is why building an ETL pipeline is somewhat challenging these days.
Every load is an uphill struggle and here are some of the common mishaps:
Varying type casting
Delimiter dramas and duplicates
The integrity of sequential jobs
Parallel loads and data volumes
Data loss during loads
….the list goes on.
Obviously, we can use one of the many ready-made ETL systems that implement the functions of loading information into the corporate data warehouse. Informatica PowerCenter, Oracle Data Integrator, SAP Data Services, Oracle Warehouse Builder, Talend Open Studio, Pentaho are just a sliver of off-the-shelf solutions. However, when it comes to large volumes of data at high speeds and Big Data infrastructure already in place, boxed solutions fall flat to satisfy your needs.
Therefore, Big Data pipelines require something like Apache Airflow. It’s an open-source set of libraries for developing, planning, and monitoring workflows. Airflow is written in Python and allows you to create and configure task chains both visually with a clear web-GUI and to write Python program code.
Among the top reasons, Airflow offers:
Pipelines configured as code (Python), which works wonders for dynamic pipeline generation;
Modest, but a complete toolkit for creating and managing data processes - 3 types of operators (sensors, handlers, and transfers) - you can also write your own;
High scalability that allows for adding or deleting workers easily;
Flexible task dependency definitions with subdags and task branching;
Flexible schedule settings and backfilling;
Support for task priority settings and load management is built-in;
Different types of connections, including DB, S3, SSH, HDFS, etc.;
Exceptional web-based graphical interface for creating data pipelines that display graph view, tree view, task duration, and others;
Integration with multiple sources and services - databases (MySQL, PostgreSQL, DynamoDB, Hive), Big Data storage (HDFS, Amazon S3), and cloud platforms (Google Cloud Platform, Amazon Web Services, Microsoft Azure).
Let’s Build An ETL Pipeline Like a Boss
In this post, I’ll be setting up the pipeline with the following tools:
AWS ECS/Fargate: we’ll use it to run containers without having to manage servers or clusters of Amazon EC2 instances.
AWS s3: AWS object storage service.
The sine qua non is a ready-to-deploy ETL script written in Python. And as we'll be using a container-based application, we'll need to create a Docker image that has all of the necessary components to perform the ETL process and configure it.
With the docker push command, you can push your container images to an Amazon ECR repository. Create and push Docker manifest lists, which are used for multi-architecture images, with Amazon ECR. Every image in a manifest list must also be pushed to your repository.
Create the ECS Cluster
Here’s what it takes to create a cluster:
Create a cluster (L for logic). Fargate is a launch type that allows you to run containerized apps without having to maintain EC2 instances; instead, you pay for activities that are completed.
Task definition: this step prepares your application to operate on Amazon ECS. The task definition is a JSON-formatted text file that describes one or more containers (up to 10) that make up your application. It's similar to a blueprint for your application. The parameters for your application are also specified in task definitions.
Add a Container: Next, you need to add your container to ECS Fargate.
This is not an exhaustive list of steps needed to create a cluster. Check Youtube tutorials for more comprehensive information.
Deploy Apache Airflow
The cool thing about Airflow is that it turns your complex ETL project into a Python one, which means you can tweak it as you like. Whether it’s your team size or infrastructure features, Airflow allows you to manipulate your ETL. Also, Airflow notably benefits from a wide range of community-contributed operators.
First, you need to create a Docker image for Airflow. You can either leverage the official one in DockerHub or hone up your skills by creating it on your own.
After building the docker image, you’ll need a volume that maps the directory on the local machine with DAG definitions and the locations where Airflow reads them on the container.
If you’re embracing Airflow for ETL (which you def are), you’ll have to connect your Airflow deployment to the databases. And that’s when Airflow Connections come on stage. This whole process is quite easy in Airflow, requiring only the standard setup of a login, password, host, and port. You can add Connections using the Airflow UI, or you can create them automatically using environment variables.
A DAG or a Directed Acyclic Graph is some semantic grouping of your tasks that you want to perform in a strictly defined sequence according to a certain schedule. Airflow provides a convenient web interface for working with DAGs and other entities.
Your DAG should contain a unique identifier (dag_id), an argument dictionary, a start interval, and variables. DAGs list tasks as operators with defined relationships between them. Once everything is ready, all you need is to run your DAG.
This post goes over the main stages of creating an ETL pipeline with Airflow. Usually, most data professionals struggle to run an ETL pipeline, with many obstacles that stand between them and a successful ETL implementation. I now hope that this article will help you to get the basics right and pull off this business-critical process.
Opinions expressed by DZone contributors are their own.