Part 1: Deploying a Distributed SQL Backend for Apache Airflow on Google Cloud
In this post we’ll show you how to get Airflow configured to use a highly available, distributed SQL backend.
Apache Airflow is a popular platform for programmatically authoring, scheduling, and monitoring workflows. Airflow has been deployed by companies like Adobe, Airbnb, Etsy, Instacart, and Square. The advantage of defining workflows as code is that they become more maintainable, versionable, testable, and collaborative. Airflow is used to author these workflows as directed acyclic graphs (DAGs) of tasks. Airflow’s scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. Its browser-based UI makes it easy to visualize pipelines running in production, monitor their progress, and troubleshoot issues when needed.
Why Airflow With a YugabyteDB Backend?
By default, Airflow uses a SQLite database for its metadata store, which both the scheduler and web UI rely on. Typically, when Airflow is used in production, the SQLite backend is replaced with a traditional RDBMS like PostgreSQL. However, to keep PostgreSQL from becoming a single point of failure in the Airflow deployment, administrators still need to devise high-availability and failover strategies for it. There's a simpler option: YugabyteDB, which Airflow can interact with just like PostgreSQL, but which adds high availability, support for multiple cloud and topology deployment options, and high performance built in.
What’s YugabyteDB? It is an open source, high-performance distributed SQL database built on a scalable and fault-tolerant design inspired by Google Spanner. Yugabyte’s SQL API (YSQL) is PostgreSQL wire compatible.
When YugabyteDB is used in place of PostgreSQL in an Airflow deployment, the administrator no longer has to worry about maintaining complex PostgreSQL replication and failover schemes to minimize the possibility of a failed backend causing dataflows to fail. Instead, a distributed SQL backend like YugabyteDB can deliver PostgreSQL-compatibility, high availability, geo data distribution, and high performance in a single open source system.
In this post, part one of a two-part blog series, we'll show you how to configure Airflow to use YugabyteDB as a backend. In part two, we'll show you how to build an Airflow workflow that migrates data between PostgreSQL and YugabyteDB.
For the purposes of this demo, our Airflow and PostgreSQL installations will make use of virtual machines on Google Compute Engine and YugabyteDB will be installed on Google Kubernetes Engine.
In this post we’ll cover the following steps:
- Deploy a VM for Airflow
- Configure a firewall rule for the Airflow UI
- Install Apache Airflow
- Start the Airflow Service
- Set up a GKE Cluster
- Install YugabyteDB on GKE
- Build the Airflow backend on YugabyteDB
- Restart Airflow with a YugabyteDB backend
Below is the environment that we’ll be using for the purposes of this blog.
Note: For the purposes of this demo, we aren't creating a particularly secure deployment; instead, we're focusing on demonstrating how to wire everything up with the least amount of fuss. In a production deployment you'll want to enforce additional security measures throughout the stack.
Step 1: Deploy a VM for Airflow
For the purposes of this demo, I have specified the following configuration for the Google Compute Engine VM that will be hosting my Airflow server.
- Name: airflowvm
- Machine Type: n1-standard-2 (2vCPU, 7.5 GB memory)
- OS: Ubuntu 18.04 LTS
- Disk: 100 GB
- Firewall: Allow HTTP/HTTPS traffic
Step 2: Configure a Firewall Rule for the Airflow UI
To access the Airflow Web UI we’ll need to set up a firewall rule to expose an Airflow ingress on port 8080. In the Google Cloud Console you’ll want to open up the Firewall rules page and create a rule with the following characteristics:
- Name: airflowwebui
- Direction: Ingress
- Targets: All instances in the network
- Source Filter: IP ranges
- Source IP Ranges: 0.0.0.0/0
- Protocols and ports: TCP, 8080
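If you prefer the command line to the console, an equivalent rule can be created with gcloud; the invocation below is a sketch that assumes the project's default network:

```shell
# Create an ingress rule exposing TCP 8080 to all sources (demo only --
# in production you would restrict the source ranges)
gcloud compute firewall-rules create airflowwebui \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:8080 \
    --source-ranges=0.0.0.0/0
```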
Step 3: Install Apache Airflow
Next, let’s install Apache Airflow on the VM by running the following commands:
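The original commands aren't reproduced here, so the following is a sketch for Ubuntu 18.04; the apache-airflow pip package is real, but the use of the postgres extra and the AIRFLOW_HOME location are our assumptions:

```shell
# Install pip, then Airflow; the [postgres] extra pulls in psycopg2,
# which we'll need later to talk to YugabyteDB's YSQL API
sudo apt-get update && sudo apt-get install -y python3-pip
export AIRFLOW_HOME=~/airflow
pip3 install "apache-airflow[postgres]"
```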
Verify that Apache Airflow is installed by executing the following command:
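A typical verification is simply:

```shell
# Prints the installed Airflow version if the install succeeded
airflow version
```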
Step 4: Start the Airflow Service
Initialize the Default Backend
To get Airflow initialized with the default SQLite backend we’ll need to execute the following commands:
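In Airflow 1.10.x, initializing the default SQLite metadata store is a single command (note that in Airflow 2.x this was renamed to "airflow db init"):

```shell
# Creates airflow.cfg and an SQLite metadata database under $AIRFLOW_HOME
airflow initdb
```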
Start the Airflow Scheduler
Next, start the Airflow scheduler service by running the following command:
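One way to do this (running it in the background and logging to a file is our choice here, not a requirement):

```shell
# Run the scheduler in the background and capture its output
airflow scheduler > scheduler.log 2>&1 &
```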
Start the Airflow Web Server
Start the Airflow web server with:
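For example, serving the UI on port 8080 to match the firewall rule from Step 2:

```shell
# Run the web server in the background on port 8080
airflow webserver -p 8080 > webserver.log 2>&1 &
```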
Access the Airflow web UI
To access the web UI, go to http://<airflowvm IP>:8080
At this point we have Airflow up and running on GCP. The next step is to set up YugabyteDB on GKE.
Step 5: Set up a GKE Cluster
If you recall, YugabyteDB will serve two purposes in this demo. First, as a backend for Airflow and second, as a target for our example DAG that will detect and move updates from PostgreSQL tables to YugabyteDB.
For the purposes of this demo I have specified the following configuration for my GKE cluster that will host YugabyteDB:
- Name: yugabytedbgke
- Number of Nodes: 3
- Machine Type: n1-standard-8 (8vCPU, 30 GB memory)
- Max Pods: 48
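If you'd rather script the cluster creation than click through the console, a gcloud sketch matching the configuration above looks like the following (zone and other defaults are assumptions, and --max-pods-per-node requires a VPC-native cluster):

```shell
gcloud container clusters create yugabytedbgke \
    --num-nodes=3 \
    --machine-type=n1-standard-8 \
    --max-pods-per-node=48

# Fetch credentials so that kubectl targets the new cluster
gcloud container clusters get-credentials yugabytedbgke
```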
Once the GKE cluster is live, we are ready to log into the cluster and install YugabyteDB.
Step 6: Install YugabyteDB on GKE
Note: As of the writing of this post, the current version of YugabyteDB is 2.1.6. You can check for the latest version of YugabyteDB on our GitHub releases page.
Create a Service Account
Before we can create the cluster, we need a service account that has been granted the cluster-admin role. Use the following command to create a yugabyte-helm service account granted with the ClusterRole of cluster-admin.
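The original command isn't shown; with Helm 2 (current at the time of YugabyteDB 2.1.6) this was typically done along these lines:

```shell
# Create the service account, then bind it to the cluster-admin role;
# the default namespace here is an assumption
kubectl create serviceaccount yugabyte-helm
kubectl create clusterrolebinding yugabyte-helm \
    --clusterrole=cluster-admin \
    --serviceaccount=default:yugabyte-helm
```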
Create a Namespace
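A dedicated namespace keeps the YugabyteDB resources grouped together; the name yb-demo is an assumption, so substitute your own:

```shell
kubectl create namespace yb-demo
```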
Add the Charts Repository
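The YugabyteDB Helm charts are published at charts.yugabyte.com:

```shell
helm repo add yugabytedb https://charts.yugabyte.com
```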
Update the Repository
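After updating the repository you can install the chart. The install command below uses Helm 2 syntax, which was current when this post was written; the release and namespace names (yb-demo) are assumptions:

```shell
helm repo update

# Helm 2 syntax; with Helm 3 this would be:
#   helm install yb-demo yugabytedb/yugabyte -n yb-demo --wait
helm install yugabytedb/yugabyte --name yb-demo --namespace yb-demo --wait
```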
To check the status of the YugabyteDB cluster, execute the command below:
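For example (the yb-demo release and namespace names are assumptions; use whatever you chose at install time):

```shell
# Show the Helm release status and the YugabyteDB pods
helm status yb-demo
kubectl get pods -n yb-demo
```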
You can also navigate to the Services & Ingress view in the GCP Console and take note of the various endpoints, IPs, and ports.
Congrats! At this point you have a three node YugabyteDB cluster running on GKE.
Step 7: Build the Airflow Backend on YugabyteDB
YugabyteDB currently doesn't support the ALTER TABLE ... ADD PRIMARY KEY command (GitHub Issue #1104), which means that the Airflow initdb command won't work out of the box. The simple workaround is to fold the ALTER statements at the end of the initdb build script into the corresponding CREATE TABLE statements. Log into GKE and download the YugabyteDB-friendly version of the Airflow backend build script here:
After downloading the script, you’ll need to copy the script to a yb-tserver pod if you’d like to execute it from the command line. Modify the command below to account for the directory structure of your setup:
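A kubectl cp invocation along these lines should work; the script file name, namespace, pod name, and target path below are all assumptions to adjust for your setup:

```shell
# Copy the build script into the first tserver pod
kubectl cp airflowybrepo.sql yb-demo/yb-tserver-0:/home/yugabyte/airflowybrepo.sql
```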
Create a Password for the Default YugabyteDB User
For Airflow to be able to connect to the YugabyteDB cluster, we need to create a password for the default yugabyte user, which by default has none.
Log into YugabyteDB:
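ysqlsh is YugabyteDB's PostgreSQL-compatible shell, shipped inside the tserver pods; the namespace and pod names here are assumptions:

```shell
# Open a YSQL session against the first tserver
kubectl exec -n yb-demo -it yb-tserver-0 -- /home/yugabyte/bin/ysqlsh -h yb-tserver-0
```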
Assign yugabyte a password:
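From the ysqlsh prompt (pick your own password, of course):

```sql
ALTER ROLE yugabyte WITH PASSWORD 'password';
```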
Create a Database for the Airflow Repository
Create a database called airflowybrepo for Airflow to use as a metadata repository. Either execute the following script in your shell or use your favorite GUI tool.
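From ysqlsh this is a single statement:

```sql
CREATE DATABASE airflowybrepo;
```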
Build the Airflow Database Objects
Execute the airflowybrepo.sql script:
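One way to run the script, assuming it was copied into a tserver pod at the path shown (namespace, pod, and path are assumptions):

```shell
kubectl exec -n yb-demo -it yb-tserver-0 -- /home/yugabyte/bin/ysqlsh \
    -h yb-tserver-0 -d airflowybrepo -f /home/yugabyte/airflowybrepo.sql
```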
After it completes you should see 23 tables populated with initialization data, plus 15 sequences and 44 indexes.
Step 8: Restart Airflow With a YugabyteDB Backend
Modify the Backend Connection String Airflow Uses
Return to the airflowvm VM and update the airflow.cfg file with your YugabyteDB specific information and save the file:
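The key setting is sql_alchemy_conn in the [core] section of airflow.cfg. YSQL listens on port 5433; the host, user, and password below are placeholders for your own values:

```
sql_alchemy_conn = postgresql+psycopg2://yugabyte:password@<yb-tserver IP>:5433/airflowybrepo
```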
Restart the Airflow Scheduler
Restart the Airflow web server
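One way to restart both services (killing the old processes and relaunching them in the background is our approach, not the only one):

```shell
pkill -f "airflow scheduler" || true
pkill -f "airflow webserver" || true
airflow scheduler > scheduler.log 2>&1 &
airflow webserver -p 8080 > webserver.log 2>&1 &
```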
We can also verify that there is connectivity to the YugabyteDB backend by executing:
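In Airflow 1.10 the relevant subcommand is checkdb, which verifies that the configured metadata database is reachable:

```shell
airflow checkdb
```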
Refresh the Airflow web UI
Return to the Airflow web UI at http://<airflowvm IP>:8080 and refresh your browser.
That’s it! You now have Airflow deployed on a Google Compute Engine VM with a highly available, PostgreSQL compatible YugabyteDB backend running on Google Kubernetes Engine.
In part two of this blog series, we'll show you how to configure and run a workflow that detects and migrates data changes between PostgreSQL and YugabyteDB. In that post we'll cover:
- Installing PostgreSQL on Google Cloud
- Configuring GCP firewall rules
- Configuring Airflow database connections
- Creating an Airflow task file
- Running the task
- Monitoring and verifying the results
Published at DZone with permission of Jimmy Guerrero. See the original article here.