
Part 1: Deploying a Distributed SQL Backend for Apache Airflow on Google Cloud

In this post we’ll show you how to get Airflow configured to use a highly available, distributed SQL backend.


Apache Airflow is a popular platform for programmatically authoring, scheduling, and monitoring workflows. Airflow has been deployed by companies like Adobe, Airbnb, Etsy, Instacart, and Square. The advantage of defining workflows as code is that they become more maintainable, versionable, testable, and collaborative. Airflow is used to author these workflows as directed acyclic graphs (DAGs) of tasks. Airflow’s scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. Its browser-based UI makes it easy to visualize pipelines running in production, monitor their progress, and troubleshoot issues when needed.

Why Airflow With a YugabyteDB Backend?

By default, Airflow uses a SQLite database for its metadata store, which both the scheduler and the web UI rely on. Typically, when Airflow is used in production, the SQLite backend is replaced with a traditional RDBMS like PostgreSQL. However, to keep PostgreSQL from becoming a single point of failure in the Airflow deployment, administrators still need to devise high-availability and failover strategies for it. There’s a simpler option that Airflow can interact with just like PostgreSQL, but with high availability, support for multiple cloud and topology deployment options, and high performance built in: YugabyteDB.

What’s YugabyteDB? It is an open source, high-performance distributed SQL database built on a scalable and fault-tolerant design inspired by Google Spanner. Yugabyte’s SQL API (YSQL) is PostgreSQL wire compatible.

When YugabyteDB is used in place of PostgreSQL in an Airflow deployment, the administrator no longer has to worry about maintaining complex PostgreSQL replication and failover schemes to minimize the possibility of a failed backend causing dataflows to fail. Instead, a distributed SQL backend like YugabyteDB can deliver PostgreSQL-compatibility, high availability, geo data distribution, and high performance in a single open source system.
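Because YSQL speaks the PostgreSQL wire protocol, the connection string Airflow needs is an ordinary postgres:// URL; the only difference from a stock PostgreSQL setup is the port, 5433. A quick sketch (the host and credentials here are placeholders, not real values):

```python
def yb_conn_string(user: str, password: str, host: str, db: str, port: int = 5433) -> str:
    """Assemble the sql_alchemy_conn URL Airflow expects; YSQL listens on 5433."""
    return f"postgres://{user}:{password}@{host}:{port}/{db}"

# Placeholder values — substitute your cluster's external IP and real password.
print(yb_conn_string("yugabyte", "password", "10.0.0.5", "airflowybrepo"))
# → postgres://yugabyte:password@10.0.0.5:5433/airflowybrepo
```

We’ll plug exactly this style of URL into airflow.cfg in Step 8.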

In this post, part one of a two-part blog series, we’ll show you how to get Airflow configured to use YugabyteDB as a backend. In part two, we’ll show you how to build an Airflow workflow that migrates data between PostgreSQL and YugabyteDB.

For the purposes of this demo, our Airflow and PostgreSQL installations will make use of virtual machines on Google Compute Engine and YugabyteDB will be installed on Google Kubernetes Engine.

In this post we’ll cover the following steps:

  • Deploy a VM for Airflow
  • Configure a firewall rule for the Airflow UI
  • Install Apache Airflow
  • Start the Airflow Service
  • Set up a GKE Cluster
  • Install YugabyteDB on GKE
  • Build the Airflow backend on YugabyteDB
  • Restart Airflow with a YugabyteDB backend

Prerequisites

Below is the environment that we’ll be using for the purposes of this blog.

Note: For the purposes of this demo, we aren’t creating a particularly secure deployment; instead, we’re focusing on demonstrating how to wire everything up with the least amount of fuss. In a production deployment you’ll want to enforce additional security measures throughout the stack.

Step 1: Deploy a VM for Airflow

For the purposes of this demo, I have specified the following configuration for the Google Compute Engine VM that will be hosting my Airflow server.

  • Name: airflowvm
  • Machine Type: n1-standard-2 (2vCPU, 7.5 GB memory)
  • OS: Ubuntu 18.04 LTS
  • Disk: 100 GB
  • Firewall: Allow HTTP/HTTPS traffic
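If you prefer the command line to the Cloud Console, a VM matching the configuration above can be provisioned with gcloud. This is a sketch only; the image family and network tags are assumptions that correspond to the Ubuntu 18.04 and HTTP/HTTPS settings listed:

```shell
gcloud compute instances create airflowvm \
  --machine-type=n1-standard-2 \
  --image-family=ubuntu-1804-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=100GB \
  --tags=http-server,https-server
```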

Step 2: Configure a Firewall Rule for the Airflow UI

To access the Airflow Web UI we’ll need to set up a firewall rule to expose an Airflow ingress on port 8080. In the Google Cloud Console you’ll want to open up the Firewall rules page and create a rule with the following characteristics:

  • Name: airflowwebui
  • Direction: Ingress
  • Targets: All instances in the network
  • Source Filter: IP ranges
  • Source IP Ranges: 0.0.0.0/0
  • Protocols and ports: TCP, 8080
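The rule above can also be created from the command line. A sketch, assuming the default network (adjust to your project):

```shell
gcloud compute firewall-rules create airflowwebui \
  --direction=INGRESS \
  --network=default \
  --source-ranges=0.0.0.0/0 \
  --allow=tcp:8080
```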

Step 3: Install Apache Airflow

Next, let’s install Apache Airflow on the VM by running the following commands:

$ cd ~/
$ sudo apt-get update
$ sudo apt-get install python3-pip
$ export AIRFLOW_HOME=~/airflow
$ sudo pip3 install apache-airflow[postgres]


Verify that Apache Airflow is installed by executing the following command:

$ airflow version
v1.10.10


Step 4: Start the Airflow Service

Initialize the Default Backend

To get Airflow initialized with the default SQLite backend we’ll need to execute the following commands:

$ cd ~/
$ cd airflow
$ airflow initdb


Start the Airflow Scheduler

Next, start the Airflow scheduler service by running the following command:

$ airflow scheduler -D


Start the Airflow Web Server

Start the Airflow web server with:

$ airflow webserver -p 8080 -D


Access the Airflow web UI

To access the web UI, go to http://<airflowvm IP>:8080

At this point we have Airflow up and running on GCP. The next step is to set up YugabyteDB on GKE.

Step 5: Set up a GKE Cluster

If you recall, YugabyteDB will serve two purposes in this demo. First, as a backend for Airflow and second, as a target for our example DAG that will detect and move updates from PostgreSQL tables to YugabyteDB.

For the purposes of this demo I have specified the following configuration for my GKE cluster that will host YugabyteDB:

YugabyteDB

  • Name: yugabytedbgke
  • Number of Nodes: 3
  • Machine Type: n1-standard-8 (8vCPU, 30 GB memory)
  • Max Pods: 48
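The equivalent gcloud sketch for this cluster (the zone is an assumption; pick whichever suits your project):

```shell
gcloud container clusters create yugabytedbgke \
  --num-nodes=3 \
  --machine-type=n1-standard-8 \
  --default-max-pods-per-node=48 \
  --zone=us-central1-a
```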

Once the GKE cluster is live, we are ready to log into the cluster and install YugabyteDB.

Step 6: Install YugabyteDB on GKE

Note: As of the writing of this post, the current version of YugabyteDB is 2.1.6. You can check for the latest version of YugabyteDB on our GitHub releases page.

Upgrade Helm

$ helm init --upgrade --wait

$HELM_HOME has been configured at /home/jimmy/.helm.
Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.

Please note: by default, Tiller is deployed with an insecure 'allow unauthenticated users' policy.
To prevent this, run `helm init` with the --tiller-tls-verify flag.
For more information on securing your installation see: https://docs.helm.sh/using_helm/#securing-your-helm-installation


Create a Service Account

Before we can create the cluster, we need a service account that has been granted the cluster-admin role. Use the following command to create a yugabyte-helm service account bound to the cluster-admin ClusterRole.

$ kubectl create -f https://raw.githubusercontent.com/yugabyte/charts/master/stable/yugabyte/yugabyte-rbac.yaml

serviceaccount/yugabyte-helm created
clusterrolebinding.rbac.authorization.k8s.io/yugabyte-helm created


Initialize Helm

$ helm init --service-account yugabyte-helm --upgrade --wait

$HELM_HOME has been configured at /home/jimmy/.helm.
Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.

Please note: by default, Tiller is deployed with an insecure 'allow unauthenticated users' policy.
To prevent this, run `helm init` with the --tiller-tls-verify flag.
For more information on securing your installation see: https://docs.helm.sh/using_helm/#securing-your-helm-installation


Create a Namespace

$ kubectl create namespace yb-demo

namespace/yb-demo created


Add the Charts Repository

$ helm repo add yugabytedb https://charts.yugabyte.com

"yugabytedb" has been added to your repositories


Update the Repository

$ helm repo update

Hang tight while we grab the latest from your chart repositories...
...Skip local chart repository
...Successfully got an update from the "yugabytedb" chart repository
...Successfully got an update from the "stable" chart repository
Update Complete.


Install YugabyteDB

$ helm install yugabytedb/yugabyte --namespace yb-demo --name yb-demo --wait


To check the status of the YugabyteDB cluster, execute the command below:

$ helm status yb-demo


You can also navigate to the Services & Ingress view in the GCP Console and take note of the various endpoints, IPs, and ports.

Congrats! At this point you have a three node YugabyteDB cluster running on GKE.

Step 7: Build the Airflow Backend on YugabyteDB

YugabyteDB currently doesn’t support the ALTER TABLE ADD PRIMARY KEY command (GitHub Issue #1104), which means that the Airflow initdb command won’t work out of the box. The simple workaround is to move the ALTER statements at the end of the initdb build script into the CREATE TABLE statements. Log into GKE and download the YugabyteDB-friendly version of the Airflow backend build script here:

$ wget https://raw.githubusercontent.com/yugabyte/yugabyte-db/master/sample/airflowybrepo.sql


After downloading the script, you’ll need to copy the script to a yb-tserver pod if you’d like to execute it from the command line. Modify the command below to account for the directory structure of your setup:

$ kubectl cp /home/jimmy/airflowybrepo.sql yb-demo/yb-tserver-0:/home/yugabyte/share -c yb-tserver
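The modified script differs from the stock Airflow DDL only in where the primary keys are declared. In miniature, using a hypothetical table (not one of Airflow’s):

```sql
-- Fails on YugabyteDB 2.1 (GitHub Issue #1104):
CREATE TABLE example (id SERIAL);
ALTER TABLE example ADD PRIMARY KEY (id);

-- Works: declare the primary key inline instead:
CREATE TABLE example (id SERIAL PRIMARY KEY);
```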


Create a Password for the Default YugabyteDB User

For Airflow to be able to connect to the YugabyteDB cluster, we need to create a password for the default yugabyte user, which by default has none.

Log into YugabyteDB:

$ kubectl --namespace yb-demo exec -it yb-tserver-0 -- /home/yugabyte/bin/ysqlsh -h yb-tserver-0


Assign yugabyte a password:

yugabyte=# ALTER USER yugabyte WITH PASSWORD 'password';
ALTER ROLE


Create a database for the Airflow repository

Create a database called airflowybrepo for Airflow to use as a metadata repository. Either execute the following script in your shell or use your favorite GUI tool.

yugabyte=# CREATE DATABASE airflowybrepo;

yugabyte=# \c airflowybrepo;


Build the Airflow Database Objects

Execute the airflowybrepo.sql script:

airflowybrepo=# \i /home/yugabyte/share/airflowybrepo.sql


After it completes you should see 23 tables populated with initialization data, plus 15 sequences and 44 indexes.
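From the same ysqlsh session, you can spot-check those counts with standard PostgreSQL catalog queries (which YSQL supports):

```sql
SELECT count(*) FROM information_schema.tables    WHERE table_schema = 'public';    -- expect 23
SELECT count(*) FROM information_schema.sequences WHERE sequence_schema = 'public'; -- expect 15
SELECT count(*) FROM pg_indexes                   WHERE schemaname = 'public';      -- expect 44
```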

Step 8: Restart Airflow With a YugabyteDB Backend

Modify the Backend Connect String Airflow Uses

Return to the airflowvm VM and update the airflow.cfg file with your YugabyteDB specific information and save the file:

$ cd airflow
$ vim airflow.cfg

sql_alchemy_conn = postgres://yugabyte:password@<YUGABYTEDB-CLUSTER-IP>:5433/airflowybrepo


Restart the Airflow Scheduler

$ airflow scheduler -D


Restart the Airflow web server

$ airflow webserver -p 8080 -D


We can also verify that there is connectivity to the YugabyteDB backend by executing:

$ airflow checkdb

DB: postgres://yugabyte:***@104.197.36.19:5433/airflowybrepo
[2020-04-21 02:53:45,254] {db.py:430} INFO - Connection successful.


Refresh the Airflow web UI

Return to the Airflow web UI at http://<airflowvm IP>:8080 and refresh your browser.

That’s it! You now have Airflow deployed on a Google Compute Engine VM with a highly available, PostgreSQL compatible YugabyteDB backend running on Google Kubernetes Engine.

What’s Next?

In part two of this blog series, we’ll show you how to configure and run a workflow that detects and migrates data changes between PostgreSQL and YugabyteDB. In that post we’ll cover:

  • Installing PostgreSQL on Google Cloud
  • Configuring GCP firewall rules
  • Configuring Airflow database connections
  • Creating an Airflow task file
  • Running the task
  • Monitoring and verifying the results

Stay tuned!

Topics:
airflow, apache airflow, google cloud, kubernetes, postgresql, yugabytedb

Published at DZone with permission of Jimmy Guerrero. See the original article here.
