Part 1: Deploying a Distributed SQL Backend for Apache Airflow on Google Cloud
In this post we’ll show you how to get Airflow configured to use a highly available, distributed SQL backend.
Apache Airflow is a popular platform for programmatically authoring, scheduling, and monitoring workflows. Airflow has been deployed by companies like Adobe, Airbnb, Etsy, Instacart, and Square. The advantage of defining workflows as code is that they become more maintainable, versionable, testable, and collaborative. Airflow is used to author these workflows as directed acyclic graphs (DAGs) of tasks. Airflow’s scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. Its browser-based UI makes it easy to visualize pipelines running in production, monitor their progress, and troubleshoot issues when needed.
Why Airflow With a YugabyteDB Backend?
By default, Airflow makes use of a SQLite database for its metadata store, which both the scheduler and web UI rely on. Typically, when Airflow is used in production, the SQLite backend is replaced with a traditional RDBMS like PostgreSQL. However, in order for PostgreSQL not to become a single point of failure in the Airflow deployment, administrators still need to devise high-availability and failover strategies for it. There’s a simpler solution that Airflow can interact with just like PostgreSQL, but with the added advantages of high availability, support for multiple cloud and topology deployment options, and high performance built in: YugabyteDB.
What’s YugabyteDB? It is an open source, high-performance distributed SQL database built on a scalable and fault-tolerant design inspired by Google Spanner. Yugabyte’s SQL API (YSQL) is PostgreSQL wire compatible.
When YugabyteDB is used in place of PostgreSQL in an Airflow deployment, the administrator no longer has to worry about maintaining complex PostgreSQL replication and failover schemes to minimize the possibility of a failed backend causing dataflows to fail. Instead, a distributed SQL backend like YugabyteDB can deliver PostgreSQL-compatibility, high availability, geo data distribution, and high performance in a single open source system.
In this post, part one of a two part blog series, we’ll show you how to get Airflow configured to use YugabyteDB as a backend. In part two, we’ll show you how to build an Airflow workflow that migrates data between PostgreSQL and YugabyteDB.
For the purposes of this demo, our Airflow and PostgreSQL installations will make use of virtual machines on Google Compute Engine and YugabyteDB will be installed on Google Kubernetes Engine.
In this post we’ll cover the following steps:
- Deploy a VM for Airflow
- Configure a firewall rule for the Airflow UI
- Install Apache Airflow
- Start the Airflow Service
- Set up a GKE Cluster
- Install YugabyteDB on GKE
- Build the Airflow backend on YugabyteDB
- Restart Airflow with a YugabyteDB backend
Prerequisites
Below is the environment that we’ll be using for the purposes of this blog.
- YugabyteDB – version 2.1.6
- Apache Airflow – version 1.10.10
- A Google Cloud Platform account
Note: For the purposes of this demo, we aren’t creating a particularly secure deployment; instead, we are focusing on demonstrating how to wire everything up with the least amount of fuss. In a production deployment you’ll want to enforce additional security measures throughout the stack.
Step 1: Deploy a VM for Airflow
For the purposes of this demo, I have specified the following configuration for the Google Compute Engine VM that will be hosting my Airflow server.
- Name: airflowvm
- Machine Type: n1-standard-2 (2vCPU, 7.5 GB memory)
- OS: Ubuntu 18.04 LTS
- Disk: 100 GB
- Firewall: Allow HTTP/HTTPS traffic
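If you prefer the command line to the Cloud Console, a gcloud equivalent might look like the following sketch (the zone is an assumption; substitute your own, and note that the http-server/https-server tags correspond to the HTTP/HTTPS firewall checkboxes):
$ gcloud compute instances create airflowvm \
    --zone=us-central1-a \
    --machine-type=n1-standard-2 \
    --image-family=ubuntu-1804-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=100GB \
    --tags=http-server,https-server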
Step 2: Configure a Firewall Rule for the Airflow UI
To access the Airflow Web UI we’ll need to set up a firewall rule to expose an Airflow ingress on port 8080. In the Google Cloud Console you’ll want to open up the Firewall rules page and create a rule with the following characteristics:
- Name: airflowwebui
- Direction: Ingress
- Targets: All instances in the network
- Source Filter: IP ranges
- Source IP Ranges: 0.0.0.0/0
- Protocols and ports: TCP, 8080
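The same rule can be created from the command line; here’s a sketch, assuming the default network:
$ gcloud compute firewall-rules create airflowwebui \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:8080 \
    --source-ranges=0.0.0.0/0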
Step 3: Install Apache Airflow
Next, let’s install Apache Airflow on the VM by running the following commands:
$ cd ~/
$ sudo apt-get update
$ sudo apt-get install python3-pip
$ export AIRFLOW_HOME=~/airflow
$ sudo pip3 install apache-airflow[postgres]
Verify that Apache Airflow is installed by executing the following command:
$ airflow version
v1.10.10
Step 4: Start the Airflow Service
Initialize the Default Backend
To get Airflow initialized with the default SQLite backend we’ll need to execute the following commands:
$ cd ~/
$ cd airflow
$ airflow initdb
Start the Airflow Scheduler
Next, start the Airflow scheduler service by running the following command:
$ airflow scheduler -D
Start the Airflow Web Server
Start the Airflow web server with:
$ airflow webserver -p 8080 -D
Access the Airflow Web UI
To access the web UI, go to http://<airflowvm IP>:8080
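If the UI doesn’t come up, a quick sanity check from the VM itself confirms whether the web server is listening; any 2xx or 3xx response means it’s up:
$ curl -I http://localhost:8080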
At this point we have Airflow up and running on GCP. The next step is to set up YugabyteDB on GKE.
Step 5: Set up a GKE Cluster
If you recall, YugabyteDB will serve two purposes in this demo. First, as a backend for Airflow and second, as a target for our example DAG that will detect and move updates from PostgreSQL tables to YugabyteDB.
For the purposes of this demo I have specified the following configuration for my GKE cluster that will host YugabyteDB:
YugabyteDB
- Name: yugabytedbgke
- Number of Nodes: 3
- Machine Type: n1-standard-8 (8vCPU, 30 GB memory)
- Max Pods: 48
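As a sketch, the equivalent gcloud command would look something like the following (the zone is an assumption, and the max-pods setting requires a VPC-native cluster, hence the --enable-ip-alias flag):
$ gcloud container clusters create yugabytedbgke \
    --zone=us-central1-a \
    --num-nodes=3 \
    --machine-type=n1-standard-8 \
    --enable-ip-alias \
    --default-max-pods-per-node=48
After the cluster is created, run gcloud container clusters get-credentials yugabytedbgke so that kubectl is pointed at it.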
Once the GKE cluster is live, we are ready to log into the cluster and install YugabyteDB.
Step 6: Install YugabyteDB on GKE
Note: As of the writing of this post, the current version of YugabyteDB is 2.1.6. You can check for the latest version of YugabyteDB on our GitHub releases page.
Upgrade Helm
$ helm init --upgrade --wait
$HELM_HOME has been configured at /home/jimmy/.helm.
Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.
Please note: by default, Tiller is deployed with an insecure 'allow unauthenticated users' policy.
To prevent this, run `helm init` with the --tiller-tls-verify flag.
For more information on securing your installation see: https://docs.helm.sh/using_helm/#securing-your-helm-installation
Create a Service Account
Before we can create the cluster, we need a service account that has been granted the cluster-admin role. Use the following command to create a yugabyte-helm service account bound to the cluster-admin ClusterRole.
$ kubectl create -f https://raw.githubusercontent.com/yugabyte/charts/master/stable/yugabyte/yugabyte-rbac.yaml
serviceaccount/yugabyte-helm created
clusterrolebinding.rbac.authorization.k8s.io/yugabyte-helm created
Initialize Helm
$ helm init --service-account yugabyte-helm --upgrade --wait
$HELM_HOME has been configured at /home/jimmy/.helm.
Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.
Please note: by default, Tiller is deployed with an insecure 'allow unauthenticated users' policy.
To prevent this, run `helm init` with the --tiller-tls-verify flag.
For more information on securing your installation see: https://docs.helm.sh/using_helm/#securing-your-helm-installation
Create a Namespace
$ kubectl create namespace yb-demo
namespace/yb-demo created
Add the Charts Repository
$ helm repo add yugabytedb https://charts.yugabyte.com
"yugabytedb" has been added to your repositories
Update the Repository
$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Skip local chart repository
...Successfully got an update from the "yugabytedb" chart repository
...Successfully got an update from the "stable" chart repository
Update Complete.
Install YugabyteDB
$ helm install yugabytedb/yugabyte --namespace yb-demo --name yb-demo --wait
To check the status of the YugabyteDB cluster, execute the command below:
$ helm status yb-demo
You can also navigate to the Services & Ingress view in the GCP Console and take note of the various endpoints, IPs, and ports.
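The same endpoint information is available from the command line:
$ kubectl get services --namespace yb-demo
Make a note of the external IP of the tserver LoadBalancer service (the one exposing port 5433); we’ll need it later when pointing Airflow at the cluster.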
Congrats! At this point you have a three node YugabyteDB cluster running on GKE.
Step 7: Build the Airflow Backend on YugabyteDB
YugabyteDB currently doesn’t support the ALTER TABLE ADD PRIMARY KEY command (GitHub Issue #1104), which means that the Airflow initdb command won’t work out of the box. The simple workaround is to move the ALTER statements at the end of the initdb build script into the CREATE TABLE statements. Log into GKE and download the YugabyteDB-friendly version of the Airflow backend build script:
$ wget https://raw.githubusercontent.com/yugabyte/yugabyte-db/master/sample/airflowybrepo.sql
After downloading the script, you’ll need to copy the script to a yb-tserver pod if you’d like to execute it from the command line. Modify the command below to account for the directory structure of your setup:
$ kubectl cp /home/jimmy/airflowybrepo.sql yb-demo/yb-tserver-0:/home/yugabyte/share -c yb-tserver
Create a Password for the Default YugabyteDB User
For Airflow to be able to connect to the YugabyteDB cluster, we need to create a password for the default yugabyte user, which by default has none.
Log into YugabyteDB:
$ kubectl --namespace yb-demo exec -it yb-tserver-0 -- /home/yugabyte/bin/ysqlsh -h yb-tserver-0
Assign yugabyte a password:
yugabyte=# ALTER USER yugabyte WITH PASSWORD 'password';
ALTER ROLE
Create a Database for the Airflow Repository
Create a database called airflowybrepo for Airflow to use as a metadata repository. Either execute the following script in your shell or use your favorite GUI tool.
yugabyte=# CREATE DATABASE airflowybrepo;
yugabyte=# \c airflowybrepo;
Build the Airflow Database Objects
Execute the airflowybrepo.sql script:
airflowybrepo=# \i /home/yugabyte/share/airflowybrepo.sql
After it completes, you should see 23 tables populated with initialization data, plus 15 sequences and 44 indexes.
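As a quick spot check, you can count the tables from the same ysqlsh session; the query below should return 23:
airflowybrepo=# SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';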
Step 8: Restart Airflow With a YugabyteDB Backend
Modify the Backend Connect String Airflow Uses
Return to the airflowvm VM, update the airflow.cfg file with your YugabyteDB-specific information, and save the file:
$ cd airflow
$ vim airflow.cfg
sql_alchemy_conn = postgres://yugabyte:password@<YUGABYTEDB-CLUSTER-IP>:5433/airflowybrepo
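If you don’t have the cluster IP handy, you can pull it from the tserver LoadBalancer service; a sketch (yb-tserver-service is the service name created in our setup; verify yours with kubectl get svc --namespace yb-demo):
$ kubectl get svc yb-tserver-service --namespace yb-demo \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'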
Restart the Airflow Scheduler
$ airflow scheduler -D
Restart the Airflow Web Server
$ airflow webserver -p 8080 -D
We can also verify that there is connectivity to the YugabyteDB backend by executing:
$ airflow checkdb
DB: postgres://yugabyte:***@104.197.36.19:5433/airflowybrepo
[2020-04-21 02:53:45,254] {db.py:430} INFO - Connection successful.
Refresh the Airflow Web UI
Return to the Airflow web UI at http://<airflowvm IP>:8080 and refresh your browser.
That’s it! You now have Airflow deployed on a Google Compute Engine VM with a highly available, PostgreSQL compatible YugabyteDB backend running on Google Kubernetes Engine.
What’s Next?
In part two of this blog series, we’ll show you how to configure and run a workflow that detects and migrates data changes between PostgreSQL and YugabyteDB. In that post we’ll cover:
- Installing PostgreSQL on Google Cloud
- Configuring GCP firewall rules
- Configuring Airflow database connections
- Creating an Airflow task file
- Running the task
- Monitoring and verifying the results
Stay tuned!