Running Airflow on Top of Apache Mesos

Want to orchestrate your workflows? Learn how to use Apache Airflow with CeleryExecutor and MesosExecutor to do just that.

Apache Airflow is a wonderful product, possibly one of the best when it comes to orchestrating workflows. Airflow supports several executors for running these workflows, including SequentialExecutor, LocalExecutor, and CeleryExecutor. Of these, only CeleryExecutor distributes task execution across multiple nodes. A typical multi-node cluster setup using CeleryExecutor looks like the following:

[Figure: typical multi-node Airflow cluster setup using CeleryExecutor]

There is one more community-contributed executor that allows us to run Airflow tasks on Apache Mesos: MesosExecutor. This executor allows Airflow to register as a framework alongside others in Mesos. This means that as resources on the Mesos slaves become free, the Mesos allocator offers them as "resource offers" to one of the registered frameworks.

With Airflow running on Mesos, the whole deployment architecture looks like:

[Figure: deployment architecture of Airflow running on Mesos]

This is how it works:

  1. MesosExecutor implements the Mesos Scheduler interface to accept these resource offers, create tasks, and ask the MesosSchedulerDriver to launch those tasks on the slaves that were part of the accepted offers.
  2. Upon accepting a resource offer, the scheduler creates a task and specifies the command to be executed when the task runs on the slave.
  3. The MesosSchedulerDriver then coordinates with the Mesos master to run these tasks on the Mesos slaves using the default executor.
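
To make the offer-handling step more concrete, here is a simplified model of how a scheduler might decide how many queued tasks fit into a single resource offer. It is only a sketch, not the executor's actual code: the function name tasks_for_offer and its arguments are hypothetical, and it assumes whole-number CPUs and memory in MB, with every task requesting the same fixed amount of each.

```shell
# Hypothetical helper: count how many queued tasks fit into one resource offer.
# Usage: tasks_for_offer OFFER_CPUS OFFER_MEM_MB TASK_CPU TASK_MEM_MB QUEUED_TASKS
tasks_for_offer() {
  local cpus=$1
  local mem=$2
  local task_cpu=$3
  local task_mem=$4
  local queued=$5
  local launched=0

  # Keep assigning tasks while work is queued and the offer still has
  # enough CPU and memory left for one more task.
  while [ "$queued" -gt 0 ] && [ "$cpus" -ge "$task_cpu" ] && [ "$mem" -ge "$task_mem" ]; do
    cpus=$((cpus - task_cpu))
    mem=$((mem - task_mem))
    queued=$((queued - 1))
    launched=$((launched + 1))
  done

  echo "$launched"
}

# An offer of 4 CPUs and 1024 MB, tasks of 1 CPU and 256 MB, 10 tasks queued.
tasks_for_offer 4 1024 1 256 10   # prints 4
```

Whatever does not fit into the offer stays queued until a later offer arrives.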

There is just one issue that I see with MesosExecutor: it assumes that the Mesos slaves already have Airflow installed so that they can run those commands, which are actually Airflow commands. IMHO, this goes against the Mesos philosophy, which advocates a heterogeneous cluster running different types of jobs as opposed to having separate clusters for running Hadoop, Spark, or Airflow jobs.

Mesos Python Eggs

Since Airflow is written in Python, it uses the Python bindings for Mesos (AKA Python eggs for Mesos). Previously, these eggs were directly downloadable from Mesosphere, but they don't seem to be available for direct download now. The workaround is to build Mesos for your platform, extract the eggs from the build, and put them in your Docker image for Airflow. A small script that fetches and builds Mesos and then installs the eggs looks like:

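# Assumes MESOS_BASEURL and mesos_version (for example, 1.4.1) are already set.
wget --progress=dot "$MESOS_BASEURL/$mesos_version/mesos-${mesos_version}.tar.gz"
tar zxvf "mesos-${mesos_version}.tar.gz"
cd "mesos-${mesos_version}"
./configure
make
# Collect the Python eggs produced by the build.
mkdir -p /tmp/eggs
find . -name '*.egg' -exec cp {} /tmp/eggs \;
cd /tmp/eggs
# Egg file names assume Python 2.7 on linux-x86_64; adjust for your platform.
easy_install "mesos.interface-${mesos_version}-py2.7.egg" &&
easy_install "mesos.scheduler-${mesos_version}-py2.7-linux-x86_64.egg" &&
easy_install "mesos.executor-${mesos_version}-py2.7-linux-x86_64.egg" &&
easy_install "mesos.native-${mesos_version}-py2.7.egg"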

Note that mesos.cli and mesos.interface are installable via pip, but at the time of this writing, the other bindings need to be installed as eggs using easy_install.
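
Before switching executors, it may be worth checking that the bindings actually import. The snippet below is a minimal sketch: check_python_module is a hypothetical helper, and the PYTHON variable is an assumption that lets you point at a specific interpreter (it falls back to python).

```shell
# Hypothetical helper: succeeds if the given Python module can be imported.
# Set PYTHON to choose the interpreter; defaults to "python".
check_python_module() {
  "${PYTHON:-python}" -c "import $1" 2>/dev/null
}

# Report the import status of the Mesos bindings installed above.
for mod in mesos.interface mesos.native; do
  if check_python_module "$mod"; then
    echo "ok: $mod"
  else
    echo "MISSING: $mod"
  fi
done
```

If either module is reported as missing, the eggs were most likely installed into a different Python environment than the one Airflow uses.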

Once this is out of the way, you can now use MesosExecutor and run the Airflow tasks on Mesos slaves easily.
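
One way to switch executors is through Airflow's environment-variable overrides, which take the form AIRFLOW__SECTION__KEY. The following is a sketch under assumptions: it presumes an Airflow 1.x release that still ships the contrib MesosExecutor and its [mesos] config section, MESOS_MASTER is a hypothetical variable for your master's address, and the CPU and memory values are placeholders rather than recommendations.

```shell
# Select the Mesos executor instead of the one set in airflow.cfg.
export AIRFLOW__CORE__EXECUTOR=MesosExecutor

# Address of the Mesos master (assumption: falls back to localhost:5050).
export AIRFLOW__MESOS__MASTER="${MESOS_MASTER:-localhost:5050}"

# Resources requested per Airflow task (placeholder values; memory in MB).
export AIRFLOW__MESOS__TASK_CPU=1
export AIRFLOW__MESOS__TASK_MEMORY=256
```

The same keys can be set in airflow.cfg instead, under the [core] and [mesos] sections.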

Topics:
big data ,mesos ,apache airflow ,tutorial

Published at DZone with permission of
