Running Airflow on Top of Apache Mesos

Want to orchestrate your workflows? Learn how to use Apache Airflow with CeleryExecutor and MesosExecutor to do just that.

Apache Airflow is a wonderful product, possibly one of the best when it comes to orchestrating workflows. Airflow supports different executors for running these workflows, namely LocalExecutor, SequentialExecutor, and CeleryExecutor. Of these, only CeleryExecutor supports distributed execution of tasks. A typical multi-node cluster setup using CeleryExecutor looks like the following.

[Figure: multi-node Airflow cluster setup using CeleryExecutor]

There is one more community-contributed executor that allows us to run Airflow tasks on Apache Mesos: MesosExecutor. This executor allows Airflow to be registered as a framework alongside others in Mesos. What that means is that as the Mesos slaves become free, the Mesos allocator offers their resources as "resource offers" to one of the registered frameworks.

With Airflow running on Mesos, the whole deployment architecture looks like:

[Figure: deployment architecture with Airflow running on Mesos]

This is how it works:

  1. MesosExecutor implements the Mesos Scheduler interface so that it can receive these resource offers. For the offers it accepts, it creates tasks and asks the MesosSchedulerDriver to launch them on the slaves that made those offers.
  2. Upon accepting a resource offer, the scheduler creates a task and specifies the command to be executed when the task runs on the slave.
  3. MesosSchedulerDriver then coordinates with the Mesos master to run these tasks on the Mesos slaves using the default executor.
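
The offer-matching step described above can be sketched in plain Python. This is a simplified model, not the actual MesosExecutor code: the `Offer` and `Task` classes, the `match_offers` function, and the per-task resource defaults are hypothetical names chosen for illustration, and no Mesos bindings are involved. It only shows the idea of packing queued Airflow commands into offers while CPU and memory remain.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Offer:
    """Simplified stand-in for a resource offer from one Mesos slave."""
    offer_id: str
    cpus: float
    mem_mb: float


@dataclass
class Task:
    """Simplified stand-in for a Mesos task: a shell command plus resources."""
    task_id: int
    command: str
    cpus: float
    mem_mb: float


def match_offers(offers, queued_commands, task_cpus=1.0, task_mem_mb=256.0):
    """Greedily pack queued commands into offers.

    Returns {offer_id: [Task, ...]}. Commands that fit in no offer stay in
    `queued_commands` and wait for the next round of offers.
    """
    launches = {}
    next_id = 0
    for offer in offers:
        remaining_cpus, remaining_mem = offer.cpus, offer.mem_mb
        tasks = []
        while (queued_commands
               and remaining_cpus >= task_cpus
               and remaining_mem >= task_mem_mb):
            command = queued_commands.popleft()
            tasks.append(Task(next_id, command, task_cpus, task_mem_mb))
            next_id += 1
            remaining_cpus -= task_cpus
            remaining_mem -= task_mem_mb
        # A real scheduler would pass `tasks` to the scheduler driver here;
        # an empty list corresponds to declining the offer.
        launches[offer.offer_id] = tasks
    return launches


# Illustrative commands; the DAG and task names are made up.
queue = deque([
    "airflow run example_dag task_a 2018-01-01T00:00:00",
    "airflow run example_dag task_b 2018-01-01T00:00:00",
    "airflow run example_dag task_c 2018-01-01T00:00:00",
])
offers = [Offer("offer-1", 2.0, 1024.0), Offer("offer-2", 1.0, 128.0)]
launches = match_offers(offers, queue)
for offer_id, tasks in launches.items():
    print(offer_id, [t.command for t in tasks])
```

In this sketch the second offer is too small for any task, so one command stays queued until a later offer arrives.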

There is just one issue that I see with MesosExecutor: it assumes that the Mesos slaves already have Airflow installed on them so that they can run those commands, which are actually Airflow commands. IMHO, this goes against the Mesos philosophy, which advocates for a heterogeneous cluster running different types of jobs as opposed to having separate clusters for running Hadoop, Spark, or Airflow jobs.

Mesos Python Eggs

Since Airflow is written in Python, it uses the Python bindings for Mesos (AKA Python eggs for Mesos). Previously, these eggs were directly downloadable from Mesosphere, but they don't seem to be available for direct download now. The workaround is to build Mesos for your platform, extract the eggs from the build, and put them in your Docker image for Airflow. A small script that fetches the source, builds it, and installs the eggs looks like:

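# Assumes MESOS_BASEURL and mesos_version (for example, 1.4.1) are already set.
wget --progress=dot "$MESOS_BASEURL/$mesos_version/mesos-${mesos_version}.tar.gz"
tar zxvf "mesos-${mesos_version}.tar.gz"
cd "mesos-${mesos_version}"
./configure
make
# Collect the Python eggs produced by the build.
mkdir -p /tmp/eggs
find . -name '*.egg' -exec cp {} /tmp/eggs \;
cd /tmp/eggs
# Egg file names depend on the Mesos version, Python version, and platform.
easy_install mesos.interface-1.4.1-py2.7.egg &&
easy_install mesos.scheduler-1.4.1-py2.7-linux-x86_64.egg &&
easy_install mesos.executor-1.4.1-py2.7-linux-x86_64.egg &&
easy_install mesos.native-1.4.1-py2.7.egg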

Note that mesos.cli and mesos.interface are installable via pip, but at the time of this writing, the other bindings need to be installed as eggs using easy_install.
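
After installing the eggs, it can be useful to confirm that the bindings are importable from the same Python environment that Airflow uses. The helper below is a minimal sketch: `missing_modules` is a hypothetical name, and the module list assumes that the eggs expose the `mesos.interface` and `mesos.native` packages, which is what their file names suggest.

```python
import importlib


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed module names, derived from the egg file names above.
REQUIRED = ["mesos.interface", "mesos.native"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing Mesos bindings: " + ", ".join(missing))
    else:
        print("All Mesos bindings are importable.")
```

Running this inside the Airflow Docker image gives a quick signal on whether the build and install steps worked before any DAG is scheduled.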

Once this is out of the way, you can now use MesosExecutor and run the Airflow tasks on Mesos slaves easily.
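
Switching executors is a configuration change in airflow.cfg. The sketch below parses a sample fragment and reads back the relevant settings. The section and key names (`[core] executor`, `[mesos] master`, `framework_name`, `task_cpu`, `task_memory`) are written as I recall them from the Airflow 1.x default configuration and should be checked against your Airflow version; the values and the `read_mesos_settings` helper are illustrative assumptions. The code targets Python 3.

```python
import configparser

# Sample airflow.cfg fragment; key names should be verified for your version.
SAMPLE_CFG = """
[core]
executor = MesosExecutor

[mesos]
master = localhost:5050
framework_name = Airflow
task_cpu = 1
task_memory = 256
"""


def read_mesos_settings(cfg_text):
    """Parse an airflow.cfg-style string and return the Mesos-related settings."""
    # Interpolation is disabled so that literal '%' characters in a real
    # airflow.cfg do not raise parsing errors.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {
        "executor": parser.get("core", "executor"),
        "master": parser.get("mesos", "master"),
        "framework_name": parser.get("mesos", "framework_name"),
        "task_cpu": parser.getint("mesos", "task_cpu"),
        "task_memory": parser.getint("mesos", "task_memory"),
    }


settings = read_mesos_settings(SAMPLE_CFG)
print(settings)
```

Here `task_cpu` and `task_memory` describe the resources requested for each Airflow task, which determines how many tasks fit into a single resource offer.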
