
Using Mesos, Airflow, and Docker Together


Learn how to fetch your Airflow's Docker image before actually running the Airflow command in your task on the slave.


This is my second post in the journey of running Apache Airflow on top of Apache Mesos. If you haven't already, please read the first one to get the right context. This post is about enabling Docker support while running Airflow on a Mesos cluster. As I mentioned last time, in its current form, MesosExecutor does not let you specify a Docker image when creating tasks for Mesos slaves; it assumes that all the slave nodes already have Airflow installed and configured. This is a big assumption, and one that goes against the philosophy of a heterogeneous Mesos cluster (one Mesos cluster typically runs Spark and MapReduce jobs, etc. alongside Airflow jobs, and could also be used to run other services).

To rectify this, I proposed a solution. As of this writing, the PR is yet to be merged, but you can take the code and modify your local copy of MesosExecutor to enable support for specifying a Docker image when creating a task for the slaves to run.

DIY Guide

Until the PR is merged and a new version is released, you can follow the steps below to apply the fix yourself.

1. Dockerize Airflow

Firstly, you need to dockerize Airflow. puckel/docker-airflow is a good place to start, but feel free to copy bits and pieces and create your own Docker image as needed. For example, add your DAGs and plugins to the vanilla Airflow in the Docker image. With a Docker-based setup, it becomes very easy to overlay certain files (in our case, mesos_executor.py) on top of the Airflow installation in the image. Keep the directory structure the same as Airflow's, and you can then simply copy your modified files over the already-installed Airflow files:

#Overlay Mesos Executor in installed Airflow
RUN . ${PYTHON_VIRTUAL_ENV_FOLDER}/bin/activate && \
    cp -r ${PYTHON_APP_FOLDER}/overlays/airflow/* /usr/src/venv/lib/python2.7/site-packages/airflow/
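For context, the overlays directory referenced above (along with your DAGs and plugins) would be baked into the image earlier in the Dockerfile. A minimal sketch, assuming hypothetical AIRFLOW_HOME and PYTHON_APP_FOLDER locations that match your base image:

# Hypothetical Dockerfile fragment: copy DAGs, plugins, and the overlay files
# (including the patched mesos_executor.py) into the image.
COPY dags/ ${AIRFLOW_HOME}/dags/
COPY plugins/ ${AIRFLOW_HOME}/plugins/
COPY overlays/ ${PYTHON_APP_FOLDER}/overlays/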

Now, with MesosExecutor turned on in the settings (executor = MesosExecutor in the [core] section of airflow.cfg), you need the Mesos Python eggs installed on the machines where Airflow is supposed to run. Again, these eggs can be added to your Docker image until they become readily downloadable or pip-installable. For now, you can extract these eggs from a local installation of Mesos (by building it from source), as mentioned in the first post.
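As an illustration, and assuming the same virtualenv layout as the overlay snippet above, the eggs could be baked into the image with a fragment along these lines (the egg file names and versions depend on your Mesos build, so treat this as a sketch):

# Hypothetical Dockerfile fragment: install the Mesos Python eggs built from
# source into the image's virtualenv. Paths and egg names are placeholders.
COPY eggs/ /tmp/mesos-eggs/
RUN . ${PYTHON_VIRTUAL_ENV_FOLDER}/bin/activate && \
    easy_install /tmp/mesos-eggs/mesos.interface-*.egg && \
    easy_install /tmp/mesos-eggs/mesos.native-*.egg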

2. Specify Docker Image in Airflow Config

The proposed fix adds a new configuration option to Airflow: docker_image_slave. You can set it in the [mesos] section of your airflow.cfg file; a sketch follows below. When a non-null value is given, the modified MesosExecutor uses this Docker image when creating tasks for the Mesos slaves to run. Obviously, the specified image must be accessible from your Mesos slave machines: if the image is available in the public Docker registry, the slaves should be able to fetch it directly. If your dockerized Airflow image resides in a private repository, there are several ways (listed after the sketch) to configure the Mesos agent to pull from it, all involving the standard Docker config file (docker.cfg), which contains the authentication information required to pull an image from the private repository.
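A minimal sketch of the relevant airflow.cfg section, assuming the option name from the proposed PR; the master address and image name are placeholders:

[mesos]
master = mesos-master.example.com:5050
framework_name = Airflow
docker_image_slave = my-registry.example.com/airflow:latest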

  1. Provide a URI for docker.cfg that will be pulled by each executor before fetching the image.
  2. Specify the absolute path to docker.cfg as configuration to the Mesos agent (see docker_config in Mesos agent configuration).
    mesos-agent --docker_config=file:///home/vagrant/.docker/config.json
  3. Specify the complete JSON config as a String to the Mesos agent using the same configuration. Example:
    --docker_config="{ \
      \"auths\": { \
        \"https://index.docker.io/v1/\": { \
          \"auth\": \"xXxXxXxXxXx=\", \
          \"email\": \"username@example.com\" \
        } \
      } \
    }"
  4. Manually place this docker.cfg file on each slave running a Mesos agent.
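However you choose to distribute it, the docker.cfg/config.json file itself can be produced with a normal docker login. A hedged sketch, using hypothetical registry and host names:

# Log in once on a workstation; this writes ~/.docker/config.json.
docker login my-registry.example.com

# Option 4: copy the file to each slave running a Mesos agent.
scp ~/.docker/config.json vagrant@mesos-slave-1:/home/vagrant/.docker/config.json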

3. Run the Mesos Slave in Docker Containerizer Mode

Mesos 0.20 added support for launching tasks that contain Docker images, so if you are on a version below that, sadly, this fix won't work for you. To enable the Docker containerizer, the Mesos agent must be started with the following parameter:

mesos-agent --containerizers=docker,mesos

Note that docker should come before mesos so that the Docker containerizer is tried first.
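Putting the pieces together, a full agent invocation might look like the following sketch (the master address and config-file path are placeholders):

mesos-agent --master=mesos-master.example.com:5050 \
  --containerizers=docker,mesos \
  --docker_config=file:///home/vagrant/.docker/config.json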

And that is all you need to do to have the slave fetch your Airflow Docker image before actually running the Airflow command in your task.


