{{announcement.body}}
{{announcement.title}}

Airflow to Orchestrate Machine Learning Algorithms

DZone 's Guide to

Airflow to Orchestrate Machine Learning Algorithms

This post suggests a possible, quick-to-implement solution for managing, scheduling, and running workflows.

· AI Zone ·
Free Resource

As a data engineer, a big challenge is to manage, schedule, and run workflows to prepare data, generate reports, and run algorithms. This post suggests a possible, quick-to-implement solution for these activities with a simple example.Image title

Managing and scheduling a data analysis workflow can be done in a lot of ways, but the more common ways are:

  • Cron job, directly on the operating system
  • Jenkins

Both ways don't scale well with the dimensions of the jobs: if a job fails at one of the stages of the workflow, your workflow needs to be restarted from scratch. If you want to create a better data pipeline with an easy-to-learn interface and a lot of useful features, you need to use a computational orchestrator like Apache Airflow.

Airflow is computational orchestrator because you can manage every kind of operation if you can write a workflow for that. This means that you can use Airflow to author workflows as Directed Acyclic Graphs (DAGs).

Airflow is composed of two elements: web server and scheduler.

A web server runs the user interface and visualizes pipelines running in production, monitors progress, and troubleshoots issues when needed. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command-line utilities make performing complex surgeries on DAGs a snap.

I think the most important feature of Airflow is that workflows are defined as code. In this way, they become more maintainable, versionable, testable, and collaborative. With Airflow, you can schedule a pipeline as complex as you want. The DAG is created with code and not with GUI tools.

But what is a DAG? DAG (Directed Acyclic Graph) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. Let me give an example: a simple DAG could consist of three tasks: A, B, and C. It could say that A has to run successfully before B can run, but C can run anytime. It could say that task A times out after 5 minutes, and B can be restarted up to 5 times in case it fails. It might also say that the workflow will run every night at 10 pm, but shouldn't start until a certain date. In this way, a DAG describes how you want to carry out your workflow; but notice that we haven’t said anything about what we actually want to do! A, B, and C could be anything. Maybe A prepares data for B to analyze while C sends an email.

[caption id="" align="aligncenter" width="580"]DAG example [http://michal.karzynski.pl] DAG example [http://michal.karzynski.pl][/caption]

DAGs are written in python, so if B is a clustering algorithm like DBSCAN that clusters the data prepared at stage A; you can use every machine learning library (as Scikit-learn, for example) that helps for this task. Airflow implements the python operator (and much more) that runs a defined python function, and I think this is very useful to easily implement a machine learning workflow, as we can see in this example: the scikit-learn demo of K-Means clustering on the handwritten digits data.

You can find all the code here. In this repo, here is everything we need to run the example:

  • dockerfile for Airflow with scikit-learn library
  • docker-compose to set-up the environment
  • DAG that schedules the example of k-means algorithms.

In the following code, we can see the DAG runs the scikit-learn k-means example.

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from algorithm import test

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2015, 6, 1),
    "email": ["airflow@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}


dag = DAG('k-means', default_args=default_args, schedule_interval=timedelta(1))

PythonOperator(dag=dag,
               task_id='k-means',
               provide_context=False,
               python_callable=test.my_test
               #op_args=['arguments_passed_to_callable'],
               #op_kwargs={'keyword_argument':'which will be passed to function'}
               )

As we can see, the setup is very simple, and the Airflow interface is very clear and easy to learn.

Conclusion

I think that airflow is a very powerful and easy-to-use tool that enables really fast research to the production process for an ML algorithm. With Docker and Docker Compose, the environment setup is very easy and repeatable, and it can be shared with all the data scientists on your team. In this way, the data scientists can run, on their own laptops, the model in the same way the model will be run in the production environment. Really cool, don’t you think?

Topics:
machine learning ,data engineering ,artificial intelligence ,airflow ,tutorial ,ml algorithms

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}