Airflow to Orchestrate Machine Learning Algorithms

This post suggests a possible, quick-to-implement solution for managing, scheduling, and running workflows.

By Claudio Masolo · Apr. 02, 19 · Tutorial

As a data engineer, a big challenge is managing, scheduling, and running workflows to prepare data, generate reports, and run algorithms. This post suggests a possible, quick-to-implement solution for these activities, with a simple example.

Managing and scheduling a data analysis workflow can be done in many ways, but the most common are:

  • Cron jobs, directly on the operating system
  • Jenkins

Neither approach scales well with the size of the jobs: if a job fails at one of the stages of the workflow, the whole workflow has to be restarted from scratch. If you want to build a better data pipeline, with an easy-to-learn interface and a lot of useful features, you need a computational orchestrator like Apache Airflow.

Airflow is a computational orchestrator: it can manage any kind of operation, as long as you can express it as a workflow. In practice, this means you use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks.

Airflow is composed of two elements: a web server and a scheduler.

The web server runs the user interface, which visualizes the pipelines running in production, monitors their progress, and helps troubleshoot issues when needed. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command-line utilities make performing complex surgeries on DAGs a snap.

I think the most important feature of Airflow is that workflows are defined as code. This makes them more maintainable, versionable, testable, and collaborative. With Airflow, you can schedule a pipeline as complex as you want, and the DAG is created with code rather than with GUI tools.

But what is a DAG? A DAG (directed acyclic graph) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. Let me give an example: a simple DAG could consist of three tasks: A, B, and C. It could say that A has to run successfully before B can run, but C can run anytime. It could say that task A times out after 5 minutes, and that B can be restarted up to 5 times if it fails. It might also say that the workflow runs every night at 10 PM, but shouldn't start until a certain date. In this way, a DAG describes how you want to carry out your workflow; but notice that we haven't said anything about what we actually want to do! A, B, and C could be anything. Maybe A prepares data for B to analyze while C sends an email.

[caption id="" align="aligncenter" width="580"] dag example [http://michal.karzynski.pl] dag example [http://michal.karzynski.pl][/caption]

DAGs are written in Python, so if B is a clustering algorithm like DBSCAN that clusters the data prepared at stage A, you can use any machine learning library (scikit-learn, for example) that helps with the task. Airflow provides the PythonOperator (and much more), which runs a defined Python function, and I think this makes it very easy to implement a machine learning workflow, as we can see in this example: the scikit-learn demo of k-means clustering on the handwritten digits data.

You can find all the code here. The repo contains everything we need to run the example:

  • A Dockerfile for Airflow with the scikit-learn library
  • A docker-compose file to set up the environment
  • A DAG that schedules the k-means example

In the following code, we can see the DAG that runs the scikit-learn k-means example.

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from algorithm import test

# Default arguments applied to every task in the DAG
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2015, 6, 1),
    "email": ["airflow@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}


# The DAG runs once a day (schedule_interval=timedelta(1))
dag = DAG('k-means', default_args=default_args, schedule_interval=timedelta(1))

# A single task that calls the k-means example function
PythonOperator(dag=dag,
               task_id='k-means',
               provide_context=False,
               python_callable=test.my_test,
               # op_args=['arguments_passed_to_callable'],
               # op_kwargs={'keyword_argument': 'which will be passed to function'}
               )

As we can see, the setup is very simple, and the Airflow interface is very clear and easy to learn.
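
The DAG above imports test.my_test from the repo's algorithm package. As a rough sketch of what such a callable might contain (the actual function in the repo may differ), the scikit-learn k-means demo on the handwritten digits data boils down to something like this:

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale


def my_test():
    # Load and scale the handwritten digits dataset
    digits = load_digits()
    data = scale(digits.data)
    n_digits = len(set(digits.target))

    # Fit k-means with one cluster per digit class
    kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
    kmeans.fit(data)

    # Log simple clustering quality metrics
    print("inertia: %.0f" % kmeans.inertia_)
    print("v-measure: %.3f" % metrics.v_measure_score(digits.target, kmeans.labels_))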

Conclusion

I think Airflow is a very powerful and easy-to-use tool that enables a really fast path from research to production for an ML algorithm. With Docker and Docker Compose, the environment setup is easy and repeatable, and it can be shared with all the data scientists on your team. This way, data scientists can run the model on their own laptops in the same way it will run in the production environment. Really cool, don't you think?
