Orchestrating dbt Workflows: The Duel of Apache Airflow and AWS Step Functions

This article compares Apache Airflow and AWS Step Functions for orchestrating data pipelines with dbt (data build tool).

By Suhas Jangoan · Feb. 22, 24 · Analysis

Think of data pipeline orchestration as the backstage crew of a theater, ensuring every scene flows seamlessly into the next. In the data world, tools like Apache Airflow and AWS Step Functions play that role: they keep pipelines running in the right order and make sure the right data is available at the right time. Both are often used alongside dbt (data build tool), which has emerged as a powerful tool for transforming data in a warehouse.

In this article, we introduce dbt, Apache Airflow, and AWS Step Functions, and then weigh the pros and cons of Apache Airflow and AWS Step Functions for orchestrating data pipelines that involve dbt. Note that dbt comes in two forms: dbt Cloud, the paid offering, and dbt-core, the free open-source version. This article focuses on dbt-core.

dbt (Data Build Tool)

dbt-core is an open-source command-line tool that enables data analysts and engineers to transform data in their warehouses more effectively. It allows users to write modular SQL queries, which it then runs on top of the data warehouse in the appropriate order with respect to their dependencies. 
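To make "the appropriate order with respect to their dependencies" concrete, here is a minimal sketch of dependency ordering using Python's standard library. It is not how dbt is implemented internally; the model names and the dependency map are hypothetical and only illustrate the idea that a model runs after the models it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical models, each mapped to the set of models it depends on.
MODEL_DEPENDENCIES = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "sales": {"orders_enriched"},
}


def execution_order(dependencies):
    """Return model names so that every model comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(MODEL_DEPENDENCIES)
print(order)
```

The two staging models may appear in either order, since neither depends on the other; the only guarantee is that `orders_enriched` follows both and `sales` comes last.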

Key Features

  • Version control: It integrates with Git to help track changes, collaborate, and deploy code.
  • Documentation: Autogenerated documentation and a searchable data catalog are created based on the dbt project.
  • Modularity: Reusable SQL models can be referenced and combined to build complex transformations.
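In dbt, one model references another with the `ref()` function inside its SQL, and those references define the dependencies between models. The sketch below shows one rough way to list them, assuming the simple single-argument form `{{ ref('model_name') }}`. The regular expression and the sample SQL are illustrative assumptions, not part of dbt, which resolves references through its own parser.

```python
import re

# Matches the single-argument form only, e.g. {{ ref('stg_orders') }}.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def referenced_models(sql):
    """Return the model names used in single-argument ref() calls, in order."""
    return REF_PATTERN.findall(sql)


# Hypothetical model SQL that combines two upstream models.
SALES_SQL = """
select o.order_id, c.customer_name, o.amount
from {{ ref('stg_orders') }} as o
join {{ ref("stg_customers") }} as c on o.customer_id = c.customer_id
"""

print(referenced_models(SALES_SQL))
```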

Airflow vs. AWS Step Functions for dbt Orchestration

Apache Airflow

Apache Airflow is an open-source tool for creating, scheduling, and monitoring workflows. Data engineers and analysts use it to manage complex data pipelines.

Key Features

  • Extensibility: Custom operators, executors, and hooks can be written to extend Airflow’s functionality.
  • Scalability: Offers dynamic pipeline generation and can scale to handle multiple data pipeline workflows.
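Because Airflow DAGs are plain Python, tasks can be generated in a loop rather than written out one by one. The sketch below illustrates that idea without importing Airflow: it builds one task ID and one dbt command per model. The helper name and the model names are hypothetical; in a real DAG file each pair would be handed to an operator such as `BashOperator`.

```python
def dbt_task_specs(models, command="build"):
    """Return one (task_id, bash_command) pair per dbt model."""
    return [
        (f"dbt_{command}_{model}", f"dbt {command} --select {model}")
        for model in models
    ]


# In a DAG file, each pair could become
# BashOperator(task_id=task_id, bash_command=bash_command, dag=dag).
for task_id, bash_command in dbt_task_specs(["stg_orders", "sales"]):
    print(task_id, "->", bash_command)
```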

Example: DAG

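Python

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
# Requires the apache-airflow-providers-slack package.
from airflow.providers.slack.operators.slack import SlackAPIPostOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    # Placeholder date: use a fixed start_date rather than datetime.now(),
    # which moves on every parse of the DAG file.
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('dbt_daily_job',
          default_args=default_args,
          description='A simple DAG to run dbt jobs',
          schedule_interval=timedelta(days=1),
          # Do not backfill the runs between start_date and today.
          catchup=False)

# Build the "sales" model; --select (short form -s) picks what dbt builds.
dbt_run = BashOperator(
    task_id='dbt_run',
    bash_command='dbt build --select sales',
    dag=dag,
)

slack_notify = SlackAPIPostOperator(
    task_id='slack_notify',
    dag=dag,
    # Replace with your actual Slack notification code
)

# Send the Slack notification only after the dbt build finishes.
dbt_run >> slack_notify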

Pros

  • Flexibility: Apache Airflow offers unparalleled flexibility with the ability to define custom operators and is not limited to AWS resources.
  • Community support: A vibrant open-source community actively contributes plugins and operators that provide extended functionalities.
  • Complex workflows: Better suited to complex task dependencies and can manage task orchestration across various systems.

Cons

  • Operational overhead: Requires management of underlying infrastructure unless managed services like Astronomer or Google Cloud Composer are used.
  • Learning curve: The rich feature set comes with a complexity that may present a steeper learning curve for some users.

AWS Step Functions

AWS Step Functions is a fully managed service provided by Amazon Web Services that makes it easier to orchestrate microservices, serverless applications, and complex workflows. It uses a state machine model to define and execute workflows, which can consist of various AWS services like Lambda, ECS, SageMaker, and more.
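As a rough illustration of the state machine model, the sketch below represents a workflow as a Python dictionary shaped like an Amazon States Language definition and checks that its transitions are consistent. The checker is a simplified assumption written for this article, not an AWS tool: it skips Choice states and covers only the `StartAt`, `Next`, and `End` fields.

```python
def validate_state_machine(definition):
    """Return a list of problems found in a simplified ASL-style definition."""
    problems = []
    states = definition.get("States", {})
    if definition.get("StartAt") not in states:
        problems.append("StartAt does not name a defined state")
    for name, state in states.items():
        if state.get("Type") == "Choice":
            continue  # Choice states branch through Choices/Default; not covered here.
        next_state = state.get("Next")
        is_terminal = state.get("End") is True or state.get("Type") in ("Succeed", "Fail")
        if next_state is None and not is_terminal:
            problems.append(f"{name} has neither Next nor End")
        if next_state is not None and next_state not in states:
            problems.append(f"{name} points to unknown state {next_state}")
    return problems


# Hypothetical two-state workflow: run dbt on ECS, then finish.
DEFINITION = {
    "StartAt": "RunDbtJob",
    "States": {
        "RunDbtJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
    },
}

print(validate_state_machine(DEFINITION))
```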

Key Features

  • Serverless operation: No need to manage infrastructure as AWS provides a managed service.
  • Integration with AWS Services: Seamless connection to AWS services is supported for complex orchestration.

Example: State Machine CloudFormation Template (Step Functions)

YAML

AWSTemplateFormatVersion: '2010-09-09'
Description: State Machine to run a dbt job

Resources:
  DbtStateMachine:
    Type: 'AWS::StepFunctions::StateMachine'
    Properties:
      StateMachineName: DbtStateMachine
      RoleArn: !Sub 'arn:aws:iam::${AWS::AccountId}:role/service-role/StepFunctions-ECSTaskRole'
      # DefinitionString takes the Amazon States Language definition as a JSON string.
      # The cluster, task definition, and subnet IDs below are placeholders.
      DefinitionString: !Sub |
        {
          "Comment": "A Step Functions state machine that executes a dbt job using an ECS task.",
          "StartAt": "RunDbtJob",
          "States": {
            "RunDbtJob": {
              "Type": "Task",
              "Resource": "arn:aws:states:::ecs:runTask.sync",
              "Parameters": {
                "Cluster": "arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:cluster/MyECSCluster",
                "TaskDefinition": "arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:task-definition/MyDbtTaskDefinition",
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                  "AwsvpcConfiguration": {
                    "Subnets": [
                      "subnet-0193156582abfef1",
                      "subnet-abcjkl0890456789"
                    ],
                    "AssignPublicIp": "ENABLED"
                  }
                }
              },
              "End": true
            }
          }
        }

Outputs:
  StateMachineArn:
    Description: The ARN of the dbt state machine
    Value: !Ref DbtStateMachine

When running dbt workflows on Amazon ECS with AWS Fargate, you can define the dbt command in the ECS task definition (MyDbtTaskDefinition in the template above). It is also common to build a Docker image that contains not only the dbt environment but also the specific dbt commands you wish to run.
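A third option is to keep one generic dbt image and pass the command at run time through a container override in the `ecs:runTask` parameters. The sketch below builds such a parameters block in Python. The helper name, the container name `dbt`, and the ARNs are hypothetical, the network configuration is omitted for brevity, and the sketch assumes the image does not define its own entrypoint that already wraps dbt.

```python
import json


def dbt_run_task_parameters(cluster_arn, task_definition_arn, container_name, dbt_args):
    """Build ecs:runTask parameters that override the container command."""
    return {
        "Cluster": cluster_arn,
        "TaskDefinition": task_definition_arn,
        "LaunchType": "FARGATE",
        "Overrides": {
            "ContainerOverrides": [
                # Name must match a container defined in the task definition.
                {"Name": container_name, "Command": ["dbt", *dbt_args]}
            ]
        },
    }


params = dbt_run_task_parameters(
    cluster_arn="arn:aws:ecs:us-east-1:123456789012:cluster/MyECSCluster",
    task_definition_arn="arn:aws:ecs:us-east-1:123456789012:task-definition/MyDbtTaskDefinition",
    container_name="dbt",
    dbt_args=["build", "--select", "sales"],
)
print(json.dumps(params, indent=2))
```

The trade-off is flexibility versus reproducibility: overrides let one image serve many jobs, while baking the command into the image ties each job to a specific, versioned build.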

Pros

  • Fully managed service: AWS manages the scaling and operation under the hood, leading to reduced operational burden.
  • AWS integration: Natural fit for AWS-centric environments, allowing easy integration of various AWS services.
  • Reliability: Step Functions provides a high level of reliability and support, backed by an AWS SLA.

Cons

  • Cost: For high-volume workflows, pricing might be higher than running a self-hosted or cloud-provider-managed Airflow instance. Step Functions incurs costs based on the number of state transitions.
  • AWS lock-in: Tightly coupled with AWS services, which can be a downside if you're aiming for a cloud-agnostic architecture.
  • Complexity in handling large workflows: While capable, it can become difficult to manage larger, more complex workflows compared to using Airflow's DAGs. There are also limits on the number of parallel executions of a state machine.
  • Learning curve: The service also presents a learning curve with specific paradigms, such as the Amazon States Language.
  • Scheduling: Step Functions has to rely on other AWS services, such as Amazon EventBridge, for scheduling.
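To illustrate the scheduling point, the sketch below shows how an EventBridge rule could start a state machine once a day. It uses the `put_rule` and `put_targets` calls of the boto3 `events` client, but takes the client as an argument so it can be exercised without AWS credentials. The rule name, ARNs, target ID, and the `RecordingClient` stand-in are all hypothetical.

```python
def schedule_state_machine(events_client, rule_name, state_machine_arn, role_arn,
                           schedule_expression="rate(1 day)"):
    """Create or update an EventBridge rule that starts the state machine on a schedule."""
    events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule_expression,
        State="ENABLED",
    )
    # role_arn must allow EventBridge to call states:StartExecution.
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "dbt-state-machine", "Arn": state_machine_arn, "RoleArn": role_arn}],
    )


class RecordingClient:
    """Stand-in for boto3.client("events") that records calls instead of calling AWS."""

    def __init__(self):
        self.calls = []

    def put_rule(self, **kwargs):
        self.calls.append(("put_rule", kwargs))

    def put_targets(self, **kwargs):
        self.calls.append(("put_targets", kwargs))


client = RecordingClient()
schedule_state_machine(
    client,
    rule_name="dbt-daily",
    state_machine_arn="arn:aws:states:us-east-1:123456789012:stateMachine:DbtStateMachine",
    role_arn="arn:aws:iam::123456789012:role/EventBridgeStartDbt",
)
for name, kwargs in client.calls:
    print(name, kwargs)
```

Against a real account, the same function would be called with `boto3.client("events")` in place of the recording stand-in.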

Summary

Choosing the right tool for orchestrating dbt workflows comes down to assessing specific features and how they align with a team's needs. The main attributes that inform this decision include customization, cloud alignment, infrastructure flexibility, managed services, and cost considerations.

Customization and Extensibility

Apache Airflow is highly customizable and extensible, allowing teams to create tailored operators and workflows for complex requirements.

Integration With AWS

AWS Step Functions is the clear winner for teams operating solely within AWS, offering deep integration with the broader AWS ecosystem.

Infrastructure Flexibility

Apache Airflow supports a wide array of environments, making it ideal for multi-cloud or on-premises deployments.

Managed Services

Here, it’s a tie. For managed services, teams can opt for Amazon Managed Workflows for Apache Airflow (MWAA) for an AWS-centric approach or a vendor like Astronomer for hosting Airflow in different environments. Platforms like Dagster offer features similar to Airflow's and are also available as managed services. This category is highly competitive, and the choice will depend on the level of integration required and vendor preference.

Cost at Scale

Apache Airflow may prove more cost-effective at scale, given its open-source nature and the potential for optimized cloud or on-premises deployment. AWS Step Functions may be more economical at smaller scales or for teams with existing AWS infrastructure.
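Since Step Functions cost grows with state transitions, a back-of-the-envelope estimate helps show where the scale crossover might lie. The sketch below multiplies runs, transitions per run, and a per-transition price. The price used here is an illustrative assumption; check the current AWS pricing page for your region and workflow type. Free-tier allowances and the cost of the compute that runs dbt are ignored.

```python
def monthly_transition_cost(runs_per_day, transitions_per_run, price_per_transition, days=30):
    """Estimate the monthly state-transition charge for one workflow."""
    return runs_per_day * transitions_per_run * days * price_per_transition


# Assumed price per state transition, for illustration only.
ASSUMED_PRICE = 0.000025

small = monthly_transition_cost(runs_per_day=1, transitions_per_run=3,
                                price_per_transition=ASSUMED_PRICE)
large = monthly_transition_cost(runs_per_day=1000, transitions_per_run=50,
                                price_per_transition=ASSUMED_PRICE)
print(f"daily dbt job: ${small:.4f} per month")
print(f"high-volume workflow: ${large:.2f} per month")
```

A fair comparison would set these figures against the fixed monthly cost of the servers or managed environment that an Airflow deployment needs regardless of volume.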

Conclusion

The choice between Apache Airflow and AWS Step Functions for orchestrating dbt workflows is nuanced.

For operations deeply rooted in AWS with a preference for serverless execution and minimal maintenance, AWS Step Functions is the recommended choice. 

For those requiring robust customizability, diverse infrastructure support, or cost-effective scalability, Apache Airflow emerges as the optimal solution, whether self-managed or run through a platform like Astronomer or MWAA (AWS-managed).
