
Data Pipeline Orchestration

In this article, I share an idea of what a data pipeline orchestration model might look like.

By Ivan Nikitsenka · Aug. 22, 22 · Opinion

The spectrum of tasks data scientists and engineers have to cover today is tremendous; the best description of it can be found in the article The AI Hierarchy of Needs. A long road of technical problems has to be traveled before you can start working on the real business problem and build valuable data products. Most of the time is spent on infrastructure, data security, data movement, processing, storage, provisioning, testing, and operational readiness/health.

As a data engineer, I need a single interface or operating system with composable capabilities, provided in distributed hybrid environments and extensible for future implementations.

Providing such a tool will help data scientists and engineers keep their focus on solving business problems without diving into the technical complexity of implementation details and operational readiness.

GOAL: Create a unified data pipeline orchestration interface for data engineers.

What Is a Data Pipeline?

The data derived from user and application activity can be categorized into two groups:

  • Operational - internal system events and metrics. Used to optimize the performance and reliability of the application.
  • Analytics - historical data created by users while using the application. Used to optimize the user experience.     

This derived data brings many byproducts created by service operators, data engineers, and analysts, such as KPI reports, monitoring dashboards, machine learning models, and so on. To create such data products, data needs to be collected from the source, cleaned, processed as a stream, and sent to a sink such as a data warehouse or data lake.

A Data Pipeline is a term used to describe a data flow consisting of reading, processing, and storage tasks that ingest, move, and transform raw data from one or more sources to a destination.

Each of the elements represents a task, with resources, responsible for one purpose in the pipeline body. For example, a source provides the capability to read data from other systems. It could be a Kafka connector, a REST API poller, or a webhook subscriber. It could also be a simple bash command in a Linux OS, but in the world of highly intensive big data, it also needs resources provisioned in a distributed processing environment.
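
To make the idea concrete, here is a minimal sketch of a source element written as a Python generator that polls a REST API; the endpoint, polling interval, and record format are hypothetical placeholders rather than any specific connector.

Python

# A minimal sketch of a "source" element: a generator that polls an HTTP
# endpoint and yields each response as a raw record. The URL and interval
# are hypothetical placeholders.
import time

import requests


def http_source(url: str, interval_seconds: int = 60):
    while True:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        yield response.json()
        time.sleep(interval_seconds)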

The element itself is responsible for: 

  • Input validation and acceptance (both from a user and other elements)
  • Resource provisioning
  • Tasks and services execution
  • State management (scheduling, retrying, error handling)
  • Output provisioning

There are also dependencies between them, and outputs from a single command or resource should be directed as inputs to other elements. Assuming the life cycle of a component consists of different states (e.g., provisioning, ready, healthy, error), handling it requires a management process that can be automated.
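
As a rough sketch of what automating that management process could look like, the snippet below models an element with the states mentioned above and records its outputs as it moves through them; the Element class and its provisioning logic are hypothetical placeholders.

Python

# A minimal sketch of element life-cycle handling. State names follow the
# examples in the text; the provisioning logic is a hypothetical placeholder.
from enum import Enum


class State(Enum):
    PROVISIONING = "provisioning"
    READY = "ready"
    HEALTHY = "healthy"
    ERROR = "error"


class Element:
    def __init__(self, name: str):
        self.name = name
        self.state = State.PROVISIONING
        self.outputs: dict = {}

    def provision(self) -> None:
        try:
            # Delegate resource creation to a downstream system here.
            self.outputs["resource_id"] = f"{self.name}-resource"
            self.state = State.READY
        except Exception as error:
            self.outputs["error_message"] = str(error)
            self.state = State.ERROR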

Pipeline Orchestration

Single components by themselves don't solve complex data engineering problems. For example, to prepare data for ML model training, we likely need to read it from the source, validate it and filter out non-applicable outliers, perform aggregations with transformations, and send it to the storage system. Orchestration is the process of composing or building complex structures from single-responsibility blocks, elements, or components.

Capabilities of the orchestration layer:

  • Connecting components into workflows/chains of tasks/steps
  • Provisioning resources in downstream systems
  • Input/output and state management

Non-functional capabilities

  • Extensible 
  • Observable
  • Distributed processing in a hybrid environment

A close analog is the operating system command-line interface, which allows you to combine single commands into a chain of functions via piping. The commands themselves are very restricted in their responsibilities but quite stable and highly performant, following the Unix philosophy. In the early days, operating system terminals were the single API or entry point for communication between users and technology. Today we have the same business needs, but with data-intensive applications working in a distributed cloud environment.
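
As a rough Python equivalent of that piping idea (the function names and the filtering rule are hypothetical), each function below is a single-responsibility block, and composing them yields a small pipeline:

Python

# A minimal sketch of Unix-style piping expressed as composed generators,
# roughly `cat input.txt | filter | write output.txt`. Names are placeholders.
def read_lines(path):
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n")


def drop_empty(records):
    for record in records:
        if record.strip():
            yield record


def to_sink(records, out_path):
    with open(out_path, "w") as handle:
        for record in records:
            handle.write(record + "\n")


to_sink(drop_empty(read_lines("input.txt")), "output.txt")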

Orchestration Products Available Today

The AI & Data landscape provided by the Cloud Native Computing Foundation shows a huge number of tools available today for data engineers, and it becomes enormous if we also add DevOps tools and services from cloud providers and SaaS companies. I'll be back with a detailed analysis and comparison of them later, maybe, if I find a time-stop machine :) For now, let's look into Apache Airflow and Terraform as the most popular tools for workflow and infrastructure orchestration.

Inside Apache Airflow, a DAG (Directed Acyclic Graph) is commonly used for connecting tasks into workflows. This solves the problem of connecting elements and defining the dependencies used later for task execution. But how can resources be provisioned in an external system? That is why Airflow has operators, with a lot of providers available. This also solves the problem of extensibility.
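
For illustration, here is a minimal sketch of such a DAG using Airflow 2.x's TaskFlow API; the task names and their logic are placeholders, not part of any real pipeline.

Python

# A minimal sketch of an Airflow DAG: three tasks connected into a workflow.
# Task names and logic are illustrative placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval=None, start_date=datetime(2022, 1, 1), catchup=False)
def example_pipeline():
    @task
    def read_source() -> list:
        return [1, 2, 3]

    @task
    def transform(records: list) -> list:
        return [value * 2 for value in records]

    @task
    def write_sink(records: list) -> None:
        print(records)

    write_sink(transform(read_source()))


example_pipeline()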

Using Terraform, a DevOps engineer can orchestrate complex structures and use a single API interface for provisioning them. Terraform uses a declarative model for creating infrastructure as code, in contrast with Airflow's imperative programming model in Python. As with Airflow, there is a provider delegation and extension mechanism with a lot of vendors available.

All these tools have different interfaces and can hardly be called simple. There is no single, unified model for what an orchestration element should look like…

Orchestration Model

Drawing on these orchestration tools, let's try to define what the unified structure of an orchestration component should be.

When we want to create something, we need blocks. There are many words that can be used for such a single unit of building: a component, a task, a block, a node, an object, a function, an entity, a brick, a resource, an atom, a quantum, and so on. Having a block as the main building unit is common in many creative engineering games and toys: for example, Minecraft, Lego bricks, or puzzles.

Let's imagine we have such an element for data pipeline orchestration. What should it consist of?

JSON

{
  "type": "source",
  "version": "0.0.1",
  "name": "exampleHttpSource",
  "provider": "AWS",
  "extension": "S3SourceExtension",
  "state": "READY",
  "configuration": {},
  "outputs": {},
  "dependsOn": []
}

Type - is the client's property that helps easily define the purpose or behavior of an element (source, stream, transform, sink, schema, etc.).

Provider - is a cloud provider, SaaS service vendor, or downstream system to which resource provisioning is delegated.

Extension - is an implementation class with all logic for resource creation. The idea for extensions comes from the microkernel pattern.

Configuration - is a placeholder for all the properties required to provision the element in the downstream system or execute a command.

The purpose of the element is not only to provision the resource but also to provide feedback about it. The state will show if it was created successfully, is still in progress, or if some error happened. 

At each step of the element life cycle, some outputs might be generated, such as id, name, or error message. All this meta information can be found in the outputs section.

The information in the element should be enough for clients to understand the purpose and how to use it. At the same time, the technical specifications encapsulated in the implementation class and configuration section will give providers instructions on how to execute it.
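
To show how an orchestrator might act on such an element, here is a minimal sketch that looks up the extension named in the JSON, delegates provisioning to it, and writes the resulting state and outputs back; the extension registry and the S3SourceExtension class are hypothetical placeholders.

Python

# A minimal sketch of microkernel-style dispatch: the core looks up the
# extension named in the element and delegates provisioning to it.
# The registry and extension class are hypothetical placeholders.
import json


class S3SourceExtension:
    def provision(self, configuration: dict) -> dict:
        # Create the resource in the downstream system and return its outputs.
        return {"resource_id": "example-source"}


EXTENSIONS = {"S3SourceExtension": S3SourceExtension}


def execute(element_json: str) -> dict:
    element = json.loads(element_json)
    extension = EXTENSIONS[element["extension"]]()
    try:
        element["outputs"] = extension.provision(element["configuration"])
        element["state"] = "READY"
    except Exception as error:
        element["outputs"] = {"error_message": str(error)}
        element["state"] = "ERROR"
    return element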

The life of a data scientist is hard today: more time goes into solving technical problems than business ones. Orchestration tools such as Airflow and Terraform help a lot; however, there is no single interface or standard for what the orchestration model should look like. In this article, I shared an idea of what such a unified model might look like. It might be helpful for data engineers and architects who want to build domain-driven solutions that are abstracted from and agnostic to the fast-growing and changing development environment. It isn't a complete solution to the problem we have today but a step toward where we can be tomorrow.
