Navigating the Distributed Data Pipelines: An Overview and Guide for Your Performance Management Strategy

An overview of a modern big data stack, the performance requirements at both the individual application level and the cluster and workload level of these stacks, and architectures for these complex environments.

This article is featured in the new DZone Guide to Big Data: Volume, Variety, and Velocity. Get your free copy for insightful articles, industry stats, and more!

More than 10,000 enterprises across the globe rely on a data stack made up of multiple distributed systems. These enterprises span a wide range of verticals (finance, healthcare, technology, and more) and build applications on a distributed big data stack, yet some are not fully aware of the performance management challenges that often arise. This piece provides an overview of what a modern big data stack looks like, addresses the performance requirements at the level of individual applications as well as of whole clusters and workloads, and explores what type of architecture can provide automated solutions for these complex environments.

The Big Data Stack and Operational Performance Requirements

Areas like healthcare, genomics, financial services, self-driving technology, government, and media are building mission-critical applications on what's known as the big data stack. The big data stack is unique in that it is composed of multiple distributed systems. While organizations vary in how they deploy the technology, the big data stack in most enterprises goes through the following evolution:

  • ETL: Storage systems, such as HDFS, S3, and Azure Blob Storage (ABS), house the large volumes of structured, semi-structured, and unstructured data. Distributed processing engines, like MapReduce, come in for the extraction, cleaning, and transformation of the data.
  • BI: SQL systems like Impala, Presto, LLAP, Drill, BigQuery, Redshift, or Azure SQL DW are added to the stack, sometimes alongside incumbent MPP SQL systems like Teradata and Vertica. Unlike the traditional MPP systems, the newer ones have been built to query data stored in a separate distributed storage system like HDFS, S3, or ABS. These systems power the interactive SQL queries that are common in BI workloads.
  • Data science: As the big data stack matures, it starts taking on more data science workloads that leverage machine learning and AI. This stage is usually when Spark starts to be used more and more.
  • Data streaming: Over time, enterprises begin to understand the importance of making data-driven decisions in near real-time, as well as how to overcome the challenges in implementing them. Usually, at this point in the evolution, systems like Kafka, Cassandra, and HBase are added to the big data stack to support applications that ingest and process data in a continuous streaming fashion.

Evolution of the big data stack in an enterprise.

With so many enterprises worldwide running applications in production on a distributed big data stack (composed of three or more distributed systems), performance challenges are no surprise. The stacks have many moving parts, which makes it very hard to get answers when something goes wrong. When there's a breakdown, organizations often find themselves scrambling to understand the following:

  • Failure: What caused this application to fail, and how can I fix it?
  • Stuck: This application seems to have made little progress in the last hour. Where is it stuck?
  • Runaway: Will this application ever finish, or will it finish in a reasonable amount of time?
  • SLA: Will this application meet its SLA?
  • Change: Is the behavior (e.g., performance, resource usage) of this application very different from the past? If so, in what way and why?
  • Rogue/victim: Is this application causing problems on my cluster or is the performance of this application being affected by one or more other applications?
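Two of these questions, "Stuck" and "Change," lend themselves to simple checks once progress samples and run-time history are available. The Python sketch below is illustrative only: the function names, the (timestamp, fraction complete) sample format, and the thresholds are assumptions made for this example, not part of any particular tool. It flags an application as stuck when its progress has barely moved over the last hour, and flags a run as changed when its duration deviates strongly from past runs, using a median-based (modified z-score) test.

```python
from statistics import median


def is_stuck(progress_samples, window_s=3600, min_delta=0.01):
    """Return True if progress advanced by less than `min_delta` over the last window.

    `progress_samples` is a hypothetical list of (timestamp_s, fraction_complete)
    tuples sorted by timestamp.
    """
    if not progress_samples:
        return False
    latest_ts, latest_progress = progress_samples[-1]
    # Find the most recent sample that is at least one full window old.
    baseline = None
    for ts, progress in progress_samples:
        if ts <= latest_ts - window_s:
            baseline = progress
        else:
            break
    if baseline is None:
        return False  # observed for less than one window; too early to judge
    return latest_progress - baseline < min_delta


def is_changed(history, current, threshold=3.5):
    """Return True if `current` deviates strongly from the values in `history`.

    Uses the modified z-score (median and median absolute deviation), which is
    less sensitive to a few outlier runs than mean and standard deviation.
    """
    if len(history) < 3:
        return False  # not enough past runs to compare against
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        return current != med
    score = 0.6745 * (current - med) / mad
    return abs(score) > threshold


# Hypothetical run durations in seconds for the same daily job.
past_durations = [610, 595, 620, 605, 600]
print(is_changed(past_durations, 1800))
print(is_stuck([(0, 0.40), (1800, 0.402), (3600, 0.403)]))
```

In practice, the remaining questions (failure root cause, SLA risk, rogue/victim) need data from several layers of the stack, which is why the collection and processing requirements described later matter.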

Many operational performance requirements also arise at the "macro" level of clusters and workloads, rather than at the level of individual applications. These include:

  • Configuring resource allocation policies to meet SLAs in multi-tenant clusters.
  • Detecting rogue applications that can affect the performance of SLA-bound applications through a variety of low-level resource interactions.
  • Tuning the hundreds of configuration settings that distributed systems are notorious for, in order to get the desired performance.
  • Tuning data partitioning and storage layout.
  • Optimizing dollar costs on the cloud.
  • Capacity planning using predictive analysis to account for workload growth proactively.
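As a rough illustration of the last item, capacity planning can start from something as simple as fitting a trend to historical usage. The sketch below assumes one storage-usage sample per day and fits a straight line by ordinary least squares to estimate how many days remain before a given capacity is reached. The function name and the sample numbers are hypothetical, and a straight-line fit is a deliberate simplification: real workloads often show seasonality and step changes that call for richer models.

```python
def days_until_full(usage_tb, capacity_tb):
    """Estimate days until `capacity_tb` is reached, from daily usage samples.

    `usage_tb` holds one sample per day, oldest first. Returns None when the
    fitted trend is flat or shrinking, since no exhaustion date can be projected.
    """
    n = len(usage_tb)
    if n < 2:
        raise ValueError("need at least two daily samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(usage_tb) / n
    # Ordinary least-squares slope and intercept, with x = day index 0..n-1.
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage_tb))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    day_full = (capacity_tb - intercept) / slope
    # Convert from an absolute day index to days after the latest sample.
    return max(0.0, day_full - (n - 1))


# Hypothetical cluster: growing 2 TB per day, 128 TB of usable capacity.
remaining = days_until_full([100, 102, 104, 106, 108], capacity_tb=128)
print(f"Projected days until full: {remaining:.1f}")
```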

Overview of a Performance Management Strategy

Architecture of a performance management platform for the big data stack.

Suffice it to say, distributed big data stacks come with many inherent challenges. To address these challenges and tame highly distributed big data deployments, organizations need an approach to application performance management that delivers all of the following:

  • Full data stack collection: To answer questions such as "What caused this application to fail?" or "Will this application meet its SLA?", monitoring data from every level of the stack is necessary. This includes SQL queries, execution plans, data pipeline dependency graphs, and logs from the application level; resource allocation and wait-time metrics from the resource management and scheduling level; and actual CPU, memory, and network usage metrics from the infrastructure level, among other sources. Collecting such data in a non-intrusive or low-overhead manner from production clusters remains a major technical challenge, but one that the database and systems community is actively addressing.
  • Event-driven data processing: Large deployments of the big data stack can include over 500 nodes and run hundreds of thousands of applications every day across ETL, BI, data science, and streaming systems. These deployments generate tens of terabytes of logs and metrics every day. This monitoring data introduces two challenges, variety and consistency, which are described in the two sub-items below. To solve both, the data processing layer has to be based on event-driven processing algorithms whose outputs converge to the same final state, so that the end user gets the same insights irrespective of the timeliness and order in which the monitoring data arrives.
      • The variety challenge: The monitoring data collected from the big data stack covers the full spectrum from unstructured logs to semi-structured data pipeline dependency DAGs to structured time-series metrics. Stitching this data together to create meaningful and usable representations of application performance is a nontrivial challenge.
      • The consistency challenge: Monitoring data has to be collected independently and in real time from the various moving parts of the multiple distributed systems that make up the big data stack. Thus, no prior assumptions can be made about the timeliness or order in which the monitoring data arrives at the processing layer.
  • Machine learning-driven insights and policy-driven actions: Collecting and storing all of the monitoring data in a single place opens up opportunities to apply statistical analysis and learning algorithms to it. These algorithms can generate insights that, in turn, can be applied manually by the user or automatically based on configured policies to address the performance requirements identified earlier.
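The convergence property asked of event-driven processing can be made concrete with a small example. One common way to get it (an assumption here, not something the article prescribes) is to update state only through operations that are commutative and idempotent, such as set insertion, min, and max. The Python sketch below uses a hypothetical event format and hypothetical status names; it folds monitoring events into per-application state and then checks that a shuffled stream with duplicates produces exactly the same state as in-order delivery.

```python
import random

# Assumed lifecycle order; a later stage always wins over an earlier one.
STATUS_RANK = {"SUBMITTED": 0, "RUNNING": 1, "FINISHED": 2}


def apply_event(state, event):
    """Fold one monitoring event into the per-application state.

    Every update is commutative and idempotent (set add, min, max), so the
    result does not depend on arrival order or on duplicate deliveries.
    """
    app = state.setdefault(event["app_id"], {
        "status": "SUBMITTED",
        "first_ts": event["ts"],
        "last_ts": event["ts"],
        "peak_mem_mb": 0,
        "event_ids": set(),
    })
    app["event_ids"].add(event["event_id"])
    app["first_ts"] = min(app["first_ts"], event["ts"])
    app["last_ts"] = max(app["last_ts"], event["ts"])
    app["peak_mem_mb"] = max(app["peak_mem_mb"], event.get("mem_mb", 0))
    if STATUS_RANK[event["status"]] > STATUS_RANK[app["status"]]:
        app["status"] = event["status"]


def process(events):
    """Apply a stream of events to an empty state and return the result."""
    state = {}
    for event in events:
        apply_event(state, event)
    return state


# Hypothetical events for one application.
events = [
    {"event_id": "e1", "app_id": "app-1", "ts": 100, "status": "SUBMITTED"},
    {"event_id": "e2", "app_id": "app-1", "ts": 105, "status": "RUNNING", "mem_mb": 512},
    {"event_id": "e3", "app_id": "app-1", "ts": 160, "status": "RUNNING", "mem_mb": 2048},
    {"event_id": "e4", "app_id": "app-1", "ts": 300, "status": "FINISHED"},
]

in_order = process(events)

# Simulate late, duplicated, out-of-order delivery.
replayed = events + events[1:3]
random.Random(7).shuffle(replayed)
assert process(replayed) == in_order
```

The design choice worth noting is that nothing in the state is a running count or an overwrite of "the latest value seen"; both would make the result depend on arrival order or on duplicates.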


The modern big data stack faces many unique performance management challenges. These challenges exist at the individual application level, as well as at the workload and cluster levels. To solve these problems at every level, a performance management strategy needs to offer full data stack collection, event-driven data processing, and machine learning-driven insights with policy-driven actions. As organizations look to get more and more out of their big data stack, including the use of artificial intelligence and machine learning, it's imperative that they adopt a performance management approach that provides these key pillars.
