Monitoring Performance Across a Data Pipeline
Troubleshooting performance issues that occur in a real-time big data pipeline.
In my last post, I outlined a framework to monitor the performance of data processing frameworks like Apache Storm, Spark, Kafka, etc. These data frameworks are the building blocks for data-first applications and data pipelines. Now, we turn our attention to monitoring and troubleshooting across a real-time data pipeline. As discussed in previous articles, monitoring data pipelines is hard for a number of reasons, especially when it comes to correlating common concerns across different components in a pipeline. To recap, in a data pipeline you might be pumping data into Kafka from your application servers, processing it with Storm or Spark in real time, then moving it into a data store like HDFS/S3, Cassandra, or Elasticsearch.
The primary concern for any data processing pipeline is the health of the data flowing through the system. The overall health of a pipeline should be evaluated as a combination of multiple attributes such as throughput, latency, and error rates. Each of these attributes may depend on a combination of metrics and measures that are specific to each pipeline and to the respective stages within it. For example:
A real-time search index generation pipeline leveraging Elasticsearch might measure indexedDocsPerSec and avgDocSize as throughput metrics, indexFreshness as a latency measure, and indexErrorCount as an error measure. In an e-commerce analytics pipeline, throughput measures may include checkoutsPerMinute, a latency measure may be avgCheckoutDuration, and an error metric may be abandonedCartsPerMinute.
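One way to make that mapping explicit is to tag each metric with the concern it represents. The following is only a minimal sketch: the metric names come from the examples above, while the `Metric` class, the `group_by_concern` helper, the pipeline names, and the "checkout" stage name are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three primary health concerns discussed in this article.
CONCERNS = ("throughput", "latency", "error")


@dataclass(frozen=True)
class Metric:
    """A metric name annotated with its pipeline, stage, and concern."""
    pipeline: str
    stage: str
    name: str
    concern: str

    def __post_init__(self):
        # Reject anything that is not one of the primary concerns.
        if self.concern not in CONCERNS:
            raise ValueError(f"unknown concern: {self.concern!r}")


def group_by_concern(metrics):
    """Return {pipeline: {concern: [metric names]}}, keeping input order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for m in metrics:
        grouped[m.pipeline][m.concern].append(m.name)
    return {pipeline: dict(concerns) for pipeline, concerns in grouped.items()}


# Pipeline and stage labels here are illustrative assumptions.
METRICS = [
    Metric("search-index", "elasticsearch", "indexedDocsPerSec", "throughput"),
    Metric("search-index", "elasticsearch", "avgDocSize", "throughput"),
    Metric("search-index", "elasticsearch", "indexFreshness", "latency"),
    Metric("search-index", "elasticsearch", "indexErrorCount", "error"),
    Metric("ecommerce-analytics", "checkout", "checkoutsPerMinute", "throughput"),
    Metric("ecommerce-analytics", "checkout", "avgCheckoutDuration", "latency"),
    Metric("ecommerce-analytics", "checkout", "abandonedCartsPerMinute", "error"),
]

if __name__ == "__main__":
    for pipeline, concerns in group_by_concern(METRICS).items():
        for concern, names in concerns.items():
            print(f"{pipeline} / {concern}: {', '.join(names)}")
```

With metrics annotated this way, a dashboard for a given pipeline and concern can be built by selecting on the tags rather than by hunting through every metric a component emits.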
However, regardless of the business application, the common concerns are still throughput, latency, and error, and when you wish to inspect the health of the pipeline, these are the primary concerns you should care about. Unless you instrument, annotate, and organize your telemetry accordingly, it can be very difficult to separate your primary concerns from other infrastructure metrics such as CPU utilization, disk space, and so forth.
Let’s take an example pipeline that takes user click events and generates an analytics report at the end. It might look something like the figure below:
A data pipeline that processes user click events.
In a nutshell, clickstreams captured from the web are sent to a load balancer (ELB), then to a REST gateway (Play app), a message broker (Kafka), a data processor (Spark), and a data store (Elasticsearch), and the results are finally served to the user through a web app (Rails). Let’s say that you notice a report either isn’t being generated correctly or the data looks wrong. Where do you start the debugging process? How long will it take to figure this out? How many places would you have to look to solve the problem?
The right approach is to look at your primary health concerns and drill down from there. Let’s take throughput as an example. If you have all the throughput metrics for this pipeline laid out in front of you in logical order, you’d be able to detect at which stage throughput became abnormal and start debugging from there. The same is true for your latency metrics across the pipeline or for error rates. But that’s often easier said than done. If you’re like most organizations, the tools you have available often look something like this:
A sample collection of monitoring tools that span a typical data pipeline
Where are your throughput metrics? Probably somewhere in there, but how long is it going to take you to find them, much less correlate them? Alternatively, if you were to collect the relevant service and system graphs for each cluster into one dashboard, perhaps using a tool like Ganglia or one of the various commercial graphing and dashboard offerings, it would still be difficult to distinguish signal from noise. If, on the other hand, you go one step further and organize your data by your primary data pipeline concerns, then figuring out where to start looking can be almost trivial. In the example below, where we collect and organize the throughput metrics for our example data pipeline, spotting the problem is relatively easy.
Key throughput metrics organized together in order for the data pipeline
In this case, the Play application component is the most likely source of the problem because all the throughput metrics upstream look normal and all of the throughput metrics downstream show the same abnormal behavior. The Play cluster is the stage where the abnormality starts. You can do the same with concerns such as latency, error rate, etc.
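The reasoning above (walk the stages in pipeline order and stop at the first one whose throughput looks abnormal) can be sketched in a few lines. This is only an illustration under stated assumptions: the stage names mirror the example pipeline, but the baseline values, the observed values, the 30% tolerance, and the function names are all hypothetical, and a real system would likely use a more robust anomaly test than a fixed relative deviation.

```python
# Stages of the example pipeline, listed in the order data flows through them.
STAGES = ["elb", "play", "kafka", "spark", "elasticsearch", "rails"]


def is_abnormal(observed, baseline, tolerance=0.3):
    """True if observed deviates from baseline by more than the tolerance.

    The tolerance is a relative fraction (0.3 means 30%); it is an
    assumption made for this sketch, not a recommended threshold.
    """
    if baseline == 0:
        return observed != 0
    return abs(observed - baseline) / baseline > tolerance


def first_abnormal_stage(stages, observed, baseline, tolerance=0.3):
    """Return the first stage, in pipeline order, with abnormal throughput.

    Each stage is compared only against its own baseline, so stages may
    report throughput in different units. Returns None if all look normal.
    """
    for stage in stages:
        if is_abnormal(observed[stage], baseline[stage], tolerance):
            return stage
    return None


if __name__ == "__main__":
    # Made-up numbers: every stage normally handles about 1000 events/sec.
    baseline = {stage: 1000.0 for stage in STAGES}
    observed = {
        "elb": 990.0,            # upstream looks normal
        "play": 400.0,           # throughput drops here
        "kafka": 410.0,          # downstream stages inherit the drop
        "spark": 405.0,
        "elasticsearch": 400.0,
        "rails": 395.0,
    }
    print(first_abnormal_stage(STAGES, observed, baseline))
```

Because downstream stages inherit an upstream drop, returning the first abnormal stage rather than every abnormal stage points at where to start debugging, which is the purpose of ordering the metrics by pipeline stage.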
This is all great, but it’s a non-trivial amount of work to collect data for every single component (six in this simple example), figure out which metrics map to which concerns, and finally put them together in multiple dashboards, one for each pipeline and concern. Most companies would prefer to focus on building their core business applications rather than spend person-months of engineering effort building the data collection plugins, configuration scripts, and dashboards required. Then there is the need to invest time in fine-tuning the system when the initial iterations inevitably don't provide the information needed. As a result, many companies opt to monitor just a few metrics, perhaps at the output end of the data pipeline, and call it a day. When things do go wrong, multiple engineers will inevitably spend hours manually trudging through multiple dashboards, logs, alerts, etc., trying to figure out what really should have been obvious had the data been organized well in the first place.
These days, real-time data pipelines are becoming more and more mission critical. An interruption to the pipeline may very well affect the bottom line. For instance, what is the cost of interrupting a fraud detection pipeline? How about placing ads incorrectly due to bad data? What if you gave your users stale recommendations and alerts? If you have a product or service that depends on the data processed through one or more data pipelines, you can't afford not to get a handle on this. The effort you invest in organizing your data pipeline telemetry appropriately will be well worth the cost. If you’d like help doing that, these are precisely the sorts of problems that the OpsClarity Platform was built to solve.
Posted on behalf of Alan Ngai.