Large-scale data processing has moved into a new age of sophistication and the greatest evidence is the increasing requirement for processing data in real time. The drivers for this move are many, but the basic impetus can be understood by simply recalling the classic Time-Value-of-Data concept and how the value of data is at its peak within the first few minutes of creation, then rapidly tapers off over time. Real-time data processing is an essential capability in more and more use cases. Fortunately, there are numerous real-time data processing frameworks available, such as Apache Spark, Storm, Heron, Flink, Apex, Kafka Streams and others. But while each framework provides a new set of important possibilities, it also adds substantial complexities and operational challenges.
In this post, we will discuss the processing of streaming data using one of the most popular data processing technologies, Apache Spark. This discussion will go in-depth into how you should think about monitoring your Spark data pipelines. But first, let’s step back and look at the basics.
Spark is a general purpose engine for fast, large scale data processing, that is seeing wide adoption in the batch and streaming worlds. Before understanding the best ways to monitor Spark when running applications, it is important to understand the execution model and the various components in the Spark architecture.
- Spark Manager/Master manages cluster resources. It can be run in one of the following modes:
- Standalone: A cluster manager included with Spark for easy setup.
- Mesos: Abstracts system resources and makes them available as an elastic distributed system.
- Yarn: The default resource manager starting Version 2 of Hadoop.
- Spark Worker: In standalone mode, the workers are processes running on individual nodes that manage resource allocation requests for that node and also monitor the executors.
- Spark Executor: An executor is a process that is launched alongside the worker for every application that is created in the cluster. It does the actual work of running tasks inside of it.
- Spark Driver: A process that creates the SparkContext and drives the overall application from start to finish.
The Spark Driver creates a Spark Context and communicates with the Manager/Master to acquire resources on the worker nodes. The workers then create executor processes locally for every spark application. Once this is complete, the driver directly communicates with the Executors to get the work done. The workers are constantly monitoring the Executors for failures. A DAGScheduler in the driver internally breaks down the computation plan into stages and tasks which finally gets executed on the executors. The executors for an application are alive as long as the driver is alive.
As you can see, there are many components coming together to make a spark application work. If you’re planning to deploy Spark in your production environment, you want to make sure you can monitor the different components, understand performance parameters, get alerted when things go wrong and know how to troubleshoot issues. To achieve the level of confidence and maturity so you can actively deploy and run in your production environment, you need to be thinking more than just collecting and graphing various metrics. This blog and the next one in this series will attempt at describing the challenges and best practices to monitor.
Why Is Monitoring Spark Streaming Challenging?
The question seems simple, but is probably the most difficult question you need to answer. The Spark UI provides a basic dashboard but this is not sufficient if you’re thinking of moving to a production ready setup. Without any insight into Spark’s inner workings and its components, you’re flying blind. So, what should your monitoring setup look like?
Let’s break down the problem into more understandable chunks. You should think of the problem as monitoring at three levels:
- The spark infrastructure consisting of
- Applications that run on the Spark infrastructure—
- Underlying hosts–Disks, CPU, Network.
Once you have broken down the monitoring problem into the three levels shown above you will also see how these are interdependent. Anything that affects a host, say a disk failure, will propagate the failure through the spark infrastructure and finally to the application. Setting up a correlated view of health across the three levels is critical. You need a monitoring system that can understand these interdependencies and help you visually correlate the problem on the application to an issue on the infrastructure and finally to an issue on the host. Without this sophisticated correlation, you’ll be spending hours trying to debug the root cause of an issue when things go awry.
As an example of a software service that would aid this process, OpsClarity has developed a concept of health of a service that automatically discovers the entire service topology of data pipeline and applications. The interface has red, green, and orange circles that correspond to health. It makes it easier to respond to problems quicker, which is the point of a monitoring system. See the image below:
How Should I Configure My Monitoring?
Spark provides metrics for each of the above components through different endpoints. For example, if you want to look at the Spark driver details, you need to know the exact URL, which keeps changing over time–Spark keeps you guessing on the URL. The typical problem is when you start your driver in cluster mode. How do you detect on which worker node the driver was started? Once there, how do you identify the port on which the Spark driver exposes its UI? This seems to be a common annoying issue for most developers and DevOps professionals who are managing Spark clusters. In fact, most end up running their driver in client mode as a workaround, so they have a fixed URL endpoint to look at. However, this is being done at the cost of losing failover protection for the driver. Your monitoring solution should be automatically able to figure out where the driver for your application is running, find out the port for the application and automatically configure itself to start collecting metrics.
For a dynamic infrastructure like Spark, your cluster can get resized on the fly. You must ensure your newly spawned components (Workers, executors) are automatically configured for monitoring. There is no room for manual intervention here. You shouldn’t miss out monitoring newer processes that show up on the cluster. On the other hand, you shouldn’t be generating false alerts when executors get moved around. A general monitoring solution will typically start alerting you if an executor gets killed and starts up on a new worker–this is because generic monitoring solutions just monitor your port to check if it’s up or down. With a real time streaming system like Spark, the core idea is that things can move around all the time.
Consider a check pointed Spark application. In case of a driver failure, the application can come back with the same application context alongside another worker. In this case, the metric collection, port checks and other monitors that were configured should propagate to the new driver node and keep the configuration intact. Would your monitoring solution be able to handle such scenarios? Simply collecting metrics and displaying them on dashboards will not work for data centric systems like Spark.
Topology discovery and auto-configuration features detect and configure Spark's dynamically changing infrastructure in real-time. No manual intervention is required for new workers or executors.
You need to be 100 percent confident of your monitoring setup and understand how your monitoring would behave with such a dynamic infrastructure. Make sure you think through each of the cases stated above for monitoring. Else, when hell breaks loose, you lose.
In the next part of Monitoring Apache Spark series, we will take a look at the key Spark metrics, their behavior patterns and how to effectively troubleshoot a complex and distributed Spark Streaming application.