Building a Fault Tolerant Data Pipeline
See what goes into building fault tolerant infrastructure so your system can be prepared to stand up to shocks.
Join the DZone community and get the full member experience.Join For Free
Giving end users immediate access to operational data for real-time business intelligence can help drive better decisions. This requires us to gather data fast and analyze it in real-time. Data sources are varied and the rate of flow of data may or may not be constant. Our system needs to be robust and should be able to take care of such scenario. This document talks about architectural consideration for working on such use case.
Creating the Pipeline
Let's try to build data pipeline for log analysis. Say we have a distributed application that supports millions of user. As is the case for any application system, sometimes we have spikes in user activity. So our log analytics framework needs to take care of such scenarios.
Fig1: Example of a Distributed System
For the sake of this problem, let's assume we choose Logstash as our forwarder, Kafka as the queue, Spark as the analytics engine and Elasticsearch as our datastore. Our architecture looks something like:
Fig2: Basic Architecture of log analytics framework
We now have data pipeline in place. It takes cares of spikes thanks to Kafka. But still, we need to work on fault tolerance. We still have point of failures. Let’s address these issues because the failure of a particular module should not result in data loss.
Failure at Logstash
Logstash is a forwarder that sends data from log files to Kafka. It has to keep track of the last line it read to be able to forward any newly written log line. So, data loss at this place can be avoided. Even if Logstash goes down, it has a pointer to the last line it read from the log file. But we still need to have the monitoring capacity to detect such shutdowns.
Failure at Kafka:
Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later on became a part of the Apache project. Kafka is fast, scalable, and distributed in nature by its design as partitioned and replicated commit log service.
It is important to note Kafka can be one major component that can be leveraged in fault tolerance, as Kafka can hold data for configurable time intervals. Even if the pipeline ahead of Kafka goes down, we will have data stored in the Kafka queue.
Before going to Zookeeper, let us first understand how service discovery works:
- Service discovery will maintain a list of active services and their statuses.
- The service registry is a key part of service discovery. It is a database containing the network locations of service instances. A service registry needs to be highly available and up to date. Clients can cache network locations obtained from the service registry. However, that information eventually becomes out of date and clients become unable to discover service instances. Consequently, a service registry consists of a cluster of servers that use a replication protocol to maintain consistency.
- In service discovery, the network location of a service instance is registered with the service registry when it starts up. It is removed from the service registry when the instance terminates. The service instance’s registration is typically refreshed periodically using a heartbeat mechanism.
Kafka can have multiple brokers, and Zookeeper can keep track of active brokers. So, all communication to Kafka should go through Zookeeper.
Fig. 3: Zookeeper’s functions in Kafka.
It is still better to have a standby Kafka installation.
Failure at Spark
Spark can be used for alerting and outlier detection. In Spark, every component other than the job manager is fault tolerant. We need to have a hot standby master node. Here, service discovery is useful, too.
Fig 4. Probable Spark deployment
We should also still have a cold standby Spark cluster.
Failure at Elasticsearch
The deployment of Elasticsearch can look something like:
Fig. 5: Probable Elasticsearch deployment
For stability and the best performance of the Elasticsearch cluster, based on Elasticsearch's recommendation, we should keep a spare node as a master node. It can be done by making “data=false” and “master=true” in the config file.
Keep a couple of non-master and non-data nodes for serving HTTP requests. That will also help if master node goes down.
So deployment will look like:
Fig. 6 : Fault tolerant deployment
Though service discovery is based on heartbeat, we're still better off with monitoring .
Monitoring and Fault Detection
For monitoring system status, we can use Nagios, Munin, and Monit based on our requirements. Nagios can send alerts in case of system failure, but it will not able to restart services and the graph is not informative. Munin can send alerts in case of system failure and the system info graphs are very useful, but it will not able to restart services. With Monit, we can get alerts, and we will also be able to rectify some of the common issues while also starting services.
Opinions expressed by DZone contributors are their own.