“Fast data” is a huge buzzword in the current landscape. It's simply data that is not at rest. And since the data is not at rest, the traditional techniques for working with data at rest are no longer efficient or relevant. The importance of streaming has grown because it provides a competitive advantage: it reduces the time gap between data arrival and analysis. Business enterprises demand availability, scalability, and resilience as implicit characteristics of their applications, and microservice architectures now also have to handle real-time, fast data.
The integration of fast data processing tools and microservices has led to the fast data application. These applications process and extract value from data in near real-time. Technologies such as Apache Spark, Apache Kafka, and Apache Cassandra have grown to process that data faster and more effectively. These applications are unearthing real-time insights to drive profitability. But monitoring and managing the overall system is a big challenge. Traditional techniques fail because they were designed for monolithic applications and cannot effectively manage the new distributed, clustered, and interconnected systems.
So What's the Issue?
The main challenge is how to ensure the continued health, availability, and performance of these modern, distributed fast data applications.
Let's look in a little more detail at how these applications pose a challenge. An application is very likely to be streaming in data from more than a dozen sources. These sources could be hundreds of individual, distributed microservices, data sources, and external endpoints. To process that data, we put technologies such as Apache Spark, Apache Mesos, Akka, Apache Cassandra, and Apache Kafka (together known as the SMACK stack) in place to form a powerful data processing tool.
Now comes the point of contention in the system. These technologies are varied, distributed, and complex. They pose the following bottlenecks:
Evolving systems: The stack grows rapidly, and as the number of technologies keeps growing, domain knowledge becomes scarce. Understanding the business value behind each of them is a huge task.
Data pipeline: Having an in-depth understanding of the data pipeline is vital. Each stage takes input from one stage and produces output for another, and a failure can occur at any point. This requires computing metrics at each stage of the pipeline.
Architecture: Any manual process for setting up or monitoring the system has to be strict, because the architecture of the system is highly dynamic.
Complexity in interconnection: An error in one part of the system could be caused by a bottleneck in some other part, but this is very difficult to identify and debug.
Distribution and clusters: Each component in the system is deployed in a distributed manner. Imagine multiple frameworks each being deployed in a distributed manner over multiple nodes, working together and intertwined in the application. Debugging with traditional logging systems could be a nightmare. Correlating issues to understand dependencies and analyze root causes is difficult.
An overwhelming amount of information: With each framework in the system generating its own metrics, we have a flood of information.
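To make the "metrics at each stage" point concrete, here is a minimal sketch in Python. It is not tied to Spark, Kafka, or any other framework mentioned above; the StageMetrics class, the run_stage helper, and the two toy stages are all hypothetical names invented for illustration. The idea is simply that every stage keeps its own counters, so a failure can be traced to the stage where it happened.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class StageMetrics:
    """Counters collected for one pipeline stage (hypothetical structure)."""
    name: str
    records_in: int = 0
    records_out: int = 0
    errors: int = 0
    seconds: float = 0.0


def run_stage(metrics: StageMetrics, fn: Callable, records: Iterable) -> List:
    """Apply fn to each record, recording counts, failures, and elapsed time."""
    out = []
    start = time.perf_counter()
    for record in records:
        metrics.records_in += 1
        try:
            result = fn(record)
        except Exception:
            # A failed record is counted and skipped rather than stopping the stage.
            metrics.errors += 1
            continue
        out.append(result)
        metrics.records_out += 1
    metrics.seconds += time.perf_counter() - start
    return out


# Two toy stages: parse text into integers, then double them.
parse = StageMetrics("parse")
double = StageMetrics("double")

raw = ["1", "2", "oops", "4"]
parsed = run_stage(parse, int, raw)
doubled = run_stage(double, lambda x: x * 2, parsed)

for m in (parse, double):
    print(m.name, m.records_in, m.records_out, m.errors)
```

Because the bad record is counted against the "parse" stage only, the drop in volume seen by the "double" stage can be explained from the upstream counters instead of by digging through logs.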
Fast Data Monitoring vs. Traditional Monitoring
In a monolithic application, when an error occurs, the resolution is based on the monolithic design (consisting of the database layer, application layer, and front-end web layer). We get a clear call stack from beginning to end, allowing us to find the error quickly because the flow is well defined and deterministic.
The challenge posed by fast data applications is entirely different. Here, the system is often asynchronous and composed of components like microservices, data frameworks, telemetry, machine learning, the streaming platform, etc. There are a few application performance monitoring (APM) tools available, but they have been rendered useless because of an inability to monitor asynchronous, streaming systems running on distributed clusters.
Various other tools, such as infrastructure monitoring, log analysis, and network performance monitoring tools, have also failed because they are highly purpose-built and do not account for the architectural interdependencies of these systems.
What Is the Solution and How Should It Look?
What we need is an extremely insightful and powerful visualization layer that can understand and analyze the end-to-end health of the system, which includes the availability and performance of each of the app components. Now that we have analyzed the potential issue and we know the abstract solution, let's try to describe the solution at a high level.
The first issue we could resolve is the overwhelming amount of information. For every component in the fast data application, we could organize the information into a hierarchy of concerns. These concerns might be:
Data health: Is the throughput of the component what we expected? Is the process meeting the timeframe requirements? Is the data stream posing any problems?
Dependency health: Are the dependent components (like memory cache or endpoints) healthy, or are they crossing the threshold?
Service health: Is the component able to distribute and rebalance the workloads effectively and efficiently?
Application health: Are the operating parameters under the normal thresholds or exceeding the values that can potentially affect the system adversely?
Topology health: If the resources in the distributed system are optimally utilized, are the performance parameters for the topology healthy?
Node system health: Are key parameters like load, CPU, memory, net I/O, disk I/O, and free disk space operating normally?
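The hierarchy of concerns above can be sketched as a simple roll-up: each concern holds a few threshold checks, and the concern's status is the worst status among them. The Python below is only an illustration under assumptions; the metric names, the sample values, and the warning and critical thresholds are all made up, and the check and rollup helpers are hypothetical.

```python
from enum import IntEnum
from typing import Dict


class Status(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


def check(value: float, warn: float, crit: float) -> Status:
    """Classify a metric where higher values are worse."""
    if value >= crit:
        return Status.CRITICAL
    if value >= warn:
        return Status.WARNING
    return Status.OK


def rollup(concerns: Dict[str, Dict[str, Status]]) -> Dict[str, Status]:
    """Reduce each concern to the worst status among its checks."""
    return {name: max(checks.values(), default=Status.OK)
            for name, checks in concerns.items()}


# Sample readings for one component (values and thresholds are invented).
concerns = {
    "data health": {
        "consumer_lag_records": check(12000, warn=10000, crit=50000),
    },
    "dependency health": {
        "cache_latency_ms": check(3, warn=20, crit=100),
    },
    "node system health": {
        "cpu_percent": check(45, warn=80, crit=95),
        "disk_used_percent": check(97, warn=85, crit=95),
    },
}

summary = rollup(concerns)
overall = max(summary.values())

for name, status in summary.items():
    print(name, status.name)
print("component", overall.name)
```

Instead of reading every raw metric, an operator starts from one status per component, drills into the concern that is unhealthy, and only then looks at the individual readings.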
Other than the organization of information, there are other dimensions of monitoring that should be considered, such as:
Deep visibility: Get real-time statuses for live system insights.
Domain-specific: Identify the most important metric to monitor the components and add custom metrics.
Automatic monitoring: Components should be automatically identified and monitored.
Real-time integrated view: View the health of an entire application, i.e. all components, in a single view and in real-time.
Quick troubleshooting: Minimize the downtime and repair time of the system. The system should be smart enough to learn how to recover from failures that occurred previously.
In the next blog post, I will discuss one of the solutions for the end-to-end monitoring of fast data applications.