Apps Depend on Real-Time Streaming Data: Here’s How to Manage Them
Data is now king. And just like with kings, knowing the right processes for dealing with data can make or break you.
Virtually every industry segment is experiencing a growing need for real-time data processing. In fact, a white paper by IDC projected that, by 2025, 25% of all data will be generated and will need to be processed in real time.
At the same time, the ability to run real-time apps at scale on readily available open-source technology is powering a new generation of real-time applications. What’s more, real-time apps can be built quickly, whether they’re simple apps for collecting data, machine learning apps that deliver insights into customer behavior, or apps that provide views into IoT and device data.
That’s great so far, right? Well, yes, but now it’s time to step back and consider the challenges of managing streaming and real-time apps, because meeting those challenges determines the performance of your big data operations. And much more than system performance is at stake: many apps, including streaming apps, are mission-critical, and if they fail, personal safety, the security of systems, or the viability of an entire organization could be at risk.
Everything Hinges on Data
If there has ever been one pervasive, unstoppable force on enterprise networks, it’s data: the seemingly exponential increase in data flowing into organizations every year. An organization can measure how much data it ingested this week, but nobody can yet fathom how much will be flowing in even a week from now. Data from ever more external sources and devices continuously flows into a growing number of platforms where it needs to be analyzed. That data powers databases, is sent to storage or NoSQL systems, or is put into queues to drive workflows. Just a few examples:
In IoT, anomaly detection and prevention are mission-critical, enabling network personnel to spot failing devices on the network.
In healthcare, data coming from devices can quickly reveal a patient’s condition and allow professionals to administer the right medicine at the right time.
In manufacturing, real-time anomaly detection identifies when processes are going bad before they lead to millions of dollars in losses.
In security, fraud detection can spot intrusions in real time, using logs and other security information from all the machines in a cluster.
In e-commerce, up-to-the-minute recommendations based on customer sentiment can increase conversion rates significantly.
Streaming Data Architecture
Any architecture for application performance management (APM) needs application-, user-, service-, compute-, processing-, and storage-level visibility. So all the data coming from sensors or database events flows into a stream store. A well-known example of a stream store is Apache Kafka, an open source stream-processing software platform.
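To make that concrete, here is a minimal sketch of feeding events into Kafka, assuming a broker on localhost and a hypothetical "sensor-events" topic (the kafka-python client is used for illustration):

```python
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

# Connect to a (hypothetical) local broker and serialize events as JSON.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A sensor reading flowing into the stream store.
event = {"device_id": "sensor-42", "temperature": 71.3, "ts": time.time()}
producer.send("sensor-events", value=event)
producer.flush()  # block until the broker has acknowledged the event
```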
Examples of streaming interactive stores are Apache HBase, an open source, non-relational, distributed database, and Apache Kudu, an open source column-oriented data store. These stores enable data to be stored and accessed in real time, not in a continuous fashion but in a get-and-put model. Other systems, such as Spark Streaming, an open source cluster computing framework, and Flink, an open source stream processing framework, are popular for collecting data and running analyses on it in real time. The data can be sent to dashboards, new databases, or new ecosystems.
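As a sketch of the compute side, the following PySpark job reads the hypothetical "sensor-events" topic from the producer example above and computes per-device averages over one-minute windows (it assumes the spark-sql-kafka connector package is available to Spark):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("sensor-rollup").getOrCreate()

# Shape of the JSON events produced in the earlier sketch.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("ts", DoubleType()),
])

# Continuously read raw records from the Kafka topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "sensor-events")
       .load())

# Parse the JSON payload and compute one-minute average temperatures.
events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*")
          .withColumn("event_time", col("ts").cast("timestamp")))

rollup = (events.groupBy(window("event_time", "1 minute"), "device_id")
          .agg(avg("temperature").alias("avg_temp")))

# Send results to the console; a real app would target a dashboard or store.
query = rollup.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```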
But take caution: When you create a streaming application that combines storage and compute systems, it quickly becomes apparent that these are all individually distributed systems that are connected by the application. If applications are not tuned, managed, or monitored, they will not be reliable.
For most practical purposes, that means the results an application is supposed to generate take too long to derive, so what you devised to be a real-time application is not actually real-time! Understanding the cause of these problems can be a challenge, especially across distributed systems such as Kafka or Spark Streaming. Perhaps the data has been poorly partitioned, or has not been partitioned to take advantage of parallel processing (see the sketch below).
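Partitioning is often the first thing to check. As an illustration, keying Kafka records by device ID keeps each device’s events on one partition, so consumers can process partitions in parallel while preserving per-device ordering (the topic and key here are hypothetical):

```python
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Kafka's default partitioner hashes the key, so all events for a given
# device land on the same partition; distinct devices spread across
# partitions and can be consumed in parallel.
producer.send("sensor-events", key="sensor-42", value={"temperature": 71.3})
producer.flush()
```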
Configuration parameters are another problem. There are many configuration parameters in these systems, and especially parameters that affect the interaction of one system with another. All the configurations must be “well-tuned,” for lack of a better term. When the application is created, and as its scale grows over time, tuning can become extremely challenging. Changing one parameter can affect another. And it’s a virtual guarantee that resource contention and bottlenecks will occur, because a number of resources, and different types of resources, are involved: from disk and storage to network bandwidth, to compute and memory.
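As a purely illustrative example (the values are placeholders, not recommendations), here is how a handful of interacting Spark parameters might be set for a streaming job. Note how executor sizing interacts with the Kafka partition count and the per-partition rate limit:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuned-stream")
         # Executor sizing interacts with Kafka partitioning: fewer topic
         # partitions than total executor cores leaves cores idle.
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "2")
         # Backpressure throttles ingest when processing falls behind.
         .config("spark.streaming.backpressure.enabled", "true")
         # Caps records pulled per Kafka partition per second (DStream API).
         .config("spark.streaming.kafka.maxRatePerPartition", "10000")
         .getOrCreate())
```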
Looking at the problem through another lens, you’ll find that no single tool can bring in all the metrics and logs from these individual systems—metrics and logs that are related to the real-time streaming applications—to get a top-down, application-centric view of why, for example, data is not being generated in real time, why an application stalls, or where data loss is happening.
And the kicker: without advanced analytics or machine learning on your big data stack, it’s nearly impossible to get recommendations on where the root cause of an issue lies.
Full-Stack Intelligence
What application, data, and DevOps teams need are fully automated solutions that can uncover these problems and even remediate them in real time. In short, they need Application Performance Management for the entire big data stack: an engine for delivering full-stack intelligence. With full-stack intelligence, operators can quickly pinpoint the root cause of a problem and be guided to its remedy. It allows you to answer the question, “Is the problem in the compute, the storage, or in the interaction of the two?” If you don’t know, it’s impossible to take remedial measures, such as applying policies on the individual systems.
Deriving full-stack intelligence involves collecting monitoring information, including metrics, logs, SQL execution plans, and other information, from every level of the big data stack. A platform for full-stack intelligence can be built around Kafka for data ingest, Spark Streaming for real-time compute, and a store such as Cassandra or Kudu for managing the state of the applications in real time.
In operation, all the metrics and logs from these systems are continuously streamed into the APM platform, which correlates data points in real time and applies machine learning and artificial intelligence algorithms to the data to help big data and DevOps teams optimize troubleshooting, as well as do capacity planning.
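As one small illustration of the kind of algorithm such a platform might apply (real platforms use far more sophisticated models), here is a rolling z-score detector that flags metric values deviating sharply from their recent history:

```python
import math
from collections import deque

class RollingZScore:
    """Flag metric values that deviate sharply from a sliding window."""

    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, x: float) -> bool:
        """Record a value; return True if it looks anomalous."""
        anomalous = False
        if len(self.values) >= 10:  # wait for a minimal history
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = math.sqrt(var)
            if std > 0 and abs(x - mean) / std > self.threshold:
                anomalous = True
        self.values.append(x)
        return anomalous

detector = RollingZScore()
for latency_ms in [12, 11, 13, 12, 11, 12, 14, 13, 12, 11, 95]:
    if detector.observe(latency_ms):
        print(f"anomaly: {latency_ms} ms")  # fires on the 95 ms spike
```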
But the key to this architecture is a single, consolidated view of the stack, with all data coming into a single platform. That’s the very basis for ensuring full-stack intelligence. The alternative—logging in to multiple systems and seeing multiple views, where indicators are not correlated with each other—is futile from the start.
That’s why APM, especially for streaming or real-time data, needs to deploy machine learning algorithms and artificial intelligence on data flows to automatically pinpoint root causes of errors and to identify how to remediate them.
An architecture for APM gets extra credit if it supports ‘automated actions’ – so called because they enable the platform to act automatically according to policies the operations team has set. In short, policies drive actions, which allow DevOps teams to become proactive with big data stack management.
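A minimal sketch of what policy-driven actions could look like, with hypothetical metric names, thresholds, and action names:

```python
# Each policy maps a metric and a threshold predicate to a named action.
POLICIES = [
    ("consumer_lag",        lambda v: v > 100_000, "scale_out_consumers"),
    ("executor_memory_pct", lambda v: v > 0.90,    "restart_executor"),
]

def evaluate(metrics: dict) -> list:
    """Return the remediation actions triggered by a metric snapshot."""
    return [action
            for name, predicate, action in POLICIES
            if name in metrics and predicate(metrics[name])]

print(evaluate({"consumer_lag": 250_000}))  # ['scale_out_consumers']
```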
Therefore, in your decision to “build or buy” an APM platform for real-time and streaming applications on the big data stack or cluster, consider again what the platform must deliver:
Full stack management, which moves beyond mere infrastructure monitoring graphs and logs.
Intelligence, built from integrated machine learning and predictive analytics to resolve problems.
Automation, enabling a move beyond reactive trouble-ticket escalations toward proactive operations.
That’s the recipe from a macro view. But to measure how effectively your platform will fulfill the vision of APM, especially for streaming and real-time applications, here’s a micro view of the requirements your APM platform must deliver:
Automatic root cause analysis of streaming app faults.
Anomaly detection, which rapidly detects and diagnoses unpredictable behavior.
Smart recommendations that make streaming apps faster and more resource efficient.
Proactive alerting and remediation of cluster problems caused by streaming apps, as well as identification of SLA impacts.
Single-pane-of-glass views for unified application and operations management.
If your APM platform delivers these capabilities, you’ll be equipped to manage the ever-growing volumes of data you ingest on the platform. That’s why, when the tenets of APM were defined, innovators knew the technology could not waver in managing unforeseeable increases in data, delivering continued high performance to meet internal or external SLAs, and providing near real-time responsiveness to prevent catastrophic system failures.