Application Performance Monitoring (APM) tools are used by ops teams to monitor code performance for apps written in a select few languages and runtimes, such as Java, .NET, PHP, Python, Node.js, and Ruby. The APM agents are plugged in when the application is deployed. As traffic flows through the application, the agents collect three key metrics:
- average response time (latency)
- calls per minute (load)
- errors per minute (error rate)
If the APM tool supports distributed transaction tracing, the agents also collect summary stats on remote calls flowing out of the code to common relational databases, message queues, and other application code clusters.
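As a rough illustration of the three metrics listed above, the sketch below computes them for one window of completed requests. The `Request` record and the `summarize` helper are hypothetical names used only for this example; real APM agents collect and aggregate this data in their own way.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one completed request (illustration only)."""
    timestamp_s: float  # completion time, seconds since epoch
    duration_ms: float  # time taken to serve the request
    is_error: bool      # True if the request ended in an error


def summarize(requests, window_s=60.0):
    """Return (average latency in ms, calls per minute, errors per minute).

    `requests` is assumed to hold every request completed in a window
    that is `window_s` seconds long.
    """
    if not requests:
        return 0.0, 0.0, 0.0
    minutes = window_s / 60.0
    avg_latency_ms = sum(r.duration_ms for r in requests) / len(requests)
    calls_per_min = len(requests) / minutes
    errors_per_min = sum(1 for r in requests if r.is_error) / minutes
    return avg_latency_ms, calls_per_min, errors_per_min
```

Calling `summarize` once per window for each application tier gives the latency, load, and error-rate time series described above.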
To understand how APM solutions work, it helps to look at the genealogy of APM. Before APM, developers relied primarily on code profilers, which were essentially limited to the pre-production phase of the software or application lifecycle. These profilers were extremely intrusive and led to significant resource overhead on the application being instrumented. This overhead is measured in terms of additional latency as well as increased consumption of CPU and memory. Because this kind of overhead was unacceptable in production environments, the first generation of APM tools, such as those from Wily Technology and Dynatrace, was not widely used for always-on production monitoring. Over the last few years, second-generation APM players such as AppDynamics and New Relic have created products that work well even in production environments.
APM is critical to monitoring “code-first” applications, where the application code is the main component of the entire application stack and also the one that changes most frequently. However, APM is extremely limited when it comes to monitoring a new breed of applications called “data-first” applications. Data-first applications have large-scale data processing and analytics at their core. Multiple industry segments, such as ad networks, marketing intelligence, IoT platforms, security analytics, and fraud detection solutions, have built their entire business on top of such applications. Architecturally, a data-first application can be divided into three distinct layers:
- Multiple distributed data frameworks – linked together to create a central, shared data infrastructure layer. This layer is typically organized as a data pipeline starting with an ingestion stage powered by a real-time message broker (e.g. Apache Kafka), followed by a stream processing engine (e.g. Apache Spark or Apache Storm) and terminating with data sinks such as NoSQL stores (e.g. Cassandra or HBase), indexing engines (e.g. Elasticsearch or Solr), and in-memory caches (e.g. Redis or Memcached).
- Business logic – to integrate the multiple data frameworks, written as microservices code.
- Elastic host and container infrastructure – to power both the data infrastructure and microservices layers. This infrastructure could be either in a public cloud, such as Amazon Web Services, or in an on-premises datacenter.
APM for Data-First Applications
APM tools come up woefully short when applied to the data-first world, given their inability to look beyond the code layer. The first problem is that they can’t see into the inner workings of distributed data frameworks. For example, detecting an Apache Kafka broker as a remote call is not good enough. A monitoring tool must discover Kafka clusters automatically, then collect Kafka performance metrics and detect anomalies in them without manual setup. To measure app health, it is vital to track consumer lag on a per-topic basis, where a topic represents the app-level data partition.
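Consumer lag is the gap between the newest offset written to a partition and the offset a consumer group has committed for it. The sketch below shows the per-topic arithmetic on plain dictionaries. The function name and input shapes are assumptions made for illustration; in practice the offsets would be fetched from the Kafka cluster.

```python
from collections import defaultdict


def lag_per_topic(end_offsets, committed_offsets):
    """Sum consumer lag over partitions, grouped by topic.

    Both arguments map (topic, partition) -> offset.
    Assumption: a partition with no committed offset is treated as
    completely unread, i.e. its committed offset is taken to be 0.
    """
    lag = defaultdict(int)
    for (topic, partition), end in end_offsets.items():
        committed = committed_offsets.get((topic, partition), 0)
        # Clamp at zero in case the two snapshots were taken at different times.
        lag[topic] += max(end - committed, 0)
    return dict(lag)
```

A lag that keeps growing for one topic means its consumers are falling behind the producers, even if every broker looks healthy.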
Similarly, alerting whenever there are under-replicated or offline partitions is critical to monitoring overall Kafka cluster health. These issues are magnified in more complex frameworks such as Apache Spark, where multiple distributed components (master, workers, drivers, executors, and so on) come together at runtime. APM vendors may argue that since many of these frameworks are written in Java or other JVM languages like Scala, the frameworks themselves can be instrumented by APM agents. However, this adds little value, given that the critical performance metrics are already exposed by these frameworks through common interfaces such as JSON-over-HTTP endpoints and JMX. There is also questionable value in understanding method-level details of code that the application developers did not write themselves. Finally, there are overhead and transaction explosion concerns to worry about.
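The partition checks mentioned above can be sketched as follows. A partition is under-replicated when its in-sync replica set is smaller than its assigned replica set, and offline when it has no active leader. The dictionary layout and the `partition_alerts` name are assumptions for illustration, not the output format of any particular Kafka client.

```python
def partition_alerts(partitions):
    """Return (partition name, problem) pairs for unhealthy partitions.

    Each item in `partitions` is assumed to be a dict with the keys
    'topic', 'partition', 'leader' (broker id or None),
    'replicas' (assigned broker ids) and 'isr' (in-sync broker ids).
    """
    alerts = []
    for p in partitions:
        name = f"{p['topic']}-{p['partition']}"
        if p["leader"] is None:
            # No broker is serving reads or writes for this partition.
            alerts.append((name, "offline"))
        elif len(p["isr"]) < len(p["replicas"]):
            # At least one replica has fallen out of sync with the leader.
            alerts.append((name, "under-replicated"))
    return alerts
```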
The second problem is the lack of correlated troubleshooting in data-first applications. Since distributed transaction tracing is not possible across these frameworks, there is no correlated, end-to-end performance view that can serve as the foundation for troubleshooting. For example, APM would not be able to establish the root cause in the following scenario: “latest data not showing up on an Elasticsearch-powered UI” because “Elasticsearch indexing rate dropped” because “Apache Spark failed task rate going up” because “no available memory on Apache Spark worker nodes.”
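One way to picture correlated troubleshooting for the scenario above is to walk a dependency map from the visible symptom toward the deepest component that is also misbehaving. The sketch below is a deliberately simple illustration under that assumption; the component names, the `depends_on` map, and the `root_cause_chain` function are all hypothetical.

```python
def root_cause_chain(symptom, depends_on, anomalous):
    """Follow anomalous dependencies from a symptom toward a likely root cause.

    `depends_on` maps a component to the components it relies on.
    `anomalous` is the set of components currently showing an anomaly.
    Returns the chain of components, starting with the symptom.
    """
    chain = [symptom]
    seen = {symptom}
    current = symptom
    while True:
        # Pick the first upstream dependency that is also anomalous.
        upstream = next(
            (d for d in depends_on.get(current, [])
             if d in anomalous and d not in seen),
            None,
        )
        if upstream is None:
            return chain
        chain.append(upstream)
        seen.add(upstream)
        current = upstream
```

With a map such as UI → Elasticsearch → Spark tasks → Spark workers and all four flagged as anomalous, the returned chain ends at the Spark workers, matching the “because” sequence in the scenario.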
A New Solution
OpsClarity is a unified application and infrastructure monitoring solution for the data-first world. In addition to providing deep visibility into the individual data frameworks (and the associated microservices and elastic infrastructure) through completely automated metric and metadata collection, OpsClarity applies data science constructs such as anomaly detection and event correlation to rapidly troubleshoot issues using common concerns such as throughput, latency, error rate, and back pressure.
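To make the idea of anomaly detection on a metric concrete, here is a minimal z-score sketch. It is a generic textbook technique chosen for illustration and is not a description of how OpsClarity detects anomalies; the function name and the default threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, pstdev


def is_anomalous(history, value, threshold=3.0):
    """Flag `value` if it lies more than `threshold` standard deviations
    from the mean of `history` (a list of recent metric samples)."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold
```

The same check can be applied to throughput, latency, or error-rate series; production systems typically also account for seasonality and trends, which this sketch ignores.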
While OpsClarity can certainly co-exist with an APM tool in a data-first application, we see customers choosing to cover the application code layer by writing custom metrics to the OpsClarity Custom Metrics API. These custom metrics are then available for correlation with data framework and elastic infrastructure metrics.