Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.
Logging is arguably the most important element of your observability solution. Logs provide foundational and rich information about system behavior. In an ideal world, you would make all the decisions about logging and implement a consistent approach across your entire system. However, in the real world, you might work with legacy software or deal with different programming languages, frameworks, and open-source packages, each with its own format and structure for logging. With such a diversity in log formats across your system, what steps can you take to extract the most value from all your logs? That’s what we’ll cover in this post. We’ll look at how logs can be designed, the challenges and solutions to logging in large systems, and how to think about log-based metrics and long-term retention. Let’s dive in with a look at log levels and formats. Logging Design Many considerations go into log design, but the two most important aspects are the use of log levels and whether to use structured or unstructured log formats. Log Levels Log levels are used to categorize log messages based on their severity. Specific log levels used may vary depending on the logging framework or system. However, commonly used log levels include (in order of verbosity, from highest to lowest): TRACE: Captures every action the system takes, for reconstructing a comprehensive record and accounting for any state change. DEBUG: Captures detailed information for debugging purposes. These messages are typically only relevant during development and should not be enabled in production environments. INFO: Provides general information about the system's operation to convey important events or milestones in the system's execution. WARNING: Indicates potential issues or situations that might require attention. These messages are not critical but should be noted and investigated if necessary. ERROR: Indicates errors that occurred during the execution of the system. These messages typically highlight issues that need to be addressed and might impact the system's functionality. Logging at the appropriate level helps with understanding the system's behavior, identifying issues, and troubleshooting problems effectively. When it comes to system components that you build, we recommend that you devote some time to defining the set of log levels that are useful. Understand what kinds of information should be included in messages at each log level, and use the log levels consistently. Later, we’ll discuss how to deal with third-party applications, where you have no control over the log levels. We’ll also look at legacy applications that you control but are too expansive to migrate to the standard log levels. Structured Versus Unstructured Logs Entries in structured logs have a well-defined format, usually as key-value pairs or JSON objects. This allows for consistent and machine-readable log entries, making it easier to parse and analyze log data programmatically. Structured logging enables advanced log querying and analysis, making it particularly useful in large-scale systems. On the other hand, unstructured (free-form) logging captures messages in a more human-readable format, without a predefined structure. This approach allows developers to log messages more naturally and flexibly. However, programmatically extracting specific information from the resulting logs can be very challenging. Choosing between structured and unstructured logs depends on your specific needs and the requirements and constraints of your system. 
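To make the structured-versus-unstructured trade-off concrete, here is a minimal, dependency-free Java sketch that logs the same hypothetical "order processed" event both ways. The class, event, and field names are invented for illustration; a real system would normally rely on a structured-logging encoder rather than hand-built JSON.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

public class CheckoutLogger {
    private static final Logger LOG = Logger.getLogger(CheckoutLogger.class.getName());

    // Unstructured: readable for humans, but hard to parse programmatically.
    static void logUnstructured(String orderId, long elapsedMs) {
        LOG.log(Level.INFO, "Order " + orderId + " processed in " + elapsedMs + " ms");
    }

    // Structured: the same event expressed as consistent key-value pairs rendered as JSON.
    static void logStructured(String orderId, long elapsedMs) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("timestamp", Instant.now().toString());
        entry.put("level", "INFO");
        entry.put("event", "order_processed");
        entry.put("order_id", orderId);
        entry.put("elapsed_ms", elapsedMs);
        LOG.log(Level.INFO, toJson(entry));
    }

    // Minimal JSON rendering to keep the example self-contained.
    private static String toJson(Map<String, Object> fields) {
        StringBuilder sb = new StringBuilder("{");
        fields.forEach((k, v) -> {
            if (sb.length() > 1) sb.append(",");
            sb.append("\"").append(k).append("\":");
            sb.append(v instanceof Number ? v : "\"" + v + "\"");
        });
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        logUnstructured("A-1042", 87);
        logStructured("A-1042", 87);
    }
}
```

The structured variant is what makes field-based queries (for example, filtering on order_id or aggregating elapsed_ms) straightforward later on.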
If you anticipate the need for advanced log analysis or integration with log analysis tools, structured logs can provide significant benefits. However, if all you need is simplicity and readability, then unstructured logs may be sufficient. In some cases, a hybrid approach can also be used, where you use structured logs for important events and unstructured logs for more general messages. For large-scale systems, you should lean towards structured logging when possible, but note that this adds another dimension to your planning. The expectation for structured log messages is that the same set of fields will be used consistently across system components. This will require strategic planning. Logging Challenges With systems comprising multiple components, each component will most likely have its own model to manage its logs. Let’s review the challenges this brings. Disparate Destinations Components will log to different destinations—files, system logs, stdout, or stderr. In distributed systems, collecting these scattered logs for effective use is cumbersome. For this, you’ll need a diversified approach to log collection, such as using installed collectors and hosted collectors from Sumo Logic. Varying Formats Some components will use unstructured, free-form logging, not following any format in particular. Meanwhile, structured logs may be more organized, but components with structured logs might employ completely different sets of fields. Unifying the information you get from a diversity of logs and formats requires the right tools. Inconsistent Log Levels Components in your system might use different ranges of log levels. Even if you consolidate all log messages into a centralized logging system (as you should), you will need to deal with the union of all log levels. One challenge that arises is when different log levels ought to be treated the same. For example, ERROR in one component might be the same as CRITICAL in another component, requiring immediate escalation. You face the opposite challenge when the same log level in different components means different things. For example, INFO messages in one component may be essential for understanding the system state, while in another component they might be too verbose. Log Storage Cost Large distributed systems accumulate a lot of logs. Collecting and storing these logs isn’t cheap. Log-related costs in the cloud can make up a significant portion of the total cost of the system. Dealing With These Challenges While the challenges of logging in large, distributed systems are significant, solutions can be found through some of the following practices. Aggregate Your Logs When you run a distributed system, you should use a centralized logging solution. As you run log collection agents on each machine in your system, these collectors will send all the logs to your central observability platform. Sumo Logic, which has always focused on log management and analytics, is best in class when it comes to log aggregation. Move Toward a Unified Format Dealing with logs in different formats is a big problem if you want to correlate log data for analytics and troubleshooting across applications and components. One solution is to transform different logs into a unified format. The level of effort for this task can be high, so consider doing this in phases, starting with your most essential components and working your way down. 
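As a rough illustration of the unified-format idea described above, the following sketch parses one hypothetical legacy log line into a shared structured record. The record fields, the regular expression, and the component name are assumptions made for the example, not a prescription for your system.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyLogNormalizer {
    // A unified record shared by all components after normalization.
    record UnifiedLogEvent(String timestamp, String level, String component, String message) {}

    // Example legacy format: "2023-08-01 12:00:03 ERROR Payment failed for order 42"
    private static final Pattern LEGACY = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) (TRACE|DEBUG|INFO|WARNING|ERROR) (.+)$");

    static Optional<UnifiedLogEvent> normalize(String component, String rawLine) {
        Matcher m = LEGACY.matcher(rawLine);
        if (!m.matches()) {
            // Route unparseable lines elsewhere for review instead of dropping them silently.
            return Optional.empty();
        }
        return Optional.of(new UnifiedLogEvent(m.group(1), m.group(2), component, m.group(3)));
    }

    public static void main(String[] args) {
        normalize("billing-service", "2023-08-01 12:00:03 ERROR Payment failed for order 42")
                .ifPresent(System.out::println);
    }
}
```

Doing this in phases means writing one such normalizer per legacy format, starting with the components whose logs you query most often.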
Establish a Logging Standard Across Your Applications

For your own applications, work to establish a standard logging approach that adopts a uniform set of log levels, a single structured log format, and consistent semantics. If you also have legacy applications, evaluate the level of risk and cost associated with migrating them to adhere to your standard. If a migration is not feasible, treat your legacy applications like you would third-party applications.

Enrich Logs From Third-Party Sources

Enriching logs from third-party sources involves enhancing log data with contextual information from external systems or services. This brings a better understanding of log events, aiding in troubleshooting, analysis, and monitoring activities. To enrich your logs, you can integrate external systems (such as APIs or message queues) to fetch supplementary data related to log events (such as user information, customer details, or system metrics).

Manage Log Volume, Frequency, and Retention

Carefully managing log volume, frequency, and retention is crucial for efficient log management and storage.
Volume: Monitoring generated log volume helps you control resource consumption and performance impacts.
Frequency: Determine how often to log, based on the criticality of events and desired level of monitoring.
Retention: Define a log retention policy appropriate for compliance requirements, operational needs, and available storage.
Rotation: Periodically archive or purge older log files to manage log file sizes effectively.
Compression: Compress log files to reduce storage requirements.

Log-Based Metrics

Metrics that are derived from analyzing log data can provide insights into system behavior and performance. Working with log-based metrics has its benefits and challenges.

Benefits
Granular insights: Log-based metrics provide detailed and granular insights into system events, allowing you to identify patterns, anomalies, and potential issues.
Comprehensive monitoring: By leveraging log-based metrics, you can monitor your system comprehensively, gaining visibility into critical metrics related to availability, performance, and user experience.
Historical analysis: Log-based metrics provide historical data that can be used for trend analysis, capacity planning, and performance optimization. By examining log trends over time, you can make data-driven decisions to improve efficiency and scalability.
Flexibility and customization: You can tailor your extraction of log-based metrics to suit your application or system, focusing on the events and data points that are most meaningful for your needs.

Challenges
Defining meaningful metrics: Because the set of metrics available to you across all your components is incredibly vast—and it wouldn’t make sense to capture them all—identifying which metrics to capture and extract from logs can be a complex task. This identification requires a deep understanding of system behavior and close alignment with your business objectives.
Data extraction and parsing: Parsing logs to extract useful metrics may require specialized tools or custom parsers. This is especially true if logs are unstructured or formatted inconsistently from one component to the next. Setting this up can be time-consuming and may require maintenance as log formats change or new log sources emerge.
Need for real-time analysis: Delays in processing log-based metrics can lead to outdated or irrelevant metrics. For most situations, you will need a platform that can perform fast, real-time processing of incoming data in order to leverage log-based metrics effectively.
Performance impact: Continuously capturing component profiling metrics places additional strain on system resources. You will need to find a good balance between capturing sufficient log-based metrics and maintaining adequate system performance.
Data noise and irrelevance: Log data often includes a lot of noise and irrelevant information that does not contribute toward meaningful metrics. Careful log filtering and normalization are necessary to focus data gathering on relevant events.

Long-Term Log Retention

After you’ve made the move toward log aggregation in a centralized system, you will still need to consider long-term log retention policies. Let’s cover the critical questions for this area.

How Long Should You Keep Logs Around?

How long you should keep a log around depends on several factors, including:
Log type: Some logs (such as access logs) can be deleted after a short time. Other logs (such as error logs) may need to be kept for a longer time in case they are needed for troubleshooting.
Regulatory requirements: Industries like healthcare and finance have regulations that require organizations to keep logs for a certain time, sometimes even a few years.
Company policy: Your company may have policies that dictate how long logs should be kept.
Log size: If your logs are large, you may need to rotate them or delete them more frequently.
Storage cost: Regardless of where you store your logs—on premises or in the cloud—you will need to factor in the cost of storage.

How Do You Reduce the Level of Detail and Cost of Older Logs?

Deleting old logs is, of course, the simplest way to reduce your storage costs. However, it may be a bit heavy-handed, and you may sometimes want to keep information from old logs around. When you want to keep information from old logs but also want to be cost-efficient, consider taking some of these measures:
Downsampling logs: In the case of components that generate many repetitive log statements, you might ingest only a subset of the statements (for example, 1 out of every 10).
Trimming logs: For logs with large messages, you might discard some fields. For example, if an error log has an error code and an error description, you might have all the information you need by keeping only the error code.
Compression and archiving: You can compress old logs and move them to cheaper and less accessible storage (especially in the cloud). This is a great solution for logs that you need to store for years to meet regulatory compliance requirements.

Conclusion

In this article, we’ve looked at how to get the most out of logging in large-scale systems. Although logging in these systems presents a unique set of challenges, we’ve looked at potential solutions to these challenges, such as log aggregation, transforming logs to a unified format, and enriching logs with data from third-party sources. Logging is a critical part of observability. By following the practices outlined in this article, you can ensure that your logs are managed effectively, enabling you to troubleshoot problems, identify issues, and gain insights into the behavior of your system. And you can do this while keeping your logging costs under control.
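Before moving on, here is a small illustrative sketch of the downsampling measure mentioned above: keeping roughly 1 out of every N occurrences of a repetitive log statement. The class and method names are invented for the example; real ingestion pipelines usually offer equivalent sampling rules out of the box.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class LogDownsampler {
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();
    private final int keepOneOutOf;

    public LogDownsampler(int keepOneOutOf) {
        this.keepOneOutOf = keepOneOutOf;
    }

    // Returns true only for 1 out of every N occurrences of the same message key.
    public boolean shouldIngest(String messageKey) {
        long seen = counters.computeIfAbsent(messageKey, k -> new AtomicLong()).incrementAndGet();
        return seen % keepOneOutOf == 1;
    }

    public static void main(String[] args) {
        LogDownsampler sampler = new LogDownsampler(10);
        int kept = 0;
        for (int i = 0; i < 100; i++) {
            if (sampler.shouldIngest("cache miss for key user-profile")) {
                kept++;
            }
        }
        System.out.println("kept " + kept + " of 100 repetitive statements"); // prints 10
    }
}
```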
In the world of modern software development, meticulous monitoring and robust debugging are paramount. With the rise of reactive programming paradigms, Spring WebFlux has emerged as a powerful framework for building reactive, scalable, and highly performant applications. However, as complexity grows, so does the need for effective logging mechanisms. Enter the realm of logging input requests in Spring WebFlux — a practice that serves as a critical foundation for both diagnosing issues and ensuring application security. Logging, often regarded as the unsung hero of software development, provides developers with invaluable insights into their applications' inner workings. Through comprehensive logs, developers can peer into the execution flow, troubleshoot errors, and track the journey of each request as it traverses through the intricate layers of their Spring WebFlux application. But logging is not a one-size-fits-all solution; it requires thoughtful configuration and strategic implementation to strike the balance between informative insights and performance overhead. In this article, we embark on a journey through the landscape of Spring WebFlux and delve into the art of logging input requests. We'll explore the nuances of intercepting and capturing crucial details of incoming requests, all while maintaining security and privacy standards. By the end, you'll be equipped with the knowledge to empower your Spring WebFlux application with insightful logs, fostering enhanced debugging, streamlined monitoring, and a fortified security posture. So, fasten your seatbelts as we unravel the techniques, best practices, and considerations for logging input requests in Spring WebFlux, and learn how this practice can elevate your application development to new heights. Action Although WebFilters are frequently employed to log web requests, we will choose to utilize AspectJ for this scenario. Assuming that all the endpoints in our project are located within a package named "controller" and that Controller classes end with the term "Controller," we can craft an advice method as depicted below. @Aspect @Component public class RequestLoggingAspect { @Around("execution (* my.cool.project.controller.*..*.*Controller.*(..))") public Object logInOut(ProceedingJoinPoint joinPoint) { Class<?> clazz = joinPoint.getTarget().getClass(); Logger logger = LoggerFactory.getLogger(clazz); Date start = new Date(); Object result = null; Throwable exception = null; try { result = joinPoint.proceed(); if (result instanceof Mono<?> monoOut) { return logMonoResult(joinPoint, clazz, logger, start, monoOut); } else if (result instanceof Flux<?> fluxOut) { return logFluxResult(joinPoint, clazz, logger, start, fluxOut); } else { return result; } } catch (Throwable e) { exception = e; throw e; } finally { if (!(result instanceof Mono<?>) && !(result instanceof Flux<?>)) { doOutputLogging(joinPoint, clazz, logger, start, result, exception); } } } } The RequestLoggingAspect stands out for its adept handling of diverse return types, including Flux, Mono, and non-WebFlux, within a Spring WebFlux framework. Employing the AspectJ @Around annotation, it seamlessly intercepts methods in "Controller" classes, offering tailored logging for each return type. Below is the logMonoResult method, which efficiently logs with contextView to retrieve contextual data from the WebFlux environment. This method adeptly handles Mono return types, capturing various scenarios while maintaining a structured logging approach. 
It gracefully integrates deferred contextual information and ensures seamless logging of different outcomes. From handling empty results to tracking successes and errors, the logMonoResult method seamlessly facilitates detailed logging within the Spring WebFlux context: private <T, L> Mono<T> logMonoResult(ProceedingJoinPoint joinPoint, Class<L> clazz, Logger logger, Date start, Mono<T> monoOut) { return Mono.deferContextual(contextView -> monoOut .switchIfEmpty(Mono.<T>empty() .doOnSuccess(logOnEmptyConsumer(contextView, () -> doOutputLogging(joinPoint, clazz, logger, start, "[empty]", null)))) .doOnEach(logOnNext(v -> doOutputLogging(joinPoint, clazz, logger, start, v, null))) .doOnEach(logOnError(e -> doOutputLogging(joinPoint, clazz, logger, start, null, e))) .doOnCancel(logOnEmptyRunnable(contextView, () -> doOutputLogging(joinPoint, clazz, logger, start, "[cancelled]", null))) ); } Likewise, the logFluxResult method is presented below. This method orchestrates comprehensive logging while seamlessly incorporating the contextView to obtain contextual information from the WebFlux environment. By accommodating diverse scenarios, such as empty results or cancellations, the logFluxResult method optimizes logging within the Spring WebFlux ecosystem: private <T> Flux<T> logFluxResult(ProceedingJoinPoint joinPoint, Class<?> clazz, Logger logger, Date start, Flux<T> fluxOut) { return Flux.deferContextual(contextView -> fluxOut .switchIfEmpty(Flux.<T>empty() .doOnComplete(logOnEmptyRunnable(contextView, () -> doOutputLogging(joinPoint, clazz, logger, start, "[empty]", null)))) .doOnEach(logOnNext(v -> doOutputLogging(joinPoint, clazz, logger, start, v, null))) .doOnEach(logOnError(e -> doOutputLogging(joinPoint, clazz, logger, start, null, e))) .doOnCancel(logOnEmptyRunnable(contextView, () -> doOutputLogging(joinPoint, clazz, logger, start, "[cancelled]", null))) ); } Let's delve into the details of the logOnNext, logOnError, logOnEmptyConsumer, and logOnEmptyRunnable methods, explaining how they contribute to comprehensive request logging. These methods encapsulate intricate logging procedures and utilize the contextView to maintain contextual information from the WebFlux environment. The combination of MDC (Mapped Diagnostic Context) and signal processing ensures precise logging under various scenarios: logOnNext Method: The logOnNext method is designed to log information when a signal indicates a successful next event. It uses the signal's contextView to extract contextual variables such as transaction ID ( TRX_ID) and path URI ( PATH_URI). Later we will describe how such values can be put to context. These variables are then included in the MDC to enable consistent tracking throughout the logging process. The logging statement is encapsulated within the MDC context, guaranteeing that the correct transaction and path details are associated with the log statement. This approach ensures that successful events are accurately logged within the relevant context. 
private static <T> Consumer<Signal<T>> logOnNext(Consumer<T> logStatement) { return signal -> { if (!signal.isOnNext()) return; String trxIdVar = signal.getContextView().getOrDefault(TRX_ID, ""); String pathUriVar = signal.getContextView().getOrDefault(PATH_URI, ""); try (MDC.MDCCloseable trx = MDC.putCloseable(TRX_ID, trxIdVar); MDC.MDCCloseable path = MDC.putCloseable(PATH_URI, pathUriVar)) { T t = signal.get(); logStatement.accept(t); } }; } logOnError Method: The logOnError method mirrors the behavior of logOnNext, but it focuses on error events. It extracts the contextual variables from the signal's contextView and places them in the MDC. This ensures that errors are logged in the proper context, making it easier to identify the specific transaction and path associated with the error event. By encapsulating the error log statement within the MDC, this method ensures that error logs are informative and appropriately contextualized. public static <T> Consumer<Signal<T>> logOnError(Consumer<Throwable> errorLogStatement) { return signal -> { if (!signal.isOnError()) return; String trxIdVar = signal.getContextView().getOrDefault(TRX_ID, ""); String pathUriVar = signal.getContextView().getOrDefault(PATH_URI, ""); try (MDC.MDCCloseable trx = MDC.putCloseable(TRX_ID, trxIdVar); MDC.MDCCloseable path = MDC.putCloseable(PATH_URI, pathUriVar)) { errorLogStatement.accept(signal.getThrowable()); } }; } logOnEmptyConsumer and logOnEmptyRunnable Methods: Both of these methods deal with scenarios where the signal is empty, indicating that there's no result to process. The logOnEmptyConsumer method is designed to accept a Consumer and executes it when the signal is empty. It retrieves the contextual variables from the provided contextView and incorporates them into the MDC before executing the log statement. private static <T> Consumer<T> logOnEmptyConsumer(final ContextView contextView, Runnable logStatement) { return signal -> { if (signal != null) return; String trxIdVar = contextView.getOrDefault(TRX_ID, ""); String pathUriVar = contextView.getOrDefault(PATH_URI, ""); try (MDC.MDCCloseable trx = MDC.putCloseable(TRX_ID, trxIdVar); MDC.MDCCloseable path = MDC.putCloseable(PATH_URI, pathUriVar)) { logStatement.run(); } }; } private static Runnable logOnEmptyRunnable(final ContextView contextView, Runnable logStatement) { return () -> { String trxIdVar = contextView.getOrDefault(TRX_ID, ""); String pathUriVar = contextView.getOrDefault(PATH_URI, ""); try (MDC.MDCCloseable trx = MDC.putCloseable(TRX_ID, trxIdVar); MDC.MDCCloseable path = MDC.putCloseable(PATH_URI, pathUriVar)) { logStatement.run(); } }; } In both cases, these methods ensure that the correct context, including transaction and path details, is established through MDC before executing the log statements. This allows for consistent and meaningful logging even in situations where there is no explicit result to process. To introduce the transaction ID and path variables into the WebFlux context, consider the following WebFilter configuration. As a @Bean with highest priority, the slf4jMdcFilter extracts the request's unique ID and path URI, incorporating them into the context. This ensures that subsequent processing stages, including the RequestLoggingAspect, can seamlessly access this enriched context for precise and comprehensive request handling. 
@Bean @Order(Ordered.HIGHEST_PRECEDENCE) WebFilter slf4jMdcFilter() { return (exchange, chain) -> { String requestId = exchange.getRequest().getId(); return chain.filter(exchange) .contextWrite(Context.of(Constants.TRX_ID, requestId) .put(Constants.PATH_URI, exchange.getRequest().getPath())); }; } Ultimately, for comprehensive logging of diverse request types, the inclusion of a method named doOutputLogging becomes essential. While a detailed implementation of this method is beyond our scope, it serves as a conduit for logging incoming expressions, either via a tailored logger to match your scenario or potentially routed to a database or alternate platform. This method can be customized to align precisely with your distinct necessities and specifications. private <T> void doOutputLogging(final ProceedingJoinPoint joinPoint, final Class<?> clazz, final Logger logger, final Date start, final T result, final Throwable exception) { //log(...); //db.insert(...); } Summary In summary, effective request logging in Spring WebFlux is pivotal for debugging and enhancing application performance. By leveraging AspectJ and WebFilters, developers can simplify the process of logging input and output across diverse endpoints. The showcased RequestLoggingAspect efficiently handles different return types, while the slf4jMdcFilter WebFilter enriches logs with transaction and path data. Although the logMonoResult, logFluxResult, and doOutputLogging methods serve as adaptable templates, they offer customization options to suit specific needs. This empowers developers to tailor logging to their preferences, whether for internal logs or external data storage.
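The article deliberately leaves doOutputLogging as a stub. Purely as an illustration, and not part of the original post, here is one shape it could take inside the RequestLoggingAspect shown earlier: log the invoked signature, elapsed time, and outcome through the SLF4J Logger the aspect already resolves, and swap in database writes or anything else your scenario requires.

```java
// Hypothetical sketch; lives inside RequestLoggingAspect, which already imports
// org.aspectj.lang.ProceedingJoinPoint, org.slf4j.Logger, and java.util.Date.
private <T> void doOutputLogging(final ProceedingJoinPoint joinPoint, final Class<?> clazz,
                                 final Logger logger, final Date start,
                                 final T result, final Throwable exception) {
    long elapsedMs = new Date().getTime() - start.getTime();
    String method = joinPoint.getSignature().toShortString(); // e.g., MyController.getOrders(..)
    if (exception != null) {
        // Placeholders plus a trailing Throwable: SLF4J logs the stack trace.
        logger.error("{} failed after {} ms", method, elapsedMs, exception);
    } else {
        logger.info("{} returned {} in {} ms", method, result, elapsedMs);
    }
}
```

Because the MDC entries (TRX_ID, PATH_URI) are populated by the surrounding consumers, a logback or log4j2 pattern that references those keys will automatically attach the transaction and path to every line this method emits.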
Once we press the merge button, that code is no longer our responsibility. If it performs sub-optimally or has a bug, it is now the problem of the DevOps team, the SREs, etc. Unfortunately, those teams work with a different toolset. If my code uses up too much RAM, they will increase RAM. If the code runs slower, they will increase CPU. If the code crashes, they will increase the number of concurrent instances. If none of that helps, they will call you up at 2 AM.

A lot of these problems are visible before they become a disastrous middle-of-the-night call. Yes, DevOps should control production, but the information they gather from production is useful for all of us. This is at the core of developer observability, which is a subject I’m quite passionate about. I’m so excited about it that I dedicated a chapter to it in my debugging book. Back when I wrote that chapter, I dedicated most of it to active developer observability tools like Lightrun, Rookout, et al. These tools work like production debuggers, and they are fantastic in that regard. When I have a bug and know where to look, I can sometimes reach for one of these tools (I used to work at Lightrun, so I always use it). But there are other ways. Tools like Lightrun are active in their observability; we add a snapshot, similar to a breakpoint, and get the type of data we expect. I recently started playing with Digma, which takes a radically different approach to developer observability. To understand that, we might need to revisit some concepts of observability first.

Observability Isn’t Pillars

I’ve been guilty of listing the pillars of observability just as much as the next guy. They’re even in my book (sorry). To be fair, I also discussed what observability really means: observability means we can ask questions about our system and get answers, or at least have a clearly defined path to get those answers. That sounds simple when running locally, but what happens when you have a sophisticated production environment and someone asks you: is anyone even using that block of code? How do you know? You might have lucked out and had a log in that code, and you might be lucky again and find that the log is at the right level and piped properly so you can check. The problem is that if you added too many logs or too much observability data, you might have created a cure worse than the disease: over-logging or over-observing. Both can drag down your performance and significantly impact the bank account, so ideally, we don’t want too many logs (I discuss over-logging here), and we don’t want too much observability. Existing developer observability tools work actively. To answer the question of whether someone is using the code, I can place a counter on the line and wait for results. I can give it a week's timeout and find out in a week. Not a terrible situation, but not ideal either. I don’t have that much patience.

Tracing and OpenTelemetry

It’s a sad state of affairs that most developers don’t use tracing in their day-to-day job. For those of you who don’t know it, it is like a call stack for the cloud. It lets us see the stack across servers and through processes. No, not method calls. More at the entry-point level, but this often contains details like the database queries that were made and similarly deep insights. There’s a lot of history with OpenTelemetry, which I don’t want to get into. If you’re an observability geek, you already know it, and if not, then it’s boring. What matters is that OpenTelemetry is taking over the world of tracing. It’s a runtime agent, which means you just add it to the server, and you get tracing information almost seamlessly. It’s magic. It also doesn’t have a standard server, which makes it very confusing. That means multiple vendors can use a single agent and display the information it collects to various demographics:
A vendor focused on performance can show the timing of various parts of the system.
A vendor focused on troubleshooting can detect potential bugs and issues.
A vendor focused on security can detect potentially risky access.

Background Developer Observability

I’m going to coin a term here since there isn’t one: background developer observability. What if the data you need was already there, and a system had already collected it for you in the background? That’s what Digma is doing. In Digma's terms, it's called Continuous Feedback. Essentially, they’re collecting OpenTelemetry data, analyzing it, and displaying it as information that’s useful for developers. If Lightrun is like a debugger, then Digma is like SonarQube based on actual runtime and production information. The cool thing is that you probably already use OpenTelemetry without even knowing it. DevOps probably installed that agent already, and the data is already there! Going back to my question: is anyone using this API? If you use Digma, you can see that right away. OpenTelemetry already collected the information in the background, and the DevOps team already paid the price of collection. We can benefit from that too.

Enough Exposition

I know, I go on… Let’s get to the meat and potatoes of why this rocks. Notice that this is a demo; when running locally, the benefits are limited. The true value of these tools is in understanding production; still, they can provide a lot of insight even when running locally and even when running tests. Digma has a simple and well-integrated setup wizard for IntelliJ IDEA. You need to have Docker Desktop running for the setup to succeed. Note that you don’t need to run your application using Docker. This is simply for the Digma server process, where they collect the execution details. Once it is installed, we can run our application. In my case, I just ran the JPA unit test from my latest book, and it produced standard traces, which are already pretty cool. We can see them listed below:

When we click a trace for one of these, we get the standard trace view. This is nothing new, but it’s really nice to see this information directly in the IDE and readily accessible. I can imagine the immense value this will have for figuring out CI execution issues:

But the real value, and where Digma becomes a “Developer Observability” tool instead of an observability tool, is with the tool window here:

There is a strong connection to the code directly from the observability data and deeper analysis, which doesn’t show in my particular, overly simplistic hello world. This tool window highlights problematic traces and errors and helps understand real-world issues.

How Does This Help at 2 AM?

Disasters happen because we aren’t looking. I’d like to say I open my observability dashboard regularly, but I don’t. Then, when there’s a failure, it takes me a while to get my bearings within it. The locality of the applicable data is important. It helps us notice issues when they happen, detect regressions before they turn into failures, and understand the impact of the code we just merged. Prevention starts with awareness, and as developers, we handed our situational awareness to the DevOps team.

When the failure actually happens, the locality and accessibility of the data make a big difference. Since we use tools that integrate into the IDE daily, this reduces the mean time to a fix. Granted, a background developer observability tool might not include the information we need to fix a problem. But if it does, then the information is already there, and we need nothing else. That is fantastic.

Final Word

With all the discussion about observability and OpenTelemetry, you would think everyone is using them. Unfortunately, the reality is far from that. Yes, there’s some saturation and familiarity in the DevOps crowd. This is not the case for developers. This is a form of environmental blindness. How can our teams, who are so driven by data and facts, proceed with secondhand and often outdated data from Ops? Should I spend time further optimizing this method, or will I waste the effort since few people use it? We can benchmark things locally just fine, but real-world usage and impact are where we all need better visibility.
Are you interested in open-source observability but lack the knowledge to just dive right in? This workshop is for you, designed to expand your knowledge and understanding of open-source observability tooling that is available to you today. Dive right into a free, online, self-paced, hands-on workshop introducing you to Prometheus. Prometheus is an open-source systems monitoring and alerting tool kit that enables you to hit the ground running with discovering, collecting, and querying your observability today. Over the course of this workshop, you will learn what Prometheus is, what it is not, install it, start collecting metrics, and learn all the things you need to know to become effective at running Prometheus in your observability stack. Previously, I shared an introduction to Prometheus, installing Prometheus, an introduction to the query language, exploring basic queries, using advanced queries, and relabeling metrics in Prometheus as free online labs. In this article, you'll learn all about discovering service targets in Prometheus. Your learning path takes you into the wonderful world of service discovery in Prometheus, where you explore the more realistic and dynamic world of cloud-native services that automatically scale up and down. Note this article is only a short summary, so please see the complete lab found online to work through it in its entirety yourself: The following is a short overview of what is in this specific lab of the workshop. Each lab starts with a goal. In this case, it is as follows: This lab provides an understanding of how service discovery is used in Prometheus for locating and scraping targets for metrics collection. You're learning by setting up a service discovery mechanism to dynamically maintain a list of scraping targets. You start in this lab exploring the service discovery architecture Prometheus provides and how it is supporting all manner of automated discovery of dynamically scaling targets in your infrastructure, the basic definitions of what service discovery needs to achieve, knowing what targets should exist, knowing how to pull metrics from those targets, and how to use the associated target metadata. You then dive into the two options for installing the lab demo environments, either using source projects or in open-source containers for the exercises later in this lab. The Demo Environment Whether you install it using source projects or containers, you'll be setting up the following architecture to support your service discovery exercises using the services demo to ensure your local infrastructure contains the following: Production 1 running at http://localhost:11111 Production 2 running at http://localhost:22222 Development running at http://localhost:44444 Note that if you have any port conflicts on your machine, you can map any free port numbers you like, making this exercise very flexible across your available machines. Next, you'll be setting up a file-based discovery integration with Prometheus that allows your applications and pipelines to modify a file for dynamic targeting of the infrastructure you want to scrape. 
This file (targets.yml) in our exercise will look something like this if you are targeting the above infrastructure: - targets: - "localhost:11111" - "localhost:22222" labels: job: "services" env: "production" - targets: - "localhost:44444" labels: job: "services" env: "development" Configuring your Prometheus instance requires a new file-based discovery section in your workshop-prometheus.yml file: # workshop config global: scrape_interval: 5s scrape_configs: # Scraping Prometheus. - job_name: "prometheus" static_configs: - targets: ["localhost:9090"] # File based discovery. - job_name: "file-sd-workshop" file_sd_configs: - files: - "targets.yml" After saving your configuration and starting your Prometheus instance, you are then shown how to verify that the target infrastructure is now being scraped: Next up, you'll start adding dynamic changes to your target file and see that they are automatically discovered by Prometheus without having to restart your instance. Exploring Dynamic Discovery The rest of the lab walks through multiple exercises where you make dynamic changes and verify that Prometheus is able to automatically scale to the needs of your infrastructure. For example, you'll first change the infrastructure you have deployed by promoting the development environment to become the staging infrastructure for your organization. First, you update the targets file: - targets: - "localhost:11111" - "localhost:22222" labels: job: "services" env: "production" - targets: - "localhost:44444" labels: job: "services" env: "staging" Then you verify that the changes are picked up, this time using a PromQL query and the Prometheus console without having to restart your Prometheus instance: Later in the lab, you are given exercises to fly solo and add a new testing environment so that the end results of your dynamically growing observability infrastructure contain production, staging, testing, and your Prometheus instance: Missed Previous Labs? This is one lab in the more extensive free online workshop. Feel free to start from the very beginning of this workshop here if you missed anything previously: You can always proceed at your own pace and return any time you like as you work your way through this workshop. Just stop and later restart Perses to pick up where you left off. Coming Up Next I'll be taking you through the following lab in this workshop where you'll learn all about instrumenting your applications for collecting Prometheus metrics. Stay tuned for more hands-on material to help you with your cloud-native observability journey.
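One closing aside on the file-based discovery used above: the lab assumes your applications and delivery pipelines keep targets.yml up to date themselves. If you want to script that rather than editing the file by hand, here is a minimal sketch, written in Java only to stay consistent with the other examples in this collection; the class name and target lists mirror the lab's example but are otherwise assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class TargetsFileUpdater {
    // Renders the same structure as the hand-written targets.yml above.
    static String render(List<String> production, List<String> staging) {
        StringBuilder yaml = new StringBuilder();
        appendGroup(yaml, production, "production");
        appendGroup(yaml, staging, "staging");
        return yaml.toString();
    }

    private static void appendGroup(StringBuilder yaml, List<String> targets, String env) {
        if (targets.isEmpty()) return;
        yaml.append("- targets:\n");
        for (String t : targets) {
            yaml.append("    - \"").append(t).append("\"\n");
        }
        yaml.append("  labels:\n")
            .append("    job: \"services\"\n")
            .append("    env: \"").append(env).append("\"\n");
    }

    public static void main(String[] args) throws IOException {
        String yaml = render(List.of("localhost:11111", "localhost:22222"),
                             List.of("localhost:44444"));
        // Write to a temp file first, then rename, so Prometheus is unlikely
        // to read a half-written targets file.
        Path tmp = Path.of("targets.yml.tmp");
        Files.writeString(tmp, yaml);
        Files.move(tmp, Path.of("targets.yml"), StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Because Prometheus re-reads file_sd files on change, running this updater is enough for the new target set to be picked up without restarting the server, exactly as the lab demonstrates.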
One of the components of my OpenTelemetry demo is a Rust application built with the Axum web framework. In its description, axum mentions: axum doesn't have its own middleware system but instead uses tower::Service. This means axum gets timeouts, tracing, compression, authorization, and more, for free. It also enables you to share middleware with applications written using hyper or tonic. — axum README So far, I was happy to let this cryptic explanation lurk in the corner of my mind, but today is the day I want to understand what it means. Like many others, this post aims to explain to me and others how to do this. The tower crate offers the following information: Tower is a library of modular and reusable components for building robust networking clients and servers. Tower provides a simple core abstraction, the Service trait, which represents an asynchronous function taking a request and returning either a response or an error. This abstraction can be used to model both clients and servers. Generic components, like timeouts, rate limiting, and load balancing, can be modeled as Services that wrap some inner service and apply additional behavior before or after the inner service is called. This allows implementing these components in a protocol-agnostic, composable way. Typically, such services are referred to as middleware. — tower crate Tower is designed around Functional Programming and two main abstractions, Service and Layer. In its simplest expression, a Service is a function that reads an input and produces an output. It consists of two methods: One should call poll_ready() to ensure that the service can process requests call() processes the request and returns the response asynchronously Because calls can fail, the return value is wrapped in a Result. Moreover, since Tower deals with asynchronous calls, the Result is wrapped in a Future. Hence, a Service transforms a Self::Request into a Future<Result>, with Request and Response needing to be defined by the developer. The Layer trait allows composing Services together. Here's a slightly more detailed diagram: A typical Service implementation will wrap an underlying component; the component may be a service itself. Hence, you can chain multiple features by composing various functions. The call() function implementation usually executes these steps in order, all of them being optional: Pre-call Call the wrapped component Post-call For example, a logging service could log the parameters before the call, call the logged component, and log the return value after the call. Another example would be a throttling service, which limits the rate of calls of the wrapped service: it would read the current status before the call and, if above a configured limit, would return immediately without calling the wrapped component. It will call the component and increment the status if the status is valid. The role of a layer would be to take one service and wrap it into the other. With this in mind, it's relatively easy to check the axum-tracing-opentelemetry crate and understand what it does. It offers two services with their respective layers: one is to extract the trace and span IDs from an HTTP request, and another is to send the data to the OTEL collector. 
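If the Service and Layer abstractions still feel abstract, the following sketch expresses the same wrap-and-compose idea in Java, the language used elsewhere in this collection. This is emphatically not Tower's API, just the pattern: an asynchronous request-to-response function, plus wrappers that add pre-call and post-call behavior, including a toy throttling wrapper that short-circuits above a limit.

```java
import java.util.concurrent.CompletableFuture;

public class MiddlewareSketch {
    // Rough analogue of tower::Service: an async function from request to response.
    interface Service<Req, Res> {
        CompletableFuture<Res> call(Req request);
    }

    // Rough analogue of a Layer: wraps one service to produce another.
    static Service<String, String> logging(Service<String, String> inner) {
        return request -> {
            System.out.println("pre-call:  " + request);           // pre-call step
            return inner.call(request)                              // call the wrapped component
                    .whenComplete((res, err) ->                     // post-call step
                            System.out.println("post-call: " + (err != null ? err : res)));
        };
    }

    static Service<String, String> throttling(Service<String, String> inner, int limit) {
        final int[] inFlight = {0};
        return request -> {
            synchronized (inFlight) {
                if (inFlight[0] >= limit) {
                    // Above the limit: fail fast without calling the wrapped component.
                    return CompletableFuture.failedFuture(new RuntimeException("throttled"));
                }
                inFlight[0]++;
            }
            return inner.call(request)
                    .whenComplete((res, err) -> { synchronized (inFlight) { inFlight[0]--; } });
        };
    }

    public static void main(String[] args) {
        Service<String, String> handler = request -> CompletableFuture.completedFuture("hello, " + request);
        // Compose wrappers around the innermost service, outermost first.
        Service<String, String> composed = logging(throttling(handler, 2));
        composed.call("axum").join();
    }
}
```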
Note that Tower comes with several out-of-the-box services, each available via a feature crate: balance: load-balance requests buffer: MPSC buffer discover: service discovery filter: conditional dispatch hedge: retry slow requests limit: limit requests load: load measurement retry: retry failed requests timeout: timeout requests Finally, note that Tower comes in three crates: tower is the public crate, while tower-service and tower-layer are considered less stable. In this post, we have explained what is the Tower library: it's a Functional Programming library that provides function composition. If you come from the Object-Oriented Programming paradigm, it's similar to the Decorator pattern. It builds upon two abstractions, Service is the function and Layer composes functions. It's widespread in the Rust ecosystem, and learning it is a good investment. To go further: Axum Tower documentation Tower crate Axum_tracing_opentelemetry documentation
Intro to Istio Observability Using Prometheus

Istio service mesh abstracts the network from the application layers using sidecar proxies. You can implement security and advanced networking policies for all the communication across your infrastructure using Istio. But another important feature of Istio is observability. You can use Istio to observe the performance and behavior of all your microservices in your infrastructure (see the image below). One of the primary responsibilities of site reliability engineers (SREs) in large organizations is to monitor the golden metrics of their applications, such as CPU utilization, memory utilization, latency, and throughput.

In this article, we will discuss how SREs can benefit from integrating three open-source tools: Istio, Prometheus, and Grafana. While Istio is the most famous service mesh software, Prometheus is the most widely used monitoring software, and Grafana is the most famous visualization tool. Note: The steps are tested for Istio 1.17.X.

Watch the Video of Istio, Prometheus, and Grafana Configuration

Watch the video if you want to follow the steps from the video:

Step 1: Go to Istio Add-Ons and Apply Prometheus and Grafana YAML File

First, go to the add-ons folder in the Istio directory. Since I am using 1.17.1, the path for me is istio-1.17.1/samples/addons. You will notice that Istio already provides a few YAML files to configure Grafana, Prometheus, Jaeger, Kiali, etc. You can configure Prometheus and Grafana by using the following commands:

kubectl apply -f prometheus.yaml
kubectl apply -f grafana.yaml

Note that these add-on YAMLs are applied to the istio-system namespace by default.

Step 2: Deploy New Service and Port-Forward Istio Ingress Gateway

To experiment with the working model, we will deploy the httpbin service to an Istio-enabled namespace. We will create an object of the Istio ingress gateway to receive the traffic to the service from the public. We will also port-forward the Istio ingress gateway to a particular port, 7777. You should see the screen below at localhost:7777.

Step 3: Open Prometheus and Grafana Dashboards

You can open the Prometheus and Grafana dashboards by using the following commands:

istioctl dashboard prometheus
istioctl dashboard grafana

Both Grafana and Prometheus will open on localhost.

Step 4: Make HTTP Requests From Postman

We will see how much CPU and memory the httpbin service consumes when there is a traffic load. We will create a few GET and POST requests to localhost:7777 from the Postman app. Once you send GET or POST requests to the httpbin service multiple times, resources will be utilized, and we can see that in Grafana. But first, we need to configure the metrics for the httpbin service in Prometheus and Grafana.

Step 5: Configuring Metrics in Prometheus

One can select a range of metrics related to any Kubernetes resources, such as the API server, applications, workloads, Envoy, etc. We will select the container_memory_working_set_bytes metric for our configuration. In the Prometheus application, we will select the namespace to scrape the metrics using the following search term:

container_memory_working_set_bytes{namespace="istio-telemetry"}

(istio-telemetry is the name of our Istio-enabled namespace, where the httpbin service is deployed.)

Note that simply running this gives us the memory for our namespace. Since we want to analyze the memory usage of our pods, we can calculate the total memory consumed by summing the memory usage of each pod, grouped by pod. The following query will help us get the desired result:

sum(container_memory_working_set_bytes{namespace="istio-telemetry"}) by (pod)

Note: Prometheus provides a lot of flexibility to filter, slice, and dice the metric data. The central idea of this article is to showcase the ability of Istio to emit and send metrics to Prometheus for collection.

Step 6: Configuring Istio Metrics Graphs in Grafana

Now, you can simply take the query sum(container_memory_working_set_bytes{namespace="istio-telemetry"}) by (pod) from Prometheus and plot a graph over time. All you need to do is create a new dashboard in Grafana and paste the query into the metrics browser. Grafana will plot a time-series graph. You can edit the graph with proper names, legends, and titles for sharing with other stakeholders in the Ops team. There are several ways to tweak and customize the data and depict the Prometheus metrics in Grafana. You can choose to make all the customizations based on your enterprise needs. I have done a few experiments in the video; feel free to check it out.

Conclusion

Istio service mesh is extremely powerful in providing overall observability across the infrastructure. In this article, we have just offered a small use case of metrics scraping and visualization using Istio, Prometheus, and Grafana. You can also perform logging and tracing of real-time traffic using Istio; we will cover those topics in our subsequent blogs.
It is impossible to know with certainty what is happening inside a remote or distributed system, even if they are running on a local machine. Telemetry is precisely what provides all the information about the internal processes. This data can take the form of logs, metrics, and Distributed Traces. In this article, I will explain all of the forms separately. I will also explain the benefits of OpenTelemetry protocol and how you can configure telemetry flexibly. Telemetry: Why Is It Needed? It provides insights into the current and past states of a system. This can be useful in various situations. For example, it can reveal that system load reached 70% during the night, indicating the need to add a new instance. Metrics can also show when errors occurred and which specific traces and actions within those traces failed. It demonstrates how users interact with the system. Telemetry allows us to see which user actions are popular, which buttons users frequently click, and so on. This information helps developers, for example, to add caching to actions. For businesses, this data is important for understanding how the system is actually being used by people. It highlights areas where system costs can be reduced. For instance, if telemetry shows that out of four instances, only one is actively working at 20% capacity, it may be possible to eliminate two unnecessary instances (with a minimum of two instances remaining). This allows for adjusting the system's capacity and reducing maintenance costs accordingly. Optimizing CI/CD pipelines. These processes can also be logged and analyzed. For example, if a specific step in the build or deployment process is taking a long time, telemetry can help identify the cause and when these issues started occurring. It can also provide insights for resolving such problems. Other purposes. There can be numerous non-standard use cases depending on the system and project requirements. You collect the data, and how it is processed and utilized depends on the specific circumstances. Logging everything may be necessary in some cases, while in others, tracing central services might be sufficient. No large IT system or certain types of businesses can exist without telemetry. Therefore, this process needs to be maintained and implemented in projects where it is not yet present. Types of Data in Telemetry Logs Logs are the simplest type of data in telemetry. There are two types of logs: Automatic logs: These are generated by frameworks or services (such as Azure App Service). With automatic logs, you can log details such as incoming and outgoing requests, as well as the contents of the requests. No additional steps are required to collect these logs, which is convenient for routine tasks. Manual logs: These logs need to be manually triggered. They are not as straightforward as automatic logs, but they are justified for logging important parts of the system. Typically, these logs capture information about resource-intensive processes or those related to specific business tasks. For example, in an education system, it would be crucial not to lose students' test data for a specific period. info: ReportService.ReportServiceProcess[0] Information: {"Id":0,"Name":"65", "TimeStamp":"2022-06-15T11:09:16.2420721Z"} info: ReportService. ReportServiceProcess[0] Information: {"Id":0,"Name":"85","TimeStamp":"2022-06-15T11:09:46.5739821Z"} Typically, there is no need to log all data. 
Evaluate the system and identify (if you haven't done so already) the most vulnerable and valuable parts of the system. Most likely, those areas will require additional logging. Sometimes, you will need to employ log-driven programming. In my experience, there was a desktop application project on WPF that had issues with multi-threading. The only way to understand what was happening was to log every step of the process.

Metrics

Metrics are more complex data compared to logs. They can be valuable for both development teams and businesses. Metrics can also be categorized as automatic or manual:
Automatic metrics are provided by the system itself. For example, in Windows, you can see metrics such as CPU utilization, request counts, and more. The same principle applies to the Monitoring tab when deploying a virtual machine on AWS or Azure. There, you can find information about the amount of data coming into or going out of the system.
Manual metrics can be added by you. For instance, when you need to track the current number of subscriptions to a service. This can be implemented using logs, but metrics provide a more visual and easily understandable representation, especially for clients.

Distributed Trace

This data is necessary for working with distributed systems that are not running on a single instance. In such cases, we don't know which instance or service is handling a specific request at any given time. It all depends on the system architecture. Here are some possible scenarios: In the first diagram, the client sends a request to the BFF, which then separately forwards it to three services. In the center, we see a situation where the request goes from the first service to the second, and then to the third. The diagram on the right illustrates a scenario where a service sends requests to a Message Broker, which further distributes them between the second and third services. I'm sure you've come across similar systems, and there are countless examples.

These architectures are quite different from monoliths. In systems with a single instance, we have visibility into the call stack from the controller to the database. Therefore, it is relatively easy to track what happened during a specific API call. Most likely, the framework provides this information. However, in distributed systems, we can't see the entire flow. Each service has its own logging system. When sending a request to the BFF, we can see what happens within that context. However, we don't know what happened within services 1, 2, and 3. This is where distributed traces come in. Here is an example of how it works: Let's examine this path in more detail…

The User Action goes to the API Gateway, then to Service A, and further to Service B, resulting in a call to the database. When these requests are sent to the system, we receive a trace similar to the one shown. Here, the duration of each process is clearly visible: from User Action to the Database. For example, we can see that the calls were made sequentially. The time between the API Gateway and Service A was spent on setting up the HTTP connection, while the time between Service B and the Database was needed for database setup and data processing. Therefore, we can assess how much time was spent on each operation.

This is possible thanks to the Correlation ID mechanism. What is the essence of it? Typically, in monolithic applications, logs and actions are tied to a process ID or thread ID during logging. Here, the mechanism is the same, but we manually add the ID to the requests.
Let's look at an example: When the Order Service action starts in the Web Application, it sees the added Correlation ID. This allows the service to understand that it is part of a chain and passes the "marker" to the next services. They, in turn, see themselves as part of a larger process. As a result, each component logs data in a way that allows the system to see everything happening during a multi-stage action. The transmission of the Correlation ID can be done in different ways. For example, in HTTP, this data is often passed as one of the header parameters. In Message Broker services, it is typically written inside the message. However, there are likely SDKs or libraries available in each platform that can help implement this functionality. How OpenTelemetry Works Often, the telemetry format of an old system is not supported in a new one. This leads to many issues when transitioning from one system to another. For example, this was the case with AppInsight and CloudWatch. The data was not grouped properly, and something was not working as expected. OpenTelemetry helps overcome such problems. It is a data transfer protocol in the form of unified libraries from OpenCensus and OpenTracing. The former was developed by Google for collecting metrics and traces, while the latter was created by Uber experts specifically for traces. At some point, the companies realized that they were essentially working on the same task. Therefore, they decided to collaborate and create a universal data representation format. Thanks to the OTLP protocol, logs, metrics, and traces are sent in a unified format. According to the OpenTelemetry repository, prominent IT giants contribute to this project. It is in demand in products that collect and display data, such as Datadog and New Relic. It also plays a significant role in systems that require telemetry, including Facebook, Atlassian, Netflix, and others. Key Components of the OTLP Protocol Cross-language specification: This is a set of interfaces that need to be implemented to send logs, metrics, and traces to a telemetry visualization system. SDK: These are implemented parts in the form of automatic traces, metrics, and logs. Essentially, they are libraries connected to the framework. With them, you can view the necessary information without writing any code. There are many SDKs available for popular programming languages. However, they have different capabilities. Pay attention to the table. Tracing has stable versions everywhere except for the PHP and JS SDKs. On the other hand, metrics and logs are not yet well-implemented in many languages. Some have only alpha versions, some are experimental, and in some cases, the protocol implementation is missing altogether. From my experience, I can say that everything works fine with services on .NET. It provides easy integration and reliable logging. Collector: This is the main component of OpenTelemetry. It is a software package that is distributed as an exe, pkg, or Docker file. The collector consists of four components: Receivers: These are the data sources for the collector. Technically, logs, metrics, and traces are sent to the receivers. They act as access points. Receivers can accept OTLP from Jaeger or Prometheus. Processors: These can be launched for each data type. They filter data, add attributes, and customize the process for specific system or project requirements. Exporters: These are the final destinations for sending telemetry. From here, data can be sent to OTLP, Jaeger, or Prometheus. 
Extensions: These tools extend the functionality of the collector. One example is the health_check extension, which allows sending a request to an endpoint to check if the collector is working. Extensions provide various insights, such as the number of receivers and exporters in the system and their operation status. In this diagram, we have two types of data: metrics and logs (represented by different colors). Logs go through their processor to Jaeger, while metrics go through another processor, have their own filter, and are sent to two data sources: OTLP and Prometheus. This provides flexible data analysis capabilities, as different software has different ways of displaying telemetry. An interesting point: data can be received from OpenTelemetry and sent back to it. In certain cases, you can send the same data to the same collector. OTLP Deployment There are many ways to build a telemetry collection system. One of the simplest schemes is shown in the illustration below. It involves a single .NET service that sends OpenTelemetry directly to New Relic: If needed, the scheme can be enhanced with an agent. The agent can act as a host service or a background process within the service, collecting data and sending it to New Relic: Moving forward, let's add another application to the scheme (e.g., a Node.js application). It will send data directly to the collector, while the first application will do it through its own agent using OTLP. The collector will then send the data to two systems. For example, metrics will go to New Relic, and logs will go to Datadog: You can also add Prometheus as a data source here. For instance, when someone on the team prefers this tool and wants to use it. However, the data will still be collected in New Relic and Datadog: The telemetry system can be further complicated and adapted to your project. Here's another example: Here, there are multiple collectors, each collecting data in its own way. The agent in the .NET application sends data to both New Relic and the collector. One collector can send information to another because OTLP is sent to a different data source. It can perform any action with the data. As a result, the first collector filters the necessary data and passes it to the next one. The final collector distributes logs, metrics, and traces among New Relic, Datadog, and Azure Monitor. This mechanism allows you to analyze telemetry in a way that is convenient for you. Exploring OpenTelemetry Capabilities Let's dive into the practical aspects of OpenTelemetry and examine its features. For this test, I've created a project based on the following diagram: It all starts with an Angular application that sends HTTP requests to a Python application. The Python application, in turn, sends requests to .NET and Node.js applications, each working according to its own scenario. The .NET application sends requests to Azure Service Bus and handles them in the Report Service, also sending metrics about the processed requests. Additionally, .NET sends requests to MS SQL. The Node.js requests go to Azure Blob Queue and Google. This system emulates some workflow. All applications utilize automatic tracing systems to send traces to the collector. Let's begin by dissecting the docker-compose file. 
version: "2"
services:
  postal-service:
    build:
      context: ../Postal Service
      dockerfile: Dockerfile
    ports:
      - "7120:80"
    environment:
      - AZURE_EXPERIMENTAL_ENABLE_ACTIVITY_SOURCE=true
    depends_on:
      - mssql
  report-service:
    build:
      context: ../Report
      dockerfile: Dockerfile
    ports:
      - "7133:80"
    environment:
      - AZURE_EXPERIMENTAL_ENABLE_ACTIVITY_SOURCE=true
    depends_on:
      - mssql
  billing-service:

The file contains the setup for multiple BFF (Backend For Frontend) services. Among the commented-out sections, we have Jaeger, which helps visualize traces.

    ports:
      - "5000:5000"
  #jaeger-all-in-one:
  #  image: jaegertracing/all-in-one:latest
  #  ports:
  #    - "16686:16686"
  #    - "14268"
  #    - "14250"

There is also Zipkin, another software for trace visualization.

  # Zipkin
  zipkin-all-in-one:
    image: openzipkin/zipkin:latest
    ports:
      - "9411:9411"

MS SQL and the collector are included as well. The collector specifies a config file and various ports to which data can be sent.

  # Collector
  otel-collector:
    image: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:0.51.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "1888:1888"   # pprof extension
      - "13133:13133" # health_check extension
      - "4317:4317"
      - "4318:4318"
      - "55678:55679" # zpages extension
    depends_on:
      - jaeger-all-in-one
      - zipkin-all-in-one

The config file includes key topics: receivers, exporters, processors, extensions, and the service itself, which acts as the constructor for all of this.

otel-collector-config.yaml

receivers:
  otlp:
    protocols:
      grpc:
      http:
        cors:
          allowed_origins:
            - http://localhost:4200
          max_age: 7200

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    const_labels:
      label1: value1
  logging:

There is a single receiver, otlp, which represents the OpenTelemetry Protocol. Other receivers can be added as well (such as Prometheus). The receiver can be configured, and in my example, I set up the allowed_origins.

receivers:
  otlp:
    protocols:
      grpc:
      http:
        cors:
          allowed_origins:
            - http://localhost:4200
          max_age: 7200

Next are exporters. They allow metrics to be sent to Prometheus.

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    const_labels:
      label1: value1
  logging:

Then come the extensions. In this case, there is a health_check extension, which serves as an endpoint to check the collector's activity.

extensions:
  health_check:
  pprof:
    endpoint: :1888
  zpages:
    endpoint: :55679

Lastly, we have a service with pipelines, traces, and metrics. This section clarifies the data type, its source, processing, and destination. In this example, traces from the receiver are batched and exported to two tracing backends (Zipkin and Jaeger), while metrics are sent to Prometheus.

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [zipkin, jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Now, let's see how it works in practice. The frontend sends requests to the BFF, and the BFF forwards them to the backend services. We send a few requests and observe the resulting traces. Among them, we see some requests with a 500 status. To understand what went wrong, we look at the traces through Zipkin. The detailed description of the problematic request shows that the frontend called the BFF, which then sent two synchronous requests, one after the other. Through the traces, we can learn where this request was directed, the URL it targeted, and the HTTP method used. All this information is generated based on automatic data.
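As a rough sketch of how that automatic data is produced in the .NET services (assuming the standard OpenTelemetry .NET packages; the service name and collector endpoint are illustrative and simply mirror the docker-compose file above, not the demo's actual code), tracing can be wired up at startup:

using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// Automatic instrumentation: incoming ASP.NET Core requests, outgoing HTTP calls,
// and SQL client calls become spans and are shipped to the collector via OTLP.
builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource.AddService("postal-service"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation()
        .AddOtlpExporter(options => options.Endpoint = new Uri("http://otel-collector:4317")));

var app = builder.Build();
app.Run();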
Additionally, manual traces can be added to make the infographic more informative. We also see that BFF called BILLINGSERVICE. In it, there are middleware processes and requests sent to Azure, including an HTTP POST request that resulted in a CREATED status. The system also sets up and sends requests to Google. There is also POSTALSERVICE, where one request failed. Taking a closer look, we see the error description: "ServiceBusSender has already been closed...". Therefore, one must be cautious with ServiceBusSender in the future. Here, we can also observe multiple requests being sent to MS SQL. Finally, we obtain a comprehensive infographic of all the processes in the system. However, I want to warn you that things are not always as transparent. In our case, two traces are, as they say, "out of context." Nothing is clear about them: where they are executed, what happens with them, and there are minimal details. Sometimes this happens, and you need to be prepared. As an option, you can add manual traces. Let's take a look at how metrics are sent to Prometheus. The illustration shows that the additional request was successfully sent. There was one request, and now there are five. Therefore, metrics are working properly. In the .NET application, requests are sent to Azure Service Bus, and they are processed by the Report Service. However, in Zipkin, there was no Report Service. Nevertheless, the metrics show that it is functioning. So, remember that not everything in OTLP works as expected everywhere. I know libraries that add traces to message brokers by default, and you can see them in the stack. However, this functionality is still considered experimental. Let's not forget about health_check. It shows whether our collector is functioning. {"status":"Server available","upSince":"2022-06-17T15:49:00.320594Z","uptime":"56m4.4995003s"} Now let's send data to Jaeger as well (by adding it as a new trace exporter). After starting it, we need to resend the requests since it does not receive previous data. We receive a list of services like this: We have similar traces to those in Zipkin, including ones with a 500 status. I personally like the System Architecture tab, which displays a system graph. It shows that everything starts with a request to the BFF, which is then redirected to BillingService and PostalService. This exemplifies how different tools display data in their unique ways. Lastly, let's look at the Order request. In it, we can find the request and the generated trace ID. If you specify this trace ID in the system, you can learn what happened in the request and thoroughly investigate the HTTP call. This way, the frontend learns that it is the first to receive the User Action. In the same way, the frontend understands that it needs to create a trace that will be passed along the chain and send data to the collector. The collector collects and sends the data to Jaeger, Zipkin, and Prometheus. Therefore, the advantages of using the OpenTelemetry Protocol are evident. It is a flexible system for collecting, processing, and sending telemetry. It is particularly convenient in combination with Docker, which I used in creating this demo. However, always remember the limitations of OTLP. When it comes to traces, everything works quite well. However, the feasibility of using this protocol for metrics and logs depends on the readiness of specific system libraries and SDKs.
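As noted above, manual traces can fill the gaps that automatic instrumentation leaves. A minimal sketch in .NET (hypothetical source and span names; OpenTelemetry .NET builds manual tracing on the System.Diagnostics ActivitySource API) might look like this:

using System.Diagnostics;
using System.Threading.Tasks;

public static class Telemetry
{
    // The same name must be registered with .AddSource("Demo.Billing") on the
    // tracer provider so these spans are exported alongside the automatic ones.
    public static readonly ActivitySource Source = new("Demo.Billing");
}

public class BillingHandler
{
    public async Task ChargeAsync(string orderId)
    {
        // Starts a manual span; it becomes a child of the current request's trace.
        using var activity = Telemetry.Source.StartActivity("charge-customer");
        activity?.SetTag("order.id", orderId);

        // ... call the payment provider, publish to the service bus, etc.
        await Task.CompletedTask;
    }
}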
2023 has seen rapid growth in cloud-native applications and platforms. Organizations are constantly striving to maximize the potential of their applications, ensure seamless user experiences, and drive business growth. The rise of hybrid cloud environments and the adoption of containerization technologies, such as Kubernetes, have revolutionized the way modern applications are developed, deployed, and scaled. In this digital arena, Kubernetes is the platform of choice for most cloud-native applications and workloads and is adopted across industries. According to a 2022 report, 96% of companies are already either using or evaluating the implementation of Kubernetes in their cloud systems. This popular open-source utility provides container orchestration, service discovery, load balancing, and other capabilities. However, with this transformation comes a new set of challenges. As the complexity of applications increases, so does the need for robust observability solutions that enable businesses to gain deep insights into their containerized workloads. Enter Kubernetes observability: a critical aspect of managing and optimizing containerized applications in hybrid cloud environments. In this blog post, we will delve into Kubernetes observability, exploring six effective strategies that can empower businesses to unlock the full potential of their containerized applications in hybrid cloud environments. These strategies, backed by industry expertise and real-world experiences, will equip you with the tools and knowledge to enhance the observability of your Kubernetes deployments, driving business success. Understanding Observability in Kubernetes Let us first start with the basics. Kubernetes is a powerful tool for managing containerized applications. But despite its powerful features, keeping track of what's happening in a hybrid cloud environment can be difficult. This is where observability comes in. Observability is the practice of collecting, analyzing, and acting on data about a particular environment. In the context of Kubernetes, observability refers to gaining insights into the behavior, performance, and health of containerized applications running within a Kubernetes cluster. Kubernetes observability is based on three key pillars: 1. Logs: Logs provide valuable information about the behavior and events within a Kubernetes cluster. They capture important details such as application output, system errors, and operational events. Analyzing logs helps troubleshoot issues, understand application behavior, and identify patterns or anomalies. 2. Metrics: Metrics are quantitative measurements that provide insights into a Kubernetes environment's performance and resource utilization. They include CPU usage, memory consumption, network traffic, and request latency information. Monitoring and analyzing metrics help identify performance bottlenecks, plan capacity, and optimize resource allocation. 3. Traces: Traces enable end-to-end visibility into the flow of requests across microservices within a Kubernetes application. Distributed tracing captures timing data and dependencies between different components, providing a comprehensive understanding of request paths. Traces help identify latency issues, understand system dependencies, and optimize critical paths for improved application performance. Kubernetes observability processes typically involve collecting and analyzing data from various sources to understand the system's internal state and provide actionable intelligence.
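As a quick, hands-on illustration (the deployment and namespace names are hypothetical), two of the three pillars can be sampled directly with kubectl, while traces require instrumentation such as OpenTelemetry:

# Logs: recent output from the pods behind a deployment.
kubectl logs deploy/checkout -n shop --all-containers --since=15m

# Metrics: current CPU and memory per pod (requires the metrics-server add-on).
kubectl top pod -n shop

# Cluster events: a useful complement to logs when pods crash or fail to schedule.
kubectl get events -n shop --sort-by=.lastTimestamp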
By implementing the right observability strategies, you can gain a deep understanding of your applications and infrastructure, which will help you to:
Detect and troubleshoot problems quickly
Improve performance and reliability
Optimize resource usage
Meet compliance requirements
Observability processes are being adopted at a rapid pace by IT teams. By 2026, 70% of organizations will have successfully applied observability to achieve shorter latency for decision-making while increasing distributed, organized, and simplified data management processes. 1. Use Centralized Logging and Log Aggregation For gaining insights into distributed systems, centralized logging is an essential strategy. In Kubernetes environments, where applications span multiple containers and nodes, collecting and analyzing logs from various sources becomes crucial. Centralized logging involves consolidating logs from different components into a single, easily accessible location. The importance of centralized logging lies in its ability to provide a holistic view of your system's behavior and performance. With Kubernetes logging, you can correlate events and identify patterns across your Kubernetes cluster, enabling efficient troubleshooting and root-cause analysis. To implement centralized logging in Kubernetes, you can leverage robust log aggregation tools or cloud-native solutions like Amazon CloudWatch Logs or Google Cloud Logging. These tools provide scalable and efficient ways to collect, store, and analyze logs from your Kubernetes cluster. 2. Leverage Distributed Tracing for End-to-End Visibility In a complex Kubernetes environment with microservices distributed across multiple containers and nodes, understanding the flow of requests and interactions between different components becomes challenging. This is where distributed tracing comes into play, providing end-to-end visibility into the execution path of requests as they traverse through various services. Distributed tracing allows you to trace a request's journey from its entry point to all the microservices it touches, capturing valuable information about each step. By instrumenting your applications with tracing libraries or agents, you can generate trace data that reveals each service's duration, latency, and potential bottlenecks. The benefits of leveraging distributed tracing in Kubernetes are significant. Firstly, it helps you understand the dependencies and relationships between services, enabling better troubleshooting and performance optimization. When a request experiences latency or errors, you can quickly identify the service or component responsible and take corrective actions. Secondly, distributed tracing allows you to measure and monitor the performance of individual services and their interactions. By analyzing trace data, you can identify performance bottlenecks, detect inefficient resource usage, and optimize the overall responsiveness of your system. This information is invaluable with regard to capacity planning and ensuring scalability in your Kubernetes environment. Several popular distributed tracing solutions are available. These tools provide the necessary instrumentation and infrastructure to effectively collect and visualize trace data. By integrating these solutions into your Kubernetes deployments, you can gain comprehensive visibility into the behavior of your microservices and drive continuous improvement.
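As a sketch of what instrumenting workloads can look like in practice (hypothetical names; this assumes the OpenTelemetry Operator and an Instrumentation resource are already installed in the cluster), auto-instrumentation can be requested with a single pod annotation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 2
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
      annotations:
        # Ask the OpenTelemetry Operator to inject the .NET auto-instrumentation agent.
        instrumentation.opentelemetry.io/inject-dotnet: "true"
    spec:
      containers:
        - name: checkout
          image: example.com/checkout:1.0.0

The operator then injects the language agent at pod startup, so trace data flows to the configured backend without application code changes.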
3. Integrate Kubernetes With APM Solutions To achieve comprehensive observability in Kubernetes, it is essential to integrate your environment with Application Performance Monitoring (APM) solutions. APM solutions provide advanced monitoring capabilities beyond traditional metrics and logs, offering insights into the performance and behavior of individual application components. One of the primary benefits of APM integration is the ability to detect and diagnose performance bottlenecks within your Kubernetes applications. With APM solutions, you can trace requests as they traverse through various services and identify areas of high latency or resource contention. Armed with this information, you can take targeted actions to optimize critical paths and improve overall application performance. Many APM solutions offer dedicated Kubernetes integrations that streamline the monitoring and management of containerized applications. These integrations provide pre-configured dashboards, alerts, and instrumentation libraries that simplify capturing and analyzing APM data within your Kubernetes environment. 4. Use Metrics-Based Monitoring Metrics-based monitoring forms the foundation of observability in Kubernetes. It involves collecting and analyzing key metrics that provide insights into your Kubernetes clusters and applications' health, performance, and resource utilization. When it comes to metrics-based monitoring in Kubernetes, there are several essential components to consider: Node-Level Metrics: Monitoring the resource utilization of individual nodes in your Kubernetes cluster is crucial for capacity planning and infrastructure optimization. Metrics such as CPU usage, memory usage, disk I/O, and network bandwidth help you identify potential resource bottlenecks and ensure optimal allocation. Pod-Level Metrics: Pods are the basic units of deployment in Kubernetes. Monitoring metrics related to pods allows you to assess their resource consumption, health, and overall performance. Key pod-level metrics include CPU and memory usage, network throughput, and request success rates. Container-Level Metrics: Containers within pods encapsulate individual application components. Monitoring container-level metrics helps you understand the resource consumption and behavior of specific application services or processes. Metrics such as CPU usage, memory usage, and file system utilization offer insights into container performance. Application-Specific Metrics: Depending on your application's requirements, you may need to monitor custom metrics specific to your business logic or domain. These metrics could include transaction rates, error rates, cache hit ratios, or other relevant performance indicators.
Metric-based monitoring architecture diagram
5. Use Custom Kubernetes Events for Enhanced Observability Custom events communicate between Kubernetes components and between Kubernetes and external systems. They can signal important events, such as deployments, scaling operations, configuration changes, or even application-specific events within your containers. By leveraging custom events, you can achieve several benefits in terms of observability: Proactive Monitoring: Custom events allow you to define and monitor specific conditions that require attention. For example, you can create events to indicate when resources are running low, when pods experience failures, or when specific thresholds are exceeded. By capturing these events, you can proactively detect and address issues before they escalate.
Contextual Information: Custom events can include additional contextual information that helps troubleshoot and analyze root causes. You can attach relevant details, such as error messages, timestamps, affected resources, or any other metadata that provides insights into the event's significance. This additional context aids in understanding and resolving issues more effectively. Integration with External Systems: Kubernetes custom events can be consumed by external systems, such as monitoring platforms or incident management tools. Integrating with these systems allows you to trigger automated responses or notifications based on specific events. This streamlines incident response processes and ensures the timely resolution of critical issues. To leverage custom Kubernetes events, you can use Kubernetes event hooks, custom controllers, or even develop your own event-driven applications using the Kubernetes API. By defining event triggers, capturing relevant information, and reacting to events, you can establish a robust observability framework that complements traditional monitoring approaches. 6. Incorporating Synthetic Monitoring for Proactive Observability Synthetic monitoring simulates user journeys or specific transactions that represent everyday interactions with your application. These synthetic tests can be scheduled to run regularly from various geographic locations, mimicking user behavior and measuring key performance indicators. There are several key benefits to incorporating synthetic monitoring in your Kubernetes environment: Proactive Issue Detection: Synthetic tests allow you to detect issues before real users are affected. By regularly simulating user interactions, you can identify performance degradations, errors, or unresponsive components. This early detection enables you to address issues proactively and maintain high application availability. Performance Benchmarking: Synthetic monitoring provides a baseline for performance benchmarking and SLA compliance. You can measure response times, latency, and availability under normal conditions by running consistent tests from different locations. These benchmarks serve as a reference for detecting anomalies and ensuring optimal performance. Geographic Insights: Synthetic tests can be configured to run from different geographic locations, providing insights into the performance of your application from various regions. This helps identify latency issues or regional disparities that may impact user experience. By optimizing your application's performance based on these insights, you can ensure a consistent user experience globally. You can leverage specialized tools to incorporate synthetic monitoring into your Kubernetes environment. These tools offer capabilities for creating and scheduling synthetic tests, monitoring performance metrics, and generating reports. Another approach to gaining Kubernetes observability for traditional and microservice-based applications is to use third-party tools like Datadog, Splunk, Middleware, and Dynatrace. These tools capture metrics and events and provide several out-of-the-box reports, charts, and alerts to save time. Wrapping Up This blog explored six practical strategies for achieving Kubernetes observability in hybrid cloud environments.
By utilizing centralized logging and log aggregation, leveraging distributed tracing, integrating Kubernetes with APM solutions, adopting metrics-based monitoring, incorporating custom Kubernetes events, and adding synthetic monitoring, you can enhance your understanding of the behavior and performance of your Kubernetes deployments. Implementing these strategies will provide comprehensive insights into your distributed systems, enabling efficient troubleshooting, performance optimization, proactive issue detection, and improved user experience. Whether you are operating a small-scale Kubernetes environment or managing a complex hybrid cloud deployment, applying these strategies will contribute to the success and reliability of your applications.
APISIX has a health check mechanism that proactively checks the health status of the upstream nodes in your system. Also, APISIX integrates with Prometheus through a plugin that exposes health check metrics for upstream nodes (the multiple instances of a backend API service that APISIX manages) on the Prometheus metrics endpoint, typically on the URL path /apisix/prometheus/metrics. In this article, we'll guide you on how to enable and monitor API health checks using APISIX and Prometheus. Prerequisite(s) Before you start, it is good to have a basic understanding of APISIX. Familiarity with API gateways and key concepts such as routes, upstreams, the Admin API, plugins, and the HTTP protocol will also be beneficial. This guide assumes the following tools are installed locally: Docker, which is used to run the containerized etcd and APISIX, and cURL, which is used to send requests to the services for validation. Start the APISIX Demo Project This project leverages the existing pre-defined Docker Compose configuration file to set up, deploy, and run APISIX, etcd, Prometheus, and other services with a single command. First, clone the apisix-prometheus-api-health-check repo on GitHub, open it in your favorite editor, and start the project by simply running docker compose up from the project root folder. When you start the project, Docker downloads any images it needs to run. You can see the full list of services in the docker-compose.yaml file. Add Health Check API Endpoints in Upstream To check API health periodically, APISIX needs the HTTP path of the health endpoint of the upstream service. So, you first need to add a /health endpoint to your backend service. From there, you can inspect the most relevant signals for that service, such as memory usage, database connectivity, response duration, and more. Assume that we have two backend REST API services, web1 and web2, running in the demo project, and each has its own health check endpoint at the URL path /health. At this point, you do not need to make additional configurations. In reality, you can replace them with your own backend services. The simplest and most standardized way to validate the status of a service is to define a new health check endpoint like /health or /status. Setting Up Health Checks in APISIX This process involves checking the operational status of the 'upstream' nodes. APISIX provides two types of health checks: active checks and passive checks. Read more about health checks and how to enable them here. Use the Admin API to create an Upstream object. Here is an example of creating an Upstream object with two nodes (one per backend service we defined) and configuring the health check parameters in the upstream object: curl "http://127.0.0.1:9180/apisix/admin/upstreams/1" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d ' { "nodes": { "web1:80": 1, "web2:80": 1 }, "checks": { "active": { "timeout": 5, "type": "http", "http_path": "/health", "healthy": { "interval": 2, "successes": 1 }, "unhealthy": { "interval": 1, "http_failures": 2 } } } }' This example configures an active health check on the /health endpoint of the node. It considers the node healthy after one successful health check and unhealthy after two failed health checks. Note that sometimes you might need the IP addresses of the upstream nodes, not their domain names (web1 and web2), if you are running services outside the Docker network. It is by design that the health check will be started only if the number of nodes (resolved IPs) is bigger than 1.
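As a rough sketch of the kind of /health endpoint described above (hypothetical code; the demo's web1 and web2 containers may be implemented differently), a backend could report a simple status together with a dependency check:

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapGet("/health", async () =>
{
    // Check whatever matters for this service: database connectivity, queue depth, etc.
    var dbReachable = await CanReachDatabaseAsync();
    return dbReachable
        ? Results.Ok(new { status = "healthy" })
        : Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
});

app.Run();

// Hypothetical helper standing in for a real connectivity probe.
static Task<bool> CanReachDatabaseAsync() => Task.FromResult(true);

APISIX only needs the HTTP status code: a 200 response marks the node healthy by default, while errors and timeouts count toward the unhealthy threshold configured above.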
Enable the Prometheus Plugin Create a global rule to enable the prometheus plugin on all routes by adding "prometheus": {} in the plugins option. APISIX gathers internal runtime metrics and, by default, exposes them on port 9091 and the URI path /apisix/prometheus/metrics, where Prometheus can scrape them. The export port and URI path can be customized on the APISIX side, while the scrape frequency, extra labels, and other parameters can be adjusted in the Prometheus configuration file /prometheus_conf/prometheus.yml.

curl "http://127.0.0.1:9180/apisix/admin/global_rules" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '{ "id": "rule-for-metrics", "plugins": { "prometheus":{} } }'

Create a Route Create a Route object to route incoming requests to upstream nodes:

curl "http://127.0.0.1:9180/apisix/admin/routes/1" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d ' { "name": "backend-service-route", "methods": ["GET"], "uri": "/", "upstream_id": "1" }'

Send Validation Requests to the Route To generate some metrics, send a few requests to the route we created in the previous step:

curl -i -X GET "http://localhost:9080/"

If you run the above request a couple of times, you can see from the responses that APISIX routes some requests to web1 and others to web2. That's how gateway load balancing works!

HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Content-Length: 10
Connection: keep-alive
Date: Sat, 22 Jul 2023 10:16:38 GMT
Server: APISIX/3.3.0

hello web2

...

HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Content-Length: 10
Connection: keep-alive
Date: Sat, 22 Jul 2023 10:16:39 GMT
Server: APISIX/3.3.0

hello web1

Collecting Health Check Data With the Prometheus Plugin Once the health checks and route are configured in APISIX, you can employ Prometheus to monitor health checks. APISIX automatically exposes health check metrics data for your APIs if the health check parameter is enabled for upstream nodes. You will see metrics in the response after fetching them from APISIX:

curl -i http://127.0.0.1:9091/apisix/prometheus/metrics

Example Output:

# HELP apisix_http_requests_total The total number of client requests since APISIX started
# TYPE apisix_http_requests_total gauge
apisix_http_requests_total 119740
# HELP apisix_http_status HTTP status codes per service in APISIX
# TYPE apisix_http_status counter
apisix_http_status{code="200",route="1",matched_uri="/",matched_host="",service="",consumer="",node="172.27.0.5"} 29
apisix_http_status{code="200",route="1",matched_uri="/",matched_host="",service="",consumer="",node="172.27.0.7"} 12
# HELP apisix_upstream_status Upstream status from health check
# TYPE apisix_upstream_status gauge
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.5",port="443"} 0
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.5",port="80"} 1
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.7",port="443"} 0
apisix_upstream_status{name="/apisix/upstreams/1",ip="172.27.0.7",port="80"} 1

Health check data is represented by the apisix_upstream_status metric. It has labels such as the upstream name, ip, and port. A value of 1 means the upstream node is healthy, and 0 means it is unhealthy. Visualize the Data in the Prometheus Dashboard Navigate to http://localhost:9090/, where the Prometheus instance is running in Docker, and type the expression apisix_upstream_status in the search bar.
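Beyond ad-hoc queries, the same metric can drive alerting. A hypothetical Prometheus alerting rule (not part of the demo project) might fire when a node stays unhealthy:

groups:
  - name: apisix-upstream-health
    rules:
      - alert: ApisixUpstreamNodeUnhealthy
        # apisix_upstream_status is 0 when APISIX reports the node as unhealthy.
        expr: apisix_upstream_status == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Upstream node {{ $labels.ip }}:{{ $labels.port }} is unhealthy"

Depending on your setup, you may want to filter by port or upstream name so that only the node addresses you actually serve traffic on trigger the alert.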
You can also see the output of the health check statuses of upstream nodes on the Prometheus dashboard in the table or graph view: Cleanup Once you are done experimenting with Prometheus and APISIX Gateway health check metrics, you can use the following command to stop and remove the services created in this guide: docker compose down Next Steps You have now learned how to set up and monitor API health checks with Prometheus and APISIX. The APISIX Prometheus plugin is already configured to connect to Grafana automatically to visualize metrics. Keep exploring the data and customize the Grafana dashboard by adding a panel that shows the number of active health checks.
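For such a panel, a couple of hypothetical PromQL expressions (adjust the filters to your own upstream names and ports) could be:

# Number of upstream nodes currently reported healthy (each healthy node has value 1).
sum(apisix_upstream_status)

# Total number of upstream nodes with health checks enabled.
count(apisix_upstream_status)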
If you're running a software system, you need to know what's happening with it: how it's performing, whether it's running as expected, and whether any issues need your attention. And once you spot an issue, you need information so you can troubleshoot. A plethora of tools promises to help with this, from monitoring and APM to observability and everything in between. This has resulted in something of a turf war in the area of observability, where monitoring vendors claim they also do observability, while "observability-first" players disagree and accuse them of observability-washing. So let's take an unbiased look at this and answer a few questions: How are monitoring and observability different, if at all? How effective is each at solving the underlying problem? How does AI impact this space now, and what comes next? What Is Monitoring? A monitoring solution performs three simple actions: Define some "metrics" in advance. Deploy agents to collect these metrics. Display these metrics in dashboards. Note that a metric here is a simple number that captures a quantifiable characteristic of a system. We can then perform mathematical operations on metrics to get different aggregate views. Monitoring has existed for the past 40 years, since the rise of computing systems, and was originally how operations teams kept track of how their infrastructure was behaving. Types of Monitoring Originally, monitoring was most heavily used to keep track of infrastructure behavior: this was infrastructure monitoring. Over time, as applications became more numerous and diverse, we wanted to monitor them as well, leading to the emergence of a category called APM (application performance monitoring). In a modern distributed system, we have several components we want to monitor: infrastructure, applications, databases, networks, data streams, and so on, and the metrics we want differ depending on the component. For instance: Infrastructure monitoring: uptime, CPU utilization, memory utilization. Application performance monitoring: throughput, error rate, latency. Database monitoring: number of connections, query performance, cache hit ratios. Network monitoring: roundtrip time, TCP retransmits, connection churn. ...and so on. These metrics are measures that are generally agreed upon as relevant for that system, and most monitoring tools come pre-built with agents that know which metric to collect and what dashboards to display. As the number of components in distributed systems multiplied, the volume and variety of metrics grew exponentially. To manage this complexity, a separate suite of tools and processes emerged that expanded upon traditional monitoring tools, with time-series databases, SLO systems, and new visualizations. Distinguishing Monitoring Through all this, the core functioning of a monitoring system remains the same, and a monitoring system can be clearly distinguished if: It captures predefined data. The data being collected is a metric (a number). The Goal of Monitoring The goal of a monitoring tool is to alert us when something unexpected is happening in a system. This is akin to an annual medical checkup: we measure a bunch of pre-defined values that will give us an overall picture of our body and let us know if any particular sub-system (organ) is behaving unexpectedly. And just like annual checkups, a monitoring tool may or may not provide any additional information about why something is unexpected.
For that, we'll likely need deeper, more targeted tests and investigations. An experienced physician might still be able to diagnose a condition based on just the overall test, but that is not what the test is designed for. Same with a monitoring solution. What Is Observability? Unlike monitoring, observability is much harder to define. This is because the goal of observability is fuzzier. It is to "help us understand why something is behaving unexpectedly." Logs are the original observability tool that we've been using since the 70s. Until the late 2000s, the way we worked was that traditional monitoring systems would alert us when something went wrong, and logs would help us understand why. However, in the last 15 years, our architectures have gotten significantly more complex. It became near impossible to manually scour logs to figure out what happened. At the same time, our tolerance for downtime decreased dramatically as businesses became more digital, and we could no longer afford to spend hours understanding and fixing issues. We needed more data than we had, so we could troubleshoot issues faster. This led to the rise of the observability industry, whose purpose was to help us understand more easily why our systems were misbehaving. This started with the addition of a new data type called traces, and we said the three pillars of observability were metrics, logs, and traces. Then from there, we kept adding new data types to "improve our observability." The Problem With Observability The fundamental problem with observability is that we don't know what information we might need beforehand. The data we need depends on the issue. The nature of production errors is that they are unexpected and long-tail: if they could've been foreseen, they'd have been fixed already. This is what makes observability fuzzy: there's no clear scope around what and how much to capture. So observability became "any data that could potentially help us understand what is happening." Today, the best way to describe observability as it is implemented is: "everything outside of metrics, plus metrics." Monitoring vs. Observability A perfectly observable system would record everything that happens in production with no data gaps. However, that is impractical and prohibitively expensive, and 99% of the data would be irrelevant anyway, so an average observability platform needs to make complex choices on what and how much telemetry data to capture. Different vendors view this differently, and depending on who you ask, observability seems slightly different. Commonly Cited Descriptions of Observability Are Unhelpful Common articulations of observability, like "observability is being able to observe internal states of a system through its external outputs," are vague; they neither give us a clear indication of what it is nor guide us in deciding whether we have sufficient observability for our needs. In addition, most of the commonly cited markers that purport to distinguish observability from monitoring are also vague, if not outright misleading. Let's look at a few examples: 1. "Monitoring Is Predefined Data; Observability Is Not" In reality, nearly everything we capture in an observability solution today is also predetermined. We define in advance what logs we want to capture, what distributed traces we want to capture (including sampling mechanisms), what context to attach to each distributed trace, and when to capture a stack trace.
We're yet to enter the era of tools that selectively capture data based on what is actually happening in production. 2. "Monitoring Is Simple Dashboards; Observability Is More Complex Analysis and Correlation." This is another promise that's still unmet in practice. Most observability platforms today also just have dashboards; it's just that their dashboards show more than metrics (for example, strings for logs) or can pull up different charts and views based on user instructions. We don't yet have tools that can do any meaningful correlation or add context by themselves to help us understand problems faster. Being able to connect a log and a trace using a unique ID doesn't qualify as complex analysis or correlation, even though the effort required for it may be non-trivial. 3. "Monitoring Is Reactive; Observability Is Proactive." All observability data we collect is pre-defined, and nearly everything we do in production today (including around observability) is reactive. The proactive part was what we did while testing. In production, if something breaks and/or looks unexpected, we respond and investigate. At best, we use SLO systems, which could potentially qualify as proactive. With SLO systems, we predefine an acceptable amount of errors (error budgets) and take action before we surpass them. However, SLO systems are more tightly coupled with monitoring tools, so this is not a particularly relevant distinction between monitoring and observability solutions. 4. "Monitoring Focuses on Individual Components; Observability Reveals Relationships Across Components." This is a distinction created just to make observability synonymous with distributed tracing. Distributed tracing is just one more data type that shows us the relationships across components. Today, distributed tracing must be used in conjunction with other data to be useful. In summary, we have a poorly defined category with no outer boundaries. Then we made up several vague, not very helpful markers to distinguish that category from monitoring, which existed before. This narrative is designed to tell us that there's always some distance to go before we get to "true observability," and always one more tool to buy. As a result, we're continuously expanding the scope of what we need within observability. What Is the Impact of This? An Ever-Increasing List of Data Types for Observability All telemetry data counts as observability because it helps us "observe" the states of our system. Do logs qualify as observability? Yes, because they help us understand what happened in production. Does distributed tracing qualify? Yes. How about error monitoring systems that capture stack traces for exceptions? Yes. How about live debugging systems? Yes. How about continuous profilers? Yes. How about metrics? Also, yes, because they also help us understand the state of our systems. An Ever-Increasing Volume of Observability Data How much data to capture is left to the customer to decide, especially outside of monitoring. How much you want to log, how many distributed traces you want to capture, how many events you want to capture and store, at what intervals, for how long: everything is an open question, with limited guidance on how much is "reasonable" and at what point you might be capturing too much. Companies can spend $1M or as much as $65M on observability; it all depends on who builds what business case. Tool Sprawl and Spending Increase All of the above has led to the amount spent on observability rising rapidly.
Most companies today use five or more observability tools, and monitoring and observability is typically the second-largest infrastructure spend in a company after cloud infrastructure itself, with a market size of ~$17B. Fear and Loss-Aversion Are Underlying Drivers for Observability Expansion The underlying human driver for the adoption of all these tools is fear: "What if something breaks and I don't have enough data to be able to troubleshoot?" This is every engineering team's worst nightmare. This naturally drives teams to capture more and more telemetry data every year so they feel more secure. Yet MTTR Appears to Be Increasing Globally One would expect that with the wide adoption of observability and the aggressive capturing and storing of various types of observability data, MTTR would have dropped dramatically globally. On the contrary, it appears to be increasing, with 73% of companies taking more than an hour to resolve production issues (vs 47% just two years ago). Despite all the investment, we seem to be making incremental progress at best.
Increasing production MTTRs
Where We Are Now So far, we have continued to collect more and more telemetry data in the hope that processing and storage costs would keep dropping to support that. But with exploding data volumes, we ran into a new problem outside of cost, which is usability. It was getting impossible for a human to directly look at tens of dashboards and arrive at conclusions quickly enough. So we created different data views and cuts to make it easier for users to test and validate their hypotheses. But these tools have become too complex for an average engineer to use, and we need specially trained "power users" (akin to data scientists) who are well versed in navigating this pool of data to identify an error. This is the approach many observability companies are taking today: capture more data, have more analytics, and train power users who are capable of using these tools. But these specialized engineers do not have enough information about all the parts of the system to be able to generate good-enough hypotheses. Meanwhile, the average engineer continues to rely largely on logs to debug software issues, and we make no meaningful improvement in MTTR. So all of observability seems like a high-effort, high-spend activity that allows us merely to stay in the same place as our architectures rapidly grow in complexity. So what's next?
Monitoring, observability, and inferencing
Inferencing: The Next Stage After Observability? To truly understand what the next generation would look like, let us start with the underlying goal of all these tools. It is to keep production systems healthy and running as expected and, if anything goes wrong, to allow us to quickly understand why and resolve the issue. If we start there, we can see that there are three distinct levels in how tools can support us: Level 1: "Tell me when something is off in my system"; this is monitoring. Level 2: "Tell me why something is off (and how to fix it)"; let's call this inferencing. Level 3: "Fix it yourself and tell me what you did"; this is auto-remediation. Traditional monitoring tools do Level 1 reasonably well and help us detect issues. We have not yet reached Level 2, where a system can automatically tell us why something is breaking. So we introduced a set of tools called observability that sit somewhere between Level 1 and Level 2 to "help understand why something is breaking" by giving us more data.
Inferencing: Observability Plus AI I'd argue the next step after observability is inferencing, where a platform can reasonably explain why an error occurred so that we can fix it. This becomes possible now in 2023 with the rapid evolution of AI models over the last few months. Imagine a solution that: Automatically surfaces just the errors that need immediate developer attention. Tells the developer exactly what is causing the issue and where the issue is: this pod, this server, this code path, this line of code, for this type of request. Guides the developer on how to fix it. Uses the developer's actual actions to improve its recommendations continuously. Avoiding the Pitfalls of AIOps In any conversation around AI + observability, it's important to remember that this has been attempted before with AIOps, with limited success. It will be important for inferencing solutions to avoid the pitfalls of AIOps. To do that, inferencing solutions would have to be architected from the ground up for the AI use case: data collection, processing, storage, and the user interface all designed for root-causing issues using AI. What it will probably NOT look like is AI added on top of existing observability tools and existing observability data, simply because that is what we attempted, and failed at, with AIOps. Conclusion We explored monitoring and observability and how they differ. We looked at how observability is poorly defined today with loose boundaries, which results in uncontrolled data, tool, and spend sprawl. Meanwhile, the latest progress in AI could resolve some of the issues we have with observability today with a new class of AI-based inferencing solutions. Watch this space for more on this topic!
Joana Carvalho
Site Reliability Engineering,
Virtuoso
Eric D. Schabell
Director Technical Marketing & Evangelism,
Chronosphere
Chris Ward
Zone Leader,
DZone
Ted Young
Director of Open Source Development,
LightStep