Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.
In enterprises, SREs, DevOps engineers, and cloud architects often debate which observability platform to choose for faster troubleshooting of issues and a better understanding of the performance of their production systems. To get maximum value for their teams, they need to answer questions such as: Will the observability tool support all kinds of workloads and heterogeneous systems? Will it support all kinds of data aggregation, such as logs, metrics, traces, topology, etc.? Will the investment in the (ongoing or new) observability tool be justified? In this article, we will cover how to get started with unified observability of your entire infrastructure using open-source Apache SkyWalking and the Istio service mesh.

Istio Service Mesh of a Multi-Cloud Application
Let us take a multi-cloud application where multiple services are hosted on on-premises or managed Kubernetes clusters. The first step toward unified observability is to form a service mesh using Istio. The idea is that every service or workload in the Kubernetes clusters (or VMs) is accompanied by an Envoy proxy, abstracting security and networking out of the business logic. As you can see in the image below, once the mesh is formed, network communication from the edge to workloads, among workloads, and between clusters is controlled by the Istio control plane. The Istio service mesh emits logs, metrics, and traces for each Envoy proxy, which is what makes unified observability possible. We still need a visualization tool like SkyWalking to collect that data and present it for granular observability.

Why SkyWalking for Observability
SREs from large companies such as Alibaba, Lenovo, ABInBev, and Baidu use Apache SkyWalking, and the common reasons are:
- SkyWalking aggregates logs, metrics, traces, and topology.
- It natively supports popular service mesh software like Istio. While other tools may not support getting data from Envoy sidecars, SkyWalking supports sidecar integration.
- It supports OpenTelemetry (OTel) standards for observability. These days, OTel standards and instrumentation are popular for metrics, logs, and traces.
- It supports observability-data collection from almost all elements of the full stack: database, OS, network, storage, and other infrastructure.
- It is open source and free (with an affordable enterprise version).
Now, let us see how to integrate Istio and Apache SkyWalking in your enterprise.

Steps To Integrate Istio and Apache SkyWalking
We have created a demo to establish the connection between the Istio data plane and SkyWalking, where SkyWalking collects data from the Envoy sidecars and populates it in observability dashboards. Note: By default, SkyWalking comes with predefined dashboards for Apache APISIX and AWS gateways. Since we are using the Istio gateway, it will not get a dedicated dashboard out of the box, but we will still get its metrics in other locations. If you want to watch the video, check out my latest Istio-SkyWalking configuration video. You can refer to the GitHub link here.

Step 1: Add kube-state-metrics To Collect Metrics From the Kubernetes API Server
We installed the kube-state-metrics service to listen to the Kubernetes API server and send those metrics to Apache SkyWalking. First, add the Prometheus community repo:
Shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
Then run helm repo update to fetch the latest chart index:
Shell
helm repo update
And now you can install kube-state-metrics:
Shell
helm install kube-state-metrics prometheus-community/kube-state-metrics

Step 2: Install SkyWalking Using Helm Charts
We will install SkyWalking version 9.2.0 for this observability demo. Run the following command to install SkyWalking into a namespace (my namespace is skywalking). You can refer to the values.yaml.
Shell
helm install skywalking oci://registry-1.docker.io/apache/skywalking-helm -f values.yaml -n skywalking
(Optional reading) In the Helm chart values.yaml, you will notice that:
- We have set the oap (Observability Analysis Platform, i.e., the back end) and ui flags to true.
- Similarly, for databases, we have enabled postgresql as true.
- For tracking metrics from Envoy access logs, we have configured the following environment variables (the two ALS analysis variables define the rules for analyzing Envoy access logs):
SW_ENVOY_METRIC: default
SW_ENVOY_METRIC_SERVICE: true
SW_ENVOY_METRIC_ALS_HTTP_ANALYSIS: k8s-mesh,mx-mesh,persistence
SW_ENVOY_METRIC_ALS_TCP_ANALYSIS: k8s-mesh,mx-mesh,persistence
- We have enabled the OpenTelemetry receiver and configured it to receive data in otlp format, along with the OTel rules that match the data we will send to SkyWalking. In a few moments (in Step 4), we will configure the OTel collector to scrape istiod, the Kubernetes cluster, kube-state-metrics, and the SkyWalking OAP itself, so we have enabled the appropriate rules:
SW_OTEL_RECEIVER: default
SW_OTEL_RECEIVER_ENABLED_HANDLERS: "otlp"
SW_OTEL_RECEIVER_ENABLED_OTEL_RULES: "istio-controlplane,k8s-cluster,k8s-node,k8s-service,oap"
SW_TELEMETRY: prometheus
SW_TELEMETRY_PROMETHEUS_HOST: 0.0.0.0
SW_TELEMETRY_PROMETHEUS_PORT: 1234
SW_TELEMETRY_PROMETHEUS_SSL_ENABLED: false
SW_TELEMETRY_PROMETHEUS_SSL_KEY_PATH: ""
SW_TELEMETRY_PROMETHEUS_SSL_CERT_CHAIN_PATH: ""
These settings instruct SkyWalking to collect data from the Istio control plane, the Kubernetes cluster, nodes, and services, as well as from the OAP (Observability Analysis Platform) itself. The SW_TELEMETRY settings enable SkyWalking OAP's self-observability: the OAP exposes Prometheus-compatible metrics on port 1234 with SSL disabled. Again, in Step 4, we will configure the OTel collector to scrape this endpoint.
- In the Helm chart, we have also enabled the creation of a service account for SkyWalking OAP.

Step 3: Set Up the Istio + SkyWalking Configuration
After that, we can install Istio using this IstioOperator configuration. In the IstioOperator configuration, we have set up meshConfig so that every sidecar enables the Envoy access log service, and both the access log service and the metrics service point at SkyWalking. Additionally, with the proxyStatsMatcher, we configure all metrics to be sent via the metrics service.
YAML
meshConfig:
  defaultConfig:
    envoyAccessLogService:
      address: "skywalking-skywalking-helm-oap.skywalking.svc:11800"
    envoyMetricsService:
      address: "skywalking-skywalking-helm-oap.skywalking.svc:11800"
    proxyStatsMatcher:
      inclusionRegexps:
        - .*
  enableEnvoyAccessLogService: true

Step 4: OpenTelemetry Collector
Once the Istio and SkyWalking configuration is done, we need to feed metrics from applications, gateways, nodes, etc., into SkyWalking. We have used the opentelemetry-collector.yaml to scrape the Prometheus-compatible endpoints. In the collector, we have specified that OpenTelemetry will scrape metrics from istiod, the Kubernetes cluster, kube-state-metrics, and SkyWalking itself.
We have also created a service account for OpenTelemetry. Using opentelemetry-serviceaccount.yaml, we have set up the service account and declared a ClusterRole and ClusterRoleBinding to define which actions the OpenTelemetry service account can take on the various resources in our Kubernetes cluster. Once you deploy opentelemetry-collector.yaml and opentelemetry-serviceaccount.yaml, data will flow into SkyWalking from Envoy, the Kubernetes cluster, kube-state-metrics, and the SkyWalking OAP itself.

Step 5: Observability of Kubernetes and Istio Resources in SkyWalking
To check the SkyWalking UI, port-forward the UI service to a local port (say, 8080) by running the following command:
Shell
kubectl port-forward svc/skywalking-skywalking-helm-ui -n skywalking 8080:80
You can then open the SkyWalking UI at localhost:8080. (Note: To generate load against the services and observe the behavior and performance of the apps, cluster, and Envoy proxies, check out the full video.)
Once you are in the SkyWalking UI (refer to the image below), select Service Mesh in the left-side menu and then choose Control Plane or Data Plane. SkyWalking displays the resource consumption and observability data of the Istio control plane and data plane, respectively. The data plane view provides information about all the Envoy proxies attached to services, and SkyWalking surfaces metrics, logs, and traces for each of them. Refer to the image below, where all the observability details are displayed for a single service proxy. SkyWalking also shows the resource consumption of Envoy proxies across namespaces. Similarly, SkyWalking provides all the observability data of the Istio control plane. Note that if you have multiple control planes in different namespaces (or in multiple clusters), you just need to point them at the SkyWalking OAP service. SkyWalking shows control-plane metrics such as the number of pilot pushes, ADS monitoring, and so on.
Apart from the Istio service mesh, we also configured SkyWalking to fetch information about the Kubernetes cluster. As you can see in the image below, SkyWalking provides a Kubernetes dashboard covering the number of nodes, pods, deployments, services, and containers, along with the resource utilization metrics of each Kubernetes resource in the same dashboard. SkyWalking thus provides holistic information about a Kubernetes cluster. Similarly, you can drill further down into a service in the Kubernetes cluster and get granular information about its behavior and performance (refer to the images below). To generate load against the services and see the behavior and performance of the apps, cluster, and Envoy proxies, check out the full video.

Benefits of the Istio-SkyWalking Integration
There are several benefits to integrating Istio and Apache SkyWalking for unified observability:
- Ensure 100% visibility of the technology stack, including apps, sidecars, network, database, OS, etc.
- Reduce the time to find the root cause (MTTR) of issues or anomalies in production by 90% through faster troubleshooting.
- Save approximately $2M of lifetime spend on closed-source solutions, complex pricing, and custom integrations.
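The walkthrough above wires Envoy, istiod, and cluster-level telemetry into SkyWalking; application-level metrics can ride the same path. As a small, hedged illustration (not part of the demo repo), the sketch below uses the Python prometheus_client library to expose a Prometheus-compatible /metrics endpoint that a collector scrape job, like the ones configured in Step 4, could pick up. The metric names, labels, and port are illustrative assumptions.
Python
# Sketch: expose app-level metrics in Prometheus format so the OpenTelemetry
# collector from Step 4 could scrape them alongside istiod and kube-state-metrics.
# Metric names, labels, and the port are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

ORDERS_TOTAL = Counter("orders_total", "Orders processed by this service", ["status"])
ORDER_LATENCY = Histogram("order_processing_seconds", "Time spent processing an order")

def process_order() -> None:
    with ORDER_LATENCY.time():                      # records processing duration
        time.sleep(random.uniform(0.01, 0.1))       # stand-in for real work
    status = "ok" if random.random() > 0.05 else "error"
    ORDERS_TOTAL.labels(status=status).inc()

if __name__ == "__main__":
    start_http_server(9464)                         # serves /metrics on :9464
    while True:
        process_order()

A corresponding scrape job would then be added to opentelemetry-collector.yaml alongside the existing istiod and kube-state-metrics jobs.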
Our industry is in the early days of an explosion in software using LLMs, as well as (separately, but relatedly) a revolution in how engineers write and run code, thanks to generative AI. Many software engineers are encountering LLMs for the very first time, while many ML engineers are being exposed directly to production systems for the very first time. Both types of engineers are finding themselves plunged into a disorienting new world — one where a particular flavor of production problem they may have encountered occasionally in their careers is now front and center. Namely, that LLMs are black boxes that produce nondeterministic outputs and cannot be debugged or tested using traditional software engineering techniques. Hooking these black boxes up to production introduces reliability and predictability problems that can be terrifying. It’s important to understand this, and why. 100% Debuggable? Maybe Not Software is traditionally assumed to be testable, debuggable, and reproducible, depending on the flexibility and maturity of your tooling and the complexity of your code. The original genius of computing was one of constraint; that by radically constraining language and mathematics to a defined set, we could create algorithms that would run over and over and always return the same result. In theory, all software is debuggable. However, there are lots of things that can chip away at that beauteous goal and make your software mathematically less than 100% debuggable, like: Adding concurrency and parallelism. Certain types of bugs. Stacking multiple layers of abstractions (e.g., containers). Randomness. Using JavaScript (HA HA). There is a much longer list of things that make software less than 100% debuggable in practice. Some of these things are related to cost/benefit tradeoffs, but most are about weak telemetry, instrumentation, and tooling. If you have only instrumented your software with metrics, for example, you have no way of verifying that a spike in api_requests and an identical spike in 503 errors are for the same events (i.e., you are getting a lot of api_requests returning 503) or for a disjoint set of events (the spike in api_requests is causing general congestion causing a spike in 503s across ALL events). It is mathematically impossible; all you can do is guess. But if you have a log line that emits both the request_path and the error_code, and a tool that lets you break down and group by arbitrary dimensions, this would be extremely easy to answer. Or if you emit a lot of events or wide log lines but cannot trace them, or determine what order things executed in, there will be lots of other questions you won’t be able to answer. There is another category of software errors that are logically possible to debug, but prohibitively expensive in practice. Any time you see a report from a big company that tracked down some obscure error in a kernel or an ethernet device, you’re looking at one of the rare entities with 1) enough traffic for these one in a billion errors to be meaningful, and 2) enough raw engineering power to dedicate to something most of us just have to live with. But software is typically understandable because we have given it structure and constraints. IF (); THEN (); ELSE () is testable and reproducible. Natural languages, on the other hand, are infinitely more expressive than programming languages, query languages, or even a UI that users interact with. 
The most common and repeated patterns may be fairly predictable, but the long tail your users will create is very long, and they expect meaningful results there, as well. For complex reasons that we won’t get into here, LLMs tend to have a lot of randomness in the long tail of possible results. So with software, if you ask the exact same question, you will always get the exact same answer. With LLMs, you might not. LLMs Are Their Own Beast Unit testing involves asserting predictable outputs for defined inputs, but this obviously cannot be done with LLMs. Instead, ML teams typically build evaluation systems to evaluate the effectiveness of the model or prompt. However, to get an effective evaluation system bootstrapped in the first place, you need quality data based on real use of an ML model. With software, you typically start with tests and graduate to production. With ML, you have to start with production to generate your tests. Even bootstrapping with early access programs or limited user testing can be problematic. It might be ok for launching a brand new feature, but it’s not good enough for a real production use case. Early access programs and user testing often fail to capture the full range of user behavior and potential edge cases that may arise in real-world usage when there are a wide range of users. All these programs do is delay the inevitable failures you’ll encounter when an uncontrolled and unprompted group of end users does things you never expected them to do. Instead of relying on an elaborate test harness to give you confidence in your software a priori, it’s a better idea to embrace a “ship to learn” mentality and release features earlier, then systematically learn from what is shipped and wrap that back into your evaluation system. And once you have a working evaluation set, you also need to figure out how quickly the result set is changing. Phillip gives this list of things to be aware of when building with LLMs: Failure will happen — it’s a question of when, not if. Users will do things you can’t possibly predict. You will ship a “bug fix” that breaks something else. You can’t really write unit tests for this (nor practice TDD). Latency is often unpredictable. Early access programs won’t help you. Sound at all familiar? Observability-Driven Development Is Necessary With LLMs Over the past decade or so, teams have increasingly come to grips with the reality that the only way to write good software at scale is by looping in production via observability — not by test-driven development, but observability-driven development. This means shipping sooner, observing the results, and wrapping your observations back into the development process. Modern applications are dramatically more complex than they were a decade ago. As systems get increasingly more complex, and nondeterministic outputs and emergent properties become the norm, the only way to understand them is by instrumenting the code and observing it in production. LLMs are simply on the far end of a spectrum that has become ever more unpredictable and unknowable. Observability — both as a practice and a set of tools — tames that complexity and allows you to understand and improve your applications. We have written a lot about what differentiates observability from monitoring and logging, but the most important bits are 1) the ability to gather and store telemetry as very wide events, ordered in time as traces, and 2) the ability to break down and group by any arbitrary, high-cardinality dimension. 
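To make the "wide events" idea concrete, here is a minimal sketch of emitting one wide, high-cardinality event per unit of work. The field names and the call_llm stub are illustrative assumptions, not a prescribed schema.
Python
# Sketch of the "wide event" idea: one structured record per unit of work,
# carrying high-cardinality fields so you can group by any of them later.
# Field names and the call_llm stub are illustrative assumptions.
import json
import time
import uuid

def call_llm(prompt: str) -> str:
    return "stubbed response"                       # stand-in for the real model call

def handle_request(user_id: str, request_path: str, prompt: str) -> None:
    event = {
        "trace_id": uuid.uuid4().hex,
        "timestamp": time.time(),
        "user_id": user_id,                         # high-cardinality, and that's fine
        "request_path": request_path,
        "prompt_chars": len(prompt),
        "error_code": None,
    }
    start = time.monotonic()
    try:
        response = call_llm(prompt)
        event["response_chars"] = len(response)
    except Exception as exc:                        # capture failures in the same record
        event["error_code"] = type(exc).__name__
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        print(json.dumps(event))                    # ship to your event store instead

Each record like this can later be sliced by any of its fields — user, path, error code, latency — rather than being locked into predefined metrics.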
This allows you to explore your data and group by frequency, input, or result. In the past, we used to warn developers that their software usage patterns were likely to be unpredictable and change over time; now we inform you that if you use LLMs, your data set is going to be unpredictable, and it will absolutely change over time, and you must have a way of gathering, aggregating, and exploring that data without locking it into predefined data structures. With good observability data, you can use that same data to feed back into your evaluation system and iterate on it in production. The first step is to use this data to evaluate the representativity of your production data set, which you can derive from the quantity and diversity of use cases. You can make a surprising amount of improvements to an LLM based product without even touching any prompt engineering, simply by examining user interactions, scoring the quality of the response, and acting on the correctable errors (mainly data model mismatches and parsing/validation checks). You can fix or handle for these manually in the code, which will also give you a bunch of test cases that your corrections actually work! These tests will not verify that a particular input always yields a correct final output, but they will verify that a correctable LLM output can indeed be corrected. You can go a long way in the realm of pure software, without reaching for prompt engineering. But ultimately, the only way to improve LLM-based software is by adjusting the prompt, scoring the quality of the responses (or relying on scores provided by end users), and readjusting accordingly. In other words, improving software that uses LLMs can only be done by observability and experimentation. Tweak the inputs, evaluate the outputs, and every now and again, consider your dataset for representivity drift. Software engineers who are used to boolean/discrete math and TDD now need to concern themselves with data quality, representivity, and probabilistic systems. ML engineers need to spend more time learning how to develop products and concern themselves with user interactions and business use cases. Everyone needs to think more holistically about business goals and product use cases. There’s no such thing as a LLM that gives good answers that don’t serve the business reason it exists, after all. So, What Do You Need to Get Started With LLMs? Do you need to hire a bunch of ML experts in order to start shipping LLM software? Not necessarily. You cannot (there aren’t enough of them), you should not (this is something everyone needs to learn), and you don’t want to (these are changes that will make software engineers categorically more effective at their jobs). Obviously, you will need ML expertise if your goal is to build something complex or ambitious, but entry-level LLM usage is well within the purview of most software engineers. It is definitely easier for software engineers to dabble in using LLMs than it is for ML engineers to dabble in writing production applications. But learning to write and maintain software in the manner of LLMs is going to transform your engineers and your engineering organizations. And not a minute too soon. The hardest part of software has always been running it, maintaining it, and understanding it — in other words, operating it. But this reality has been obscured for many years by the difficulty and complexity of writing software. 
We can’t help but notice the upfront cost of writing software, while the cost of operating it gets amortized over many years, people, and teams, which is why we have historically paid and valued software engineers who write code more than those who own and operate it. When people talk about the 10x engineer, everyone automatically assumes it means someone who churns out 10x as many lines of code, not someone who can operate 10x as much software. But generative AI is about to turn all of these assumptions upside down. All of a sudden, writing software is as easy as sneezing. Anyone can use ChatGPT or other tools to generate reams of code in seconds. But understanding it, owning it, operating it, extending and maintaining it... all of these are more challenging than ever, because in the past, the way most of us learned to understand software was by writing it. What can we possibly do to make sure our code makes sense and works, and is extendable and maintainable (and our code base is consistent and comprehensible) when we didn’t go through the process of writing it? Well, we are in the early days of figuring that out, too. If you’re an engineer who cares about your craft: Do code reviews. Follow coding standards and conventions. Write (or generate) tests for it. But ultimately, the only way you can know for sure whether or not it works is to ship it to production and watch what happens. This has always been true, by the way. It’s just more true now. If you’re an engineer adjusting to the brave new era: Take some of that time you used to spend writing lines of code and reinvest it back into understanding, shipping under controlled circumstances, and observing. This means instrumenting your code with intention, and inspecting its output. This means shipping as soon as possible into the production environment. This means using feature flags to decouple deploys from releases and gradually roll new functionality out in a controlled fashion. Invest in these — and other — guardrails to make the process of shipping software more safe, fine-grained, and controlled. Most of all, it means developing the habit of looking at your code in production, through the lens of your telemetry, and asking yourself: Does this do what I expected it to do? Does anything else look weird? Or maybe I should say “looking at your systems” instead of “looking at your code,” since people might confuse the latter with an admonition to “read the code.” The days when you could predict how your system would behave simply by reading lines of code are long, long gone. Software behaves in unpredictable, emergent ways, and the important part is observing your code as it’s running in production, while users are using it. Code in a buffer can tell you very little. This Future Is a Breath of Fresh Air This, for once, is not a future I am afraid of. It’s a future I cannot wait to see manifest. For years now, I’ve been giving talks on modern best practices for software engineering — developers owning their code in production, testing in production, observability-driven development, continuous delivery in a tight feedback loop, separating deploys from releases using feature flags. No one really disputes that life is better, code is better, and customers are happier when teams adopt these practices. Yet, only 11% of teams can deploy their code in less than a day, according to the DORA report. Only a tiny fraction of teams are operating in the way everybody agrees we all should! Why? 
The answers often boil down to organizational roadblocks, absurd security/compliance policies, or lack of buy-in/prioritizing. Saddest of all are the ones who say something like, “our team just isn’t that good” or “our people just aren’t that smart” or “that only works for world-class teams like the Googles of the world.” Completely false. Do you know what’s hard? Trying to build, run, and maintain software on a two month delivery cycle. Running with a tight feedback loop is so much easier. Just Do the Thing So how do teams get over this hump and prove to themselves that they can have nice things? In my experience, only one thing works: When someone joins the team who has seen it work before, has confidence in the team’s abilities, and is empowered to start making progress against those metrics (which they tend to try to do, because people who have tried writing code the modern way become extremely unwilling to go back to the bad old ways). And why is this relevant? I hypothesize that over the course of the next decade, developing with LLMs will stop being anything special, and will simply be one skill set of many, alongside mobile development, web development, etc. I bet most engineers will be writing code that interacts with an LLM. I bet it will become not quite as common as databases, but up there. And while they’re doing that, they will have to learn how to develop using short feedback loops, testing in production, observability-driven development, etc. And once they’ve tried it, they too may become extremely unwilling to go back. In other words, LLMs might ultimately be the Trojan Horse that drags software engineering teams into the modern era of development best practices. (We can hope.) In short, LLMs demand we modify our behavior and tooling in ways that will benefit even ordinary, deterministic software development. Ultimately, these changes are a gift to us all, and the sooner we embrace them, the better off we will be.
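As a postscript to the feature-flag practice mentioned above, here is a minimal sketch of decoupling deploy from release with a sticky, percentage-based rollout. The flag name, rollout percentage, and llm_summarize stub are illustrative assumptions; in practice you would likely reach for a feature-flag service rather than rolling your own.
Python
# Sketch: decouple deploy from release with a sticky, percentage-based flag.
# The flag name, rollout percentage, and llm_summarize stub are illustrative.
import hashlib

ROLLOUT_PERCENT = {"llm_summaries": 10}             # ship dark, then raise gradually

def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministically bucket each user into 0-99 so rollouts are sticky."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < ROLLOUT_PERCENT.get(flag, 0)

def llm_summarize(text: str) -> str:
    return "LLM summary placeholder"                # stand-in for the real model call

def summarize(user_id: str, text: str) -> str:
    if is_enabled("llm_summaries", user_id):
        return llm_summarize(text)                  # new path, observed in production
    return text[:280]                               # old deterministic fallback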
The ability to measure the internal states of a system by examining its outputs is called Observability. A system becomes 'observable' when it is possible to estimate the current state using only information from outputs, namely sensor data. You can use the data from Observability to identify and troubleshoot problems, optimize performance, and improve security. In the next few sections, we'll take a closer look at the three pillars of Observability: Metrics, Logs, and Traces.

What Is the Difference Between Observability and Monitoring?
'Observability wouldn't be possible without monitoring.' Monitoring is another term that closely relates to observability. The major difference between Monitoring and Observability is that the latter refers to the ability to gain insights into the internal workings of a system, while the former refers to the act of collecting data on system performance and behavior. In addition, Monitoring doesn't really concern itself with the end goal: it focuses on predefined metrics and thresholds to detect deviations from expected behavior. Observability aims to provide a deep understanding of system behavior, allowing exploration and discovery of unexpected issues. In terms of perspective and mindset, Monitoring adopts a "top-down" approach with predefined alerts based on known criteria, while Observability takes a "bottom-up" approach, encouraging open-ended exploration and adaptability to changing requirements.

Observability | Monitoring
Tells you why a system is at fault. | Notifies you that you have a system at fault.
Acts as a knowledge base to define what needs monitoring. | Focuses only on monitoring systems and detecting faults across them.
Focuses on giving context to data. | Focuses on data collection.
Gives a more complete assessment of the overall environment. | Keeps track of monitoring KPIs.
Is a traversable map. | Is a single plane.
Gives you complete information. | Gives you limited information.
Creates the potential to monitor different events. | Is the process of using Observability.

Monitoring detects anomalies and alerts you to potential problems. Observability, however, detects issues and helps you understand their root causes and underlying dynamics.

Three Pillars of Observability
Observability, built on the Three Pillars (Metrics, Logs, Traces), revolves around the core concept of "Events." Events are the fundamental units of monitoring and telemetry, each time-stamped and quantifiable. What distinguishes events is their context, especially in user interactions. For example, when a user clicks "Pay Now" on an eCommerce site, this action is an event expected to complete within seconds. In monitoring tools, "Significant Events" are key. They trigger:
- Automated Alerts: Notifying SREs or operations teams
- Diagnostic Tools: Enabling root-cause analysis
Imagine a server's disk nearing 99% capacity; that alone is significant, but understanding which applications and users are driving it is vital for effective action.

1. Metrics
Metrics serve as numeric indicators, offering insights into a system's health. While some metrics like CPU, memory, and disk usage are obvious system health indicators, numerous other critical metrics can uncover underlying issues. For instance, a gradual increase in OS handles can lead to a system slowdown, eventually necessitating a reboot for accessibility. Similar valuable metrics exist throughout the various layers of the modern IT infrastructure.
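As a hedged illustration of the OS-handle example above, the sketch below samples a process's open file descriptors and uses a naive least-squares trend to estimate when the count would hit a limit. The limit value and the use of psutil's num_fds() (Unix-only; Windows exposes num_handles() instead) are assumptions for illustration only.
Python
# Sketch: watch a slowly leaking resource (open file descriptors) and estimate
# when it would hit a limit. The limit is an assumption; psutil's num_fds() is
# Unix-only (Windows exposes num_handles() instead).
import time
import psutil

FD_LIMIT = 1024                                     # assumed soft limit for this process
samples: list[tuple[float, int]] = []

def record_sample() -> None:
    samples.append((time.time(), psutil.Process().num_fds()))

def seconds_until_limit() -> float | None:
    """Least-squares slope over the samples; None if the count isn't growing."""
    if len(samples) < 2:
        return None
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var if var else 0.0
    if slope <= 0:
        return None
    return (FD_LIMIT - samples[-1][1]) / slope

if __name__ == "__main__":
    for _ in range(5):
        record_sample()
        time.sleep(1)
    print(seconds_until_limit())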
Careful consideration is crucial when determining which metrics to continuously collect and how to analyze them effectively. This is where domain expertise plays a pivotal role. While most monitoring tools can detect evident issues, the best ones go further by providing insights into detecting and alerting complex problems. It's also essential to identify the subset of metrics that serve as proactive indicators of impending system problems. For instance, an OS handle leak rarely occurs abruptly. Tracking the gradual increase in the number of handles in use over time makes it possible to predict when the system might become unresponsive, allowing for proactive intervention.

Advantages of Metrics
- Quantitative and intuitive for setting alert thresholds
- Lightweight and cost-effective for storage
- Excellent for tracking trends and system changes
- Provides real-time component state data
- Constant overhead cost; not affected by data surges

Challenges of Metrics
- Limited insight into the "why" behind issues
- Lack context of individual interactions or events
- Risk of data loss in case of collection/storage failure
- Fixed interval collection may miss critical details
- Excessive sampling can impact performance and costs

2. Logs
Logs frequently contain intricate details about how an application processes requests. Unusual occurrences, such as exceptions, within these logs can signal potential issues within the application. It's a vital aspect of any observability solution to monitor these errors and exceptions in logs. Parsing logs can also reveal valuable insights into the application's performance. Logs often hold insights that may remain elusive when using APIs (Application Programming Interfaces) or querying application databases. Many Independent Software Vendors (ISVs) don't offer alternative methods to access the data available in logs. Therefore, an effective observability solution should enable log analysis and facilitate the capture of log data and its correlation with metric and trace data.

Advantages of Logs
- Easy to generate, typically timestamp + plain text
- Often require minimal integration by developers
- Most platforms offer standardized logging frameworks
- Human-readable, making them accessible
- Provide granular insights for retrospective analysis

Challenges of Logs
- Can generate large data volumes, leading to costs
- Impact on application performance, especially without asynchronous logging
- Retrospective use, not proactive
- Persistence challenges in modern architectures
- Risk of log loss in containers and auto-scaling environments

3. Traces
Tracing is a relatively recent development, especially suited to the complex nature of contemporary applications. It works by collecting information from different parts of the application and putting it together to show how a request moves through the system. A trace is represented as spans: span A is the root span, and span B is a child of span A. The primary advantage of tracing lies in its ability to deconstruct end-to-end latency and attribute it to specific tiers or components. While it can't tell you exactly why there's a problem, it's great for figuring out where to look.
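The parent/child relationship described above is easy to see in code. Below is a minimal sketch using the OpenTelemetry Python SDK, with a console exporter purely for illustration; the service and span names are assumptions, and a real deployment would export to a collector or tracing backend instead.
Python
# Sketch of the span relationship described above: span-a is the root and
# span-b is its child. Console exporter and names are for illustration only.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")       # illustrative service name

def charge_card() -> None:
    pass                                            # placeholder for the real work

def handle_checkout() -> None:
    with tracer.start_as_current_span("span-a"):      # root span
        with tracer.start_as_current_span("span-b"):  # child of span-a
            charge_card()

if __name__ == "__main__":
    handle_checkout()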
Advantages of Traces
- Ideal for pinpointing issues within a service
- Offers end-to-end visibility across multiple services
- Identifies performance bottlenecks effectively
- Aids debugging by recording request/response flows
- Provides contextual insights into system behavior

Challenges of Traces
- Limited ability to reveal long-term trends
- Complex systems may yield diverse trace paths
- Doesn't explain the cause of slow or failing spans (steps)
- Adds overhead, potentially impacting system performance

Integrating tracing used to be difficult, but with service meshes, it's now effortless. Service meshes handle tracing and stats collection at the proxy level, providing seamless observability across the entire mesh without requiring extra instrumentation from applications within it. Each of the components discussed above has its pros and cons, even though you might want to use them all.

Observability Tools
Observability tools gather and analyze data related to user experience, infrastructure, and network telemetry to proactively address potential issues, preventing any negative impact on critical business key performance indicators (KPIs).
Observability Survey Report 2023 – key findings
Some popular observability tooling options include:
- Prometheus: A leading open-source monitoring and alerting toolkit known for its scalability and support for multi-dimensional data collection.
- Grafana: A visualization and dashboarding platform often used with Prometheus, providing rich insights into system performance.
- Jaeger: An open-source distributed tracing system for monitoring and troubleshooting microservices-based architectures.
- Elasticsearch: A search and analytics engine that, when paired with Kibana and Beats, forms the ELK Stack for log management and analysis.
- Honeycomb: An event-driven observability tool that offers real-time insights into application behavior and performance.
- Datadog: A cloud-based observability platform that integrates logs, metrics, and traces, providing end-to-end visibility.
- New Relic: Offers application performance monitoring (APM) and infrastructure monitoring solutions to track and optimize application performance.
- Sysdig: Focused on container monitoring and security, Sysdig provides deep visibility into containerized applications.
- Zipkin: An open-source distributed tracing system for monitoring request flows and identifying latency bottlenecks.

Conclusion
Logs, metrics, and traces are essential Observability pillars that work together to provide a complete view of distributed systems. Incorporating them strategically, such as placing counters and logs at entry and exit points and using traces at decision junctures, enables effective debugging. Correlating these signals enhances our ability to navigate metrics, inspect request flows, and troubleshoot complex issues in distributed systems.
This is an article from DZone's 2023 Database Systems Trend Report.For more: Read the Report Hearing the vague statement, "We have a problem with the database," is a nightmare for any database manager or administrator. Sometimes it's true, sometimes it's not, and what exactly is the issue? Is there really a database problem? Or is it a problem with networking, an application, a user, or another possible scenario? If it is a database, what is wrong with it? Figure 1: DBMS usage Databases are a crucial part of modern businesses, and there are a variety of vendors and types to consider. Databases can be hosted in a data center, in the cloud, or in both for hybrid deployments. The data stored in a database can be used in various ways, including websites, applications, analytical platforms, etc. As a database administrator or manager, you want to be aware of the health and trends of your databases. Database monitoring is as crucial as databases themselves. How good is your data if you can't guarantee its availability and accuracy? Database Monitoring Considerations Database engines and databases are systems hosted on a complex IT infrastructure that consists of a variety of components: servers, networking, storage, cables, etc. Database monitoring should be approached holistically with consideration of all infrastructure components and database monitoring itself. Figure 2: Database monitoring clover Let's talk more about database monitoring. As seen in Figure 2, I'd combine monitoring into four pillars: availability, performance, activity, and compliance. These are broad but interconnected pillars with overlap. You can add a fifth "clover leaf" for security monitoring, but I include that aspect of monitoring into activity and compliance, for the same reason capacity planning falls into availability monitoring. Let's look deeper into monitoring concepts. While availability monitoring seems like a good starting topic, I will deliberately start with performance since performance issues may render a database unavailable and because availability monitoring is "monitoring 101" for any system. Performance Monitoring Performance monitoring is the process of capturing, analyzing, and alerting to performance metrics of hardware, OS, network, and database layers. It can help avoid unplanned downtimes, improve user experience, and help administrators manage their environments efficiently. Native Database Monitoring Most, if not all, enterprise-grade database systems come with a set of tools that allow database professionals to examine internal and/or external database conditions and the operational status. These are system-specific, technical tools that require SME knowledge. In most cases, they are point-in-time performance data with limited or non-existent historical value. Some vendors provide additional tools to simplify performance data collection and analysis. With an expansion of cloud-based offerings (PaaS or IaaS), I've noticed some improvements in monitoring data collection and the available analytics and reporting options. However, native performance monitoring is still a set of tools for a database SME. Enterprise Monitoring Systems Enterprise monitoring systems (EMSs) offer a centralized approach to keeping IT systems under systematic review. Such systems allow monitoring of most IT infrastructure components, thus consolidating supervised systems with a set of dashboards. There are several vendors offering comprehensive database monitoring systems to cover some or all your monitoring needs. 
Such solutions can cover multiple database engines or be specific to a particular database engine or a monitoring aspect. For instance, if you only need to monitor SQL servers and are interested in the performance of your queries, then you need a monitoring system that identifies bottlenecks and contentions. Let's discuss environments with thousands of database instances (on-premises and in a cloud) scattered across multiple data centers across the globe. This involves monitoring complexity growth with a number of monitored devices, database type diversity, and geographical locations of your data centers and actual data that you monitor. It is imperative to have a global view of all database systems under one management and an ability to identify issues, preferably before they impact your users. EMSs are designed to help organizations align database monitoring with IT infrastructure monitoring, and most solutions include an out-of-the-box set of dashboards, reports, graphs, alerts, useful tips, and health history and trends analytics. They also have pre-set industry-outlined thresholds for performance counters/metrics that should be adjusted to your specific conditions. Manageability and Administrative Overhead Native database monitoring is usually handled by a database administrator (DBA) team. If it needs to be automated, expanded, or have any other modifications, then DBA/development teams would handle that. This can be efficiently managed by DBAs in a large enterprise environment on a rudimental level for internal DBA specific use cases. Bringing in a third-party system (like an EMS) requires management. Hypothetically, a vendor has installed and configured monitoring for your company. That partnership can continue, or internal personnel can take over EMS management (with appropriate training). There is no "wrong" approach — it solely depends on your company's operating model and is assessed accordingly. Data Access and Audit Compliance Monitoring Your databases must be secure! Unauthorized access to sensitive data could be as harmful as data loss. Data breaches, malicious activities (intentional or not) — no company would be happy with such publicity. That brings us to audit compliance and data access monitoring. There are many laws and regulations around data compliance. Some are common between industries, some are industry-specific, and some are country-specific. For instance, SOX compliance is required for all public companies in numerous countries, and US healthcare must follow HIPAA regulations. Database management teams must implement a set of policies, procedures, and processes to enforce laws and regulations applicable to their company. Audit reporting could be a tedious and cumbersome process, but it can and should be automated. While implementing audit compliance and data access monitoring, you can improve your database audit reporting, as well — it's virtually the same data set. What do we need to monitor to comply with various laws and regulations? These are normally mandatory: Access changes and access attempts Settings and/or objects modifications Data modifications/access Database backups Who should be monitored? 
Usually, access to make changes to a database or data is strictly controlled: Privileged accounts – usually DBAs; ideally, they shouldn't be able to access data, but that is not always possible in their job so activity must be monitored Service accounts – either database or application service accounts with rights to modify objects or data "Power" accounts – users with rights to modify database objects or data "Lower" accounts – accounts with read-only activity As with performance monitoring, most database engines provide a set of auditing tools and mechanisms. Another option is third-party compliance software, which uses database-native auditing, logs, and tracing to capture compliance-related data. It provides audit data storage capabilities and, most importantly, a set of compliance reports and dashboards to adhere to a variety of compliance policies. Compliance complexity directly depends on regulations that apply to your company and the diversity and size of your database ecosystem. While we monitor access and compliance, we want to ensure that our data is not being misused. An adequate measure should be in place for when unauthorized access or abnormal data usage is detected. Some audit compliance monitoring systems provide means to block abnormal activities. Data Corruption and Threats Database data corruption is a serious issue that could lead to a permanent loss of valuable data. Commonly, data corruption occurs due to hardware failures, but it could be due to database bugs or even bad coding. Modern database engines have built-in capabilities to detect and sometimes prevent data corruption. Data corruption will generate an appropriate error code that should be monitored and highlighted. Checking database integrity should be a part of the periodical maintenance process. Other threats include intentional or unintentional data modification and ransomware. While data corruption and malicious data modification can be detected by DBAs, ransomware threats fall outside of the monitoring scope for database professionals. It is imperative to have a bulletproof backup to recover from those threats. Key Database Performance Metrics Database performance metrics are extremely important data points that measure the health of database systems and help database professionals maintain efficient support. Some of the metrics are specific to a database type or vendor, and I will generalize them as "internal counters." Availability The first step in monitoring is to determine if a device or resource is available. There is a thin line between system and database availability. A database could be up and running, but clients may not be able to access it. With that said, we need to monitor the following metrics: Network status – Can you reach the database over the network? If yes, what is the latency? While network status may not commonly fall into the direct responsibility of a DBA, database components have configuration parameters that might be responsible for a loss of connectivity. Server up/down Storage availability Service up/down – another shared area between database and OS support teams Whether the database is online or offline CPU, Memory, Storage, and Database Internal Metrics The next important set of server components which could, in essence, escalate into an availability issue are CPU, memory, and storage. 
The following four performance areas are tightly interconnected and affect each other:
- Lack of available memory
- High CPU utilization
- Storage latency or throughput bottlenecks
- Database internal counters, which can provide more context on utilization issues
For instance, a lack of memory may force a database engine to read and write data more frequently, creating contention on the IO system, and 100% CPU utilization can cause an entire database server to stop responding. Numerous database internal counters can help database professionals analyze usage trends and identify an appropriate action to mitigate potential impact.

Observability
Database observability is based on metrics, traces, and logs — the very data we have been discussing collecting above. A plethora of factors may affect system and application availability and customer experience, and database performance metrics are just a single set of possible failure points. Supporting the infrastructure underneath a database engine is complex. To successfully monitor a database, we need a clear picture of the entire ecosystem and the state of its components while monitoring. Relevant performance data collected from various components can be a tremendous help in identifying and addressing issues before they occur. The entire database monitoring concept is data driven, and it is our responsibility to make it work for us. Monitoring data needs to tell us a story that every consumer can understand. With database observability, this story can be transparent and provide a clear view of your database estate.

Balanced Monitoring
As you can gather from this article, there are many points of failure in any database environment. While database monitoring is the responsibility of database professionals, it is a collaborative effort of multiple teams to ensure that your entire IT ecosystem is operational. So what's considered "too much" monitoring, and when is it not enough? I will use DBAs' favorite phrase: it depends.
- Assess your environment – It would be helpful to have a configuration management database. If you don't, create a full inventory of your databases and corresponding applications: database sizes, number of users, maintenance schedules, utilization times — as many details as possible.
- Assess your critical systems – Outline your critical systems and relevant databases. Most likely, those will fall into a category of maximum monitoring: availability, performance, activity, and compliance.
- Assess your budget – It's not uncommon to have a tight cash flow allocated to IT operations. You may or may not have funds to purchase a "we-monitor-everything" system, and certain monitoring aspects would have to be developed internally.
- Find a middle ground – Your approach to database monitoring is unique to your company's requirements. Collecting monitoring data that has no practical or actionable applications is not efficient. Defining actionable KPIs for your database monitoring is key to finding a balance — monitor what your team can use to ensure system availability, stability, and satisfied customers.
Remember: Successful database monitoring is data-driven, proactive, continuous, actionable, and collaborative. The sketch below shows one simple, actionable signal to start from.
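Here is a hedged sketch of a minimal availability-and-latency probe. It assumes a PostgreSQL target and the psycopg2 driver purely for illustration — swap in your engine's client library — and the DSN is a placeholder, not a real host.
Python
# Sketch: a minimal availability-and-latency probe. Assumes a PostgreSQL target
# and the psycopg2 driver purely for illustration; the DSN is a placeholder.
import time
import psycopg2

DSN = "host=db.example.internal dbname=appdb user=monitor connect_timeout=3"

def probe() -> dict:
    result = {"timestamp": time.time(), "available": False, "latency_ms": None, "error": None}
    start = time.monotonic()
    try:
        conn = psycopg2.connect(DSN)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")             # cheapest possible round trip
                cur.fetchone()
        finally:
            conn.close()
        result["available"] = True
        result["latency_ms"] = round((time.monotonic() - start) * 1000, 2)
    except Exception as exc:                        # driver, auth, or network failures
        result["error"] = str(exc)
    return result

if __name__ == "__main__":
    print(probe())

The resulting records can be shipped to whatever monitoring or observability backend you already use.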
This is an article from DZone's 2023 Data Pipelines Trend Report.For more: Read the Report Organizations today rely on data to make decisions, innovate, and stay competitive. That data must be reliable and trustworthy to be useful. Many organizations are adopting a data observability culture that safeguards their data accuracy and health throughout its lifecycle. This culture involves putting in motion a series of practices that enable you and your organization to proactively identify and address issues, prevent potential disruptions, and optimize their data ecosystems. When you embrace data observability, you protect your valuable data assets and maximize their effectiveness. Understanding Data Observability "In a world deluged by irrelevant information, clarity is power.”- Yuval Noah Harari, 21 Lessons for the 21st Century, 2018 As Yuval Noah Harari puts it, data is an incredibly valuable asset today. As such, organizations must ensure that their data is accurate and dependable. This is where data observability comes in, but what is data observability exactly? Data observability is the means to ensure our data's health and accuracy, which means understanding how data is collected, stored, processed, and used, plus being able to discover and fix issues in real time. By doing so, we can optimize our system's effectiveness and reliability by identifying and addressing discrepancies while ensuring compliance with regulations like GDPR or CCPA. We can gather valuable insights that prevent errors from recurring in the future by taking such proactive measures. Why Is Data Observability Critical? Data reliability is vital. We live in an era where data underpins crucial decision-making processes, so we must safeguard it against inaccuracies and inconsistencies to ensure our information is trustworthy and precise. Data observability allows organizations to proactively identify and address issues before they can spread downstream, preventing potential disruptions and costly errors. One of the advantages of practicing data observability is that it'll ensure your data is reliable and trustworthy. This means continuously monitoring your data to avoid making decisions based on incomplete or incorrect information, giving you more confidence. Figure 1: The benefits of companies using analytics Data source: The Global State of Enterprise Analytics, 2020, MicroStrategy Analyzing your technology stack can also help you find inefficiencies and areas where resources are underutilized, saving you money. But incorporating automation tools into your data observability process is the cherry on top of the proverbial cake, making everything more efficient and streamlined. Data observability is a long-run approach to safeguarding the integrity of your data so that you can confidently harness its power, whether it's for informed decision-making, regulatory compliance, or operational efficiency. Advantages and Disadvantages of Data Observability When making decisions based on data, it's essential to be quick. But what if the data isn't dependable? That's where data observability comes in. However, like any tool, it has its advantages and disadvantages. IMPLEMENTING DATA OBSERVABILITY: ADVANTAGES AND DISADVANTAGES Advantages Disadvantages Trustworthy insights for intelligent decisions: Data observability provides decision-makers with reliable insights, ensuring well-informed choices in business strategy, product development, and resource allocation. 
Resource-intensive setup: Implementing data observability demands time and resources to set up tools and processes, but the long-term benefits justify the initial costs. Real-time issue prevention: Data observability acts as a vigilant guardian for your data, instantly detecting issues and averting potential emergencies, thus saving time and resources while maintaining data reliability. Computational overhead from continuous monitoring: Balancing real-time monitoring with computational resources is essential to optimize observability. Enhanced team alignment through shared insights: Data observability fosters collaboration by offering a unified platform for teams to gather, analyze, and act on data insights, facilitating effective communication and problem-solving. Training requirements for effective tool usage: Data observability tools require skill, necessitating ongoing training investments to harness their full potential. Accurate data for sustainable planning: Data observability establishes the foundation for sustainable growth by providing dependable data that's essential for long-term planning, including forecasting and risk assessment. Privacy compliance challenges: Maintaining data observability while adhering to strict privacy regulations like GDPR and CCPA can be intricate, requiring a delicate balance between data visibility and privacy compliance. Resource savings: Data observability allows you to improve how resources are allocated by identifying areas where your technology stack is inefficient or underutilized. As a result, you can save costs and prevent over-provisioning resources, leading to a more efficient and cost-effective data ecosystem. Integration complexities: Integrating data observability into existing data infrastructure may pose challenges due to compatibility issues and legacy systems, potentially necessitating investments in specific technologies and external expertise for seamless integration. Table 1 To sum up, data observability has both advantages and disadvantages, such as providing reliable data, detecting real-time problems, and enhancing teamwork. However, it requires significant time, resources, and training while respecting data privacy. Despite these challenges, organizations that adopt data observability are better prepared to succeed in today's data-driven world and beyond. Cultivating a Data-First Culture Data plays a crucial role in today's fast-paced and competitive business environment. It enables informed decision-making and drives innovation. To achieve this, it's essential to cultivate an environment that values data. This culture should prioritize accuracy, dependability, and consistent monitoring throughout the data's lifecycle. To ensure effective data observability, strong leadership is essential. Leaders should prioritize data from the top down, allocate necessary resources, and set a clear vision for a data-driven culture. This leadership fosters team collaboration and alignment, encouraging them to work together towards the same objectives. When teams collaborate in a supportive work environment, critical data is properly managed and utilized for the organization's benefit. Technical teams and business users must work together to create a culture that values data. Technical teams build the foundation of data infrastructure while business users access data to make decisions. Collaboration between these teams leads to valuable insights that drive business growth. 
Figure 2: Data generated, gathered, copied, and consumed. Data source: Amount of data created, consumed, and stored 2010-2020, with forecasts to 2025, 2023, Statista
By leveraging data observability, organizations can make informed decisions, address issues quickly, and optimize their data ecosystem for the benefit of all stakeholders. Nurturing Data Literacy and Accountability Promoting data literacy and accountability is not only about improving efficiency but also an ethical consideration. Assigning both ownership and accountability for data management empowers people to make informed decisions based on data insights, strengthens transparency, and upholds principles of responsibility and integrity, ensuring accuracy, security, and compliance with privacy regulations. A data-literate workforce is a safeguard, identifying instances where data may be misused or manipulated for unethical purposes.
Figure 3: The state of data responsibility and data ethics. Data source: Data and Analytics Leadership Annual Executive Survey 2023, NewVantage Partners
Overcoming Resistance To Change Incorporating observability practices is often a considerable challenge, and facing resistance from team members is not uncommon. However, you should confront these concerns and communicate clearly to promote a smooth transition. You can encourage adopting data-driven practices by highlighting the long-term advantages of better data quality and observability, which might inspire your coworkers to welcome changes. Showcasing real-life cases of positive outcomes, like higher revenue and customer satisfaction, can also help make a case. Implementing Data Observability Techniques You can keep your data pipelines reliable and of high quality by implementing data observability. This implementation involves using different techniques and features that will allow you to monitor and analyze your data. Those processes include data profiling, anomaly detection, lineage, and quality checks. These tools will give you a holistic view of your data pipelines, allowing you to monitor their health and quickly identify any issues or inconsistencies that could affect their performance. Essential Techniques for Successful Implementation To ensure the smooth operation of pipelines, you must establish a proper system for monitoring, troubleshooting, and maintaining data. Employing effective strategies can help achieve this goal. Let's review some key techniques to consider. Connectivity and Integration For optimal data observability, your tools must integrate smoothly with your existing data stack. This integration should not require major modifications to your pipelines, data warehouses, or processing frameworks. This approach allows for an easy deployment of the tools without disrupting your current workflows. Data Monitoring at Rest Observability tools should be able to monitor data while it's at rest without needing to extract it from its current storage location. This method ensures that the monitoring process doesn't affect the speed of your data pipelines and is cost-effective. Moreover, this approach makes your data safer as it doesn't require extraction. Automated Anomaly Detection Automated anomaly detection is an important component of data observability.
Through machine learning models, patterns and behaviors in data are identified; this enables alerts to be sent when unexpected deviations occur, reducing the number of false positives and alleviating the workload of data engineers who would otherwise have to manage complex monitoring rules (a minimal sketch of this idea follows at the end of this section). Dynamic Resource Identification Data observability tools give you complete visibility into your data ecosystem. These tools should automatically detect important resources, dependencies, and invariants. They should be flexible enough to adapt to changes in your data environment, giving you insights into vital components without constant manual updates and making data observability extensive and easy to configure. Comprehensive Contextual Information For effective troubleshooting and communication, data observability needs to provide comprehensive contextual information. This information should cover data assets, dependencies, and reasons behind any data gaps or issues. Having the full context will allow data teams to identify and resolve any reliability concerns quickly. Preventative Measures Data observability goes beyond monitoring data assets and offers preventive measures to avoid potential issues. With insights into your data and suggestions for responsible alterations or revisions, you can proactively address problems before they affect data pipelines. This approach leads to greater efficiency and time savings in the long run. If you need to keep tabs on data, it can be tough to ensure everything is covered. Only using batch and stream processing frameworks isn't enough. That's why it's often best to use a tool specifically made for this purpose. You could use a data platform, add observability to your existing data warehouse, or opt for open-source tools. Each of these options has its own advantages and disadvantages: Use a data platform – Data platforms are designed to manage all of your organization's data in one place and grant access to that data through APIs instead of via the platform itself. There are many benefits to using a data platform, including speed, easy access to all your organization's information, flexible deployment options, and increased security. Additionally, many platforms include built-in capabilities for data observability, so you can ensure your databases perform well without having to implement an additional solution. Build data observability into your existing platform – If your organization only uses one application or tool to manage its data, this approach is probably the best for you, provided it includes an observability function. Incorporating data observability into your current setup is a must-have if you manage complex data stored in multiple sources, thus improving the reliability of your data flow cycle. Balancing Automation and Human Oversight
Figure 4: Balancing automation and human oversight
Automation is a key component of data observability, but it's important to strike a balance between automation and human oversight. Automation can take care of routine tasks, while human expertise is necessary for critical decisions and ensuring data quality. Implementing data observability techniques involves seamless integration, automated anomaly detection, dynamic resource identification, and comprehensive contextual information. Balancing automation and human oversight is important for efficient and effective data observability, resulting in more reliable data pipelines and improved decision-making capabilities.
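As a small illustration of the automated anomaly detection described in this section, the sketch below flags a daily row count that deviates sharply from its recent history using a simple z-score. The class name, the three-sigma threshold, and the sample values are illustrative assumptions on my part; real observability tools rely on far richer models that account for seasonality and trends.
Java
import java.util.List;

public final class RowCountAnomalyCheck {
    private RowCountAnomalyCheck() {}

    // Returns true when today's row count deviates more than `threshold`
    // standard deviations from the mean of the recent history.
    public static boolean isAnomalous(List<Long> history, long today, double threshold) {
        double mean = history.stream().mapToLong(Long::longValue).average().orElse(today);
        double variance = history.stream()
                .mapToDouble(v -> (v - mean) * (v - mean))
                .average()
                .orElse(0.0);
        double stdDev = Math.sqrt(variance);
        if (stdDev == 0.0) {
            return today != (long) mean; // flat history: any change is suspicious
        }
        return Math.abs(today - mean) / stdDev > threshold;
    }

    public static void main(String[] args) {
        List<Long> history = List.of(980_000L, 1_010_000L, 995_000L, 1_002_000L, 990_000L);
        System.out.println(isAnomalous(history, 400_000L, 3.0));   // true: likely a broken pipeline
        System.out.println(isAnomalous(history, 1_005_000L, 3.0)); // false: within normal variation
    }
}
In practice, a check like this would run per table or per partition and feed an alerting pipeline rather than print to the console.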
Conclusion Data observability empowers organizations to thrive in a world where data fuels decision-making by ensuring data's accuracy, reliability, and trustworthiness. We can start by cultivating a culture that values data integrity, collaboration between technical and business teams, and a commitment to nurturing data literacy and accountability. You will also need a strong data observability framework to monitor your data pipelines effectively. This includes a set of techniques that will help identify issues early and optimize your data ecosystems. But automated processes aren't enough, and we must balance our reliance on automation with human oversight, recognizing that while automation streamlines routine tasks, human expertise remains invaluable for critical decisions and maintaining data quality. With data observability, data integrity is safeguarded, and its full potential is unlocked — leading to innovation, efficiency, and success.
As previously mentioned, last week I was on-site at the PromCon EU 2023 event for two days in Berlin, Germany. This is a community-organized event focused on the technology and implementations around the open-source Prometheus project, including, for example, PromQL and PromLens. Below you'll find an overview covering insights into the talks given, often with a short recap if you don't want to browse the details. Along with the talks, it was invaluable to have the common discussions and chats that happen between talks in the breaks, where you can connect with core maintainers of various aspects of the Prometheus project. Be sure to keep an eye on the event video playlist, as all sessions were recorded and will appear there. Let's dive right in and see what the event had to offer this year in Berlin. This overview will be my impressions of each day of the event, but not all the sessions will be covered. Let's start with a short overview of the insights taken after sessions, chats, and the social event: OpenTelemetry interoperability (in all flavors) is the hot topic of the year. Native Histograms were a big topic over the last two years; this year they showed up as promising here and there but were not a major focus of the talks. The Perses dashboard and visualization project presented its Alpha release as a truly open-source project based on the Apache 2.0 license. By my count, there were ~150 attendees, and they also live-streamed all talks/lightning talks, which will also be made available on their YouTube channel post-event. Day 1 The day started with a lovely walk through the center of Berlin to the venue located on the Spree River. The event opened and jumped right into the following series of talks (insights provided inline): What's New in Prometheus and Its Ecosystem
Native Histograms - Efficiency and more details. Documentation note on prometheus.io: "...Native histograms (added as an experimental feature in Prometheus v2.40). Once native histograms are closer to becoming a stable feature, this document will be thoroughly updated."
stringlabels - Storing labels differently for significant memory reduction
keep_firing_for field added to alerting rules - How long an alert will continue firing after the condition has occurred
scrape_config_files - Split Prometheus scrape configs into multiple files, avoiding a single big config file
OTLP receiver (v2.47) - Experimental support for receiving OTLP metrics
SNMP Exporter (v0.24) - Breaking changes: new configuration format; splits connection settings from metrics details, simpler to change. Also added the ability to query multiple modules in a single scrape.
MySQLd Exporter (v0.15) - Multi-target support; use a single exporter to monitor multiple MySQL-like servers
Java client (v1.0.0) - client_java with OpenTelemetry metrics and tracing support, Native Histograms
Alertmanager - New receivers: MS Teams, Discord, Webex
Windows Exporter - Now an official exporter; was delayed due to licensing but is in the final stages now
Every Tuesday, Prometheus meets for Bug Scrub at 11:00 UTC. Calendar: https://prometheus.io/community. What's Coming: a new Alertmanager UI, metadata improvements, exemplar improvements, and Remote Write v2. Perses: The CNCF Candidate for Observability Visualization Summary An announcement was given of the Alpha launch of the Perses dashboard and visualization project with GitOps compatibility - purpose-built for observability data; a truly open-source alternative with the Apache 2.0 license.
Perses was born because the CNCF landscape was missing a visualization tooling project:
Perses - An exploration of a standard dashboard format
Chronosphere, Red Hat, and Amadeus are listed as founding members
GitOps friendly, static validation, Kubernetes support; you can use the Perses binary in your development environment
Chronosphere supported its development, and Red Hat is integrating the Perses package into the OpenShift Console. There is an exploration of its usage with Prometheus/PromLens.
Currently only metrics are displayed; Red Hat is working on integrating tracing with OpenTelemetry, and logs are on the future wishlist.
Feature details presented for the development of dashboards
Includes Grafana migration tooling
I was chatting with core maintainer Augustin Husson after the talk, and they are interested in submitting Perses as an applicant for CNCF Sandbox status. Towards Making Prometheus OpenTelemetry Native Summary OpenTelemetry protocol (OTLP) support in Prometheus for metrics ingestion is experimental. Details on the Effort
OTLP ingestion is there experimentally. The experience with target_info is a big pain point at the moment.
Takes about half the bandwidth of remote write, 30-40% more CPU due to gzip
New Arrow-based OTLP protocol promises half the bandwidth again at half the CPU cost; may inspire Prometheus remote write 2.0. GitHub milestone to track
Thinking about using collector remote config to solve "split configuration" between the Prometheus server and OpenTelemetry clients
Planet Scale Monitoring: Handling Billions of Active Series With Prometheus and Thanos Summary Shopify states they are running "highly scalable globally distributed and highly dynamic" cloud infrastructure, so they are on "Planet Scale" with Prometheus. Details on the Effort
Huge Ruby shop, latency-sensitive, large scaling events around the retail cycle and flash sales
HPA struggles with scaling up quickly enough
Using StatsD to get around Ruby/Python/PHP-specific limitations on shared counters
Backend is Thanos-based, but they have added a lot on top of it (custom work)
Have a custom operator to scale Prometheus agents by scraping the targets and seeing how many time series they have (including redistribution)
Have a router layer on top of Thanos to decouple ingestion and storage; sounds like they're evolving into a Mimir-like setup
Split the query layer into two deployments: one for short-term queries and one for longer-term queries
Team- and service-centric UI for alerting, integrated with SLO tracking
Native histograms solved cardinality challenges and, combined with Thanos' distributed querier, made very high cardinality queries work; as they stated, "This changed the game for us."
When migrating from the previous observability vendor, they decided not to convert dashboards; instead, they worked with developers to build new, cleaner ones.
Developers are not scoping queries well, so most fan out to all regional stores, but performance on empty responses is satisfactory, so it's not a big issue.
Lightning Talks Summary It's always fun to end the day with a quick series of talks that are ad-hoc collected from the attendees. Below is a list of ones I thought were interesting as well as a short summary, should you want to find them in the recordings:
AlertManager UI: Alertmanager will get a new UI in React.
Elm didn't get traction as a common language; they are considering alternatives to Bootstrap.
Implementing integrals with Prometheus and Grafana: Integrals in PromQL - the inverse of rates; a pure-PromQL version of their delta counters, using sum_over_time and Grafana variables to simplify getting all the right factors.
Metrics have a DX Problem: Looking at how to do developer-focused metrics from the IDE using the autometrics-dev project on GitHub; a framework for instrumenting by function, with IDE integration to explore prod metrics; interesting idea to integrate this deeply.
Day 2 After the morning walk through the center of Berlin, day two provided us with some interesting material (insights provided inline): Taming the Tsunami: Low Latency Ingestion of Push-Based Metrics in Prometheus Summary Overview of the metrics story at Shopify, with over 1k teams running it:
Originally forwarding metrics "from observability vendor agent"
Issues because that was multiplying the cardinality across exporter instances; same with the sidecar model
Built a StatsD protocol-aware load balancer
Running as a sidecar also had ownership issues, stating, "We would be on call for every application"
DaemonSet deployment meant resource usage and hot-spotting concerns; also cardinality, but at a lower level
Didn't want per-instance metrics because of cardinality, and metrics are more domain-level
Roughly one exporter per 50-100 nodes
Load balancer sanitizes label values and drops labels
Pre-aggregation on short time scales to deal with "hot loop instrumentation"; resulted in roughly 20x reduction in bandwidth use
Compensating for the lack of per-instance metrics by looking at infrastructure metrics (KSM, cAdvisor)
"We have close to a thousand teams right now"
Prometheus Java Client 1.0.0 Summary V1.0.0 was released last week. This talk was an overview of some of their updates featuring native histograms and OpenTelemetry support.
Rewrote the underlying model, so there are breaking changes; a migration module is available for Prometheus simpleclient metrics. JavaDoc can be found here.
Almost as simple as updating imports in your Java app; I'm going to update my workshop Java example for instrumentation to the new API (see the sketch at the end of this recap)
Includes good examples in the project
Exposes native + classic histograms by default, scraper's choice
A lot more configurations available as Java properties
Callback metrics (this is great for writing exporters)
OTel push support (on a configurable interval)
Allows standard OTel names (with dots), automatically replaces dots with underscores for the Prometheus format
Integrates with the OTel tracing client to make exemplars work - picks exemplars from the tracing context and extends the tracing context to mark that trace so it does not get sampled away
Despite supporting OTel, this is still a performance-minded client library
All metric types support concurrent updates
Dropped Pushgateway support for now, but will port it forward
Once the JMX exporter is updated, you get these changes as a side effect
Not aiming to become a full OTel library, only future-proofing your instrumentation; more lightweight and performance-focused
Lightning Talks Summary Again, here is a list of lightning talks I thought were interesting from the final day and a short summary, should you want to find them in the recordings:
Tracking object storage costs: Trying to measure object storage costs, as they are the number 2 cost in their cloud bills; built a Prometheus price exporter. Object storage cost is ~half of Grafana's cloud bill; varies by customer (can be as low as 2%). Trick for extending sparse metrics with zeroes: or on() vector(0). They have a prices exporter in the works; promised to open source it.
Prom operator - what's next? Tour of some more features coming in the Prometheus operator: shard autoscaling, scrape classes, support for Kubernetes events, and Prometheus-agent deployment as a DaemonSet.
Prometheus adoption stats: 868k users in 2023 (up from 774k last year), based on Grafana instances which have at least one Prometheus data source enabled.
Final Impressions Final impressions of this event left me for the second straight year with the feeling that the attendees were both passionate and knowledgeable about the metrics monitoring tooling around the Prometheus ecosystem. This event did not really have "getting started" sessions. Most of it assumes you are coming for in-depth dives into the various elements of the Prometheus project, almost giving you glimpses into the research progress behind features being improved in the coming versions of Prometheus. It remains well worth your time if you are active in the monitoring world, even if you are not using open source or Prometheus: you will gain insights into the status of features in the monitoring world.
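To give a feel for the new API mentioned in the Java client session above, here is a minimal sketch of what instrumentation with client_java 1.0.0 might look like. The builder-style calls reflect my reading of the 1.0.0 documentation, and the metric name and port are purely illustrative; treat this as a starting point rather than a definitive example.
Java
import io.prometheus.metrics.core.metrics.Counter;
import io.prometheus.metrics.exporter.httpserver.HTTPServer;

public class MetricsDemo {
    public static void main(String[] args) throws Exception {
        // Register a counter with the default registry using the builder-style 1.0.0 API.
        Counter requests = Counter.builder()
                .name("demo_requests_total")   // illustrative metric name
                .help("Total demo requests handled")
                .labelNames("path")
                .register();

        requests.labelValues("/hello").inc();

        // Expose the metrics in Prometheus exposition format.
        HTTPServer server = HTTPServer.builder()
                .port(9400)                    // illustrative port
                .buildAndStart();
        System.out.println("Metrics available at http://localhost:9400/metrics");
        Thread.currentThread().join();         // keep the demo process alive
    }
}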
This data warehousing use case is about scale. The user is China Unicom, one of the world's biggest telecommunication service providers. Using Apache Doris, they deploy multiple petabyte-scale clusters on dozens of machines to support the 15 billion daily log additions from their over 30 business lines. Such a gigantic log analysis system is part of their cybersecurity management. For real-time monitoring, threat tracing, and alerting, they require a log analytic system that can automatically collect, store, analyze, and visualize logs and event records. From an architectural perspective, the system should be able to undertake real-time analysis of various formats of logs and, of course, be scalable to support the huge and ever-enlarging data size. The rest of this article is about what their log processing architecture looks like and how they realize stable data ingestion, low-cost storage, and quick queries with it. System Architecture This is an overview of their data pipeline. The logs are collected into the data warehouse and go through several layers of processing. ODS: Original logs and alerts from all sources are gathered into Apache Kafka. Meanwhile, a copy of them will be stored in HDFS for data verification or replay. DWD: This is where the fact tables are. Apache Flink cleans, standardizes, backfills, and de-identifies the data, and writes it back to Kafka. These fact tables will also be put into Apache Doris, so that Doris can trace a certain item or use them for dashboarding and reporting. As duplication is acceptable for logs, the fact tables will be arranged in the Duplicate Key model of Apache Doris. DWS: This layer aggregates data from DWD and lays the foundation for queries and analysis. ADS: In this layer, Apache Doris auto-aggregates data with its Aggregate Key model and auto-updates data with its Unique Key model. Architecture 2.0 evolves from Architecture 1.0, which was supported by ClickHouse and Apache Hive. The transition arose from the user's needs for real-time data processing and multi-table join queries. In their experience with ClickHouse, they found inadequate support for concurrency and multi-table joins, manifested by frequent timeouts in dashboarding and OOM errors in distributed joins. Now let's take a look at their practice in data ingestion, storage, and queries with Architecture 2.0. Real-Case Practice Stable Ingestion of 15 Billion Logs Per Day In the user's case, their business churns out 15 billion logs every day. Ingesting such a data volume quickly and stably is a real problem. With Apache Doris, the recommended way is to use the Flink-Doris-Connector. It is developed by the Apache Doris community for large-scale data writing. The component requires simple configuration. It implements Stream Load and can reach a writing speed of 200,000~300,000 logs per second without interrupting the data analytic workloads. A lesson learned is that when using Flink for high-frequency writing, you need to find the right parameter configuration for your case to avoid data version accumulation. In this case, the user made the following optimizations: Flink Checkpoint: They increased the checkpoint interval from 15s to 60s to reduce writing frequency and the number of transactions processed by Doris per unit of time. This can relieve data writing pressure and avoid generating too many data versions (a configuration sketch follows at the end of this article).
Data Pre-Aggregation: For data that has the same ID but comes from various tables, Flink will pre-aggregate it based on the primary key ID and create a flat table, in order to avoid excessive resource consumption caused by multi-source data writing. Doris Compaction: The trick here includes finding the right Doris backend (BE) parameters to allocate the right amount of CPU resources for data compaction, setting the appropriate number of data partitions, buckets, and replicas (too many data tablets bring huge overhead), and dialing up max_tablet_version_num to avoid version accumulation. These measures together ensure daily ingestion stability. The user has witnessed stable performance and a low compaction score in the Doris backend. In addition, the combination of data pre-processing in Flink and the Unique Key model in Doris can ensure quicker data updates. Storage Strategies to Reduce Costs by 50% The size and generation rate of logs also impose pressure on storage. Among the immense log data, only a part of it is of high informational value, so storage should be differentiated. The user has three storage strategies to reduce costs. ZSTD (ZStandard) compression algorithm: For tables larger than 1TB, the user specifies the compression method as "ZSTD" upon table creation, which achieves a compression ratio of about 10:1. Tiered storage of hot and cold data: This is supported by a new feature of Doris. The user sets a data "cooldown" period of 7 days. That means data from the past 7 days (namely, hot data) will be stored on SSD. As hot data "cools down" (gets older than 7 days), it is automatically moved to HDD, which is less expensive. As data gets even "colder," it will be moved to object storage for much lower storage costs. Plus, in object storage, data will be stored with only one copy instead of three. This further cuts down costs and the overheads brought by redundant storage. Differentiated replica numbers for different data partitions: The user has partitioned their data by time range. The principle is to have more replicas for newer data partitions and fewer for the older ones. In their case, data from the past 3 months is frequently accessed, so they keep 2 replicas for this partition. Data that is 3~6 months old has two replicas, and data from more than 6 months ago has one single copy. With these three strategies, the user has reduced their storage costs by 50%. Differentiated Query Strategies Based on Data Size Some logs must be immediately traced and located, such as those of abnormal events or failures. To ensure real-time response to these queries, the user has different query strategies for different data sizes: Less than 100G: The user utilizes the dynamic partitioning feature of Doris. Small tables will be partitioned by date and large tables will be partitioned by hour. This can avoid data skew. To further ensure balance of data within a partition, they use the snowflake ID as the bucketing field. They also set a starting offset of 20 days, which means data of the recent 20 days will be kept. In this way, they find the balance point between data backlog and analytic needs. 100G~1T: These tables have their materialized views, which are the pre-computed result sets stored in Doris. Thus, queries on these tables will be much faster and less resource-consuming. The DDL syntax of materialized views in Doris is the same as that in PostgreSQL and Oracle. More than 100T: These tables are put into the Aggregate Key model of Apache Doris and pre-aggregated.
In this way, we enable queries of 2 billion log records to be done in 1~2s. These strategies have shortened the response time of queries. For example, a query of a specific data item used to take minutes, but now it can be finished in milliseconds. In addition, for big tables that contain 10 billion data records, queries on different dimensions can all be done in a few seconds. Ongoing Plans The user is now testing with the newly added inverted index in Apache Doris. It is designed to speed up full-text search of strings as well as equivalence and range queries of numerics and datetime. They have also provided their valuable feedback about the auto-bucketing logic in Doris: Currently, Doris decides the number of buckets for a partition based on the data size of the previous partition. The problem for the user is, most of their new data comes in during daytime, but little at nights. So in their case, Doris creates too many buckets for night data but too few in daylight, which is the opposite of what they need. They hope to add a new auto-bucketing logic, where the reference for Doris to decide the number of buckets is the data size and distribution of the previous day. They've come to the Apache Doris community and we are now working on this optimization.
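As a concrete illustration of the first ingestion optimization above (raising the Flink checkpoint interval from 15s to 60s), here is a minimal Java sketch of how such a setting might be applied to a Flink streaming job. The helper class and the 30-second minimum pause are illustrative assumptions, not the user's actual configuration.
Java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public final class CheckpointTuning {
    private CheckpointTuning() {}

    // Applies the checkpoint settings discussed above to a Flink job environment.
    public static void apply(StreamExecutionEnvironment env) {
        // Checkpoint every 60s instead of 15s to cut the number of transactions
        // (and data versions) Doris has to process per unit of time.
        env.enableCheckpointing(60_000L);
        // Leave some breathing room between consecutive checkpoints.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000L);
    }
}
Fewer, larger write batches mean fewer data versions for Doris to compact, which is exactly the pressure the checkpoint interval change is meant to relieve.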
Your team celebrates a success story: a trace identified a pesky latency issue in the application's authentication service, a fix was swiftly implemented, and everyone celebrated a quick win at the next team meeting. But the celebration is short-lived. Just days later, user complaints surge about a related payment gateway timeout. It turns out that the fix did improve performance at one point, but it created a situation in which key information was never cached. Other parts of the software react badly to the fix, and the whole thing has to be reverted. While the initial trace provided valuable insights into the authentication service, it didn't explain why the system was built in this way. Relying solely on a single trace has given us a partial view of a broader problem. This scenario underscores a crucial point: while individual traces are invaluable, their true potential is unlocked only when they are viewed collectively and in context. Let's delve deeper into why a single trace might not be the silver bullet we often hope for and how a more holistic approach to trace analysis can paint a clearer picture of our system's health and the way to combat problems. The Limiting Factor The first problem is the narrow perspective. Imagine debugging a multi-threaded Java application. If you were to focus only on the behavior of one thread, you might miss how it interacts with others, potentially overlooking deadlocks or race conditions. Let's say a trace reveals that a particular method, fetchUserData(), is taking longer than expected. By optimizing only this method, you might miss that the real issue is with the synchronized block in another related method, causing thread contention and slowing down the entire system. Temporal blindness is the second problem. Think of a Java Garbage Collection (GC) log. A single GC event might show a minor pause, but without observing it over time, you won't notice if there's a pattern of increasing pause times indicating a potential memory leak. A trace might show that a Java application's response time spiked at 2 PM. However, without looking at traces over a longer period, you might miss that this spike happens daily, possibly due to a scheduled task or a cron job that's putting undue stress on the system. The last problem is related and concerns context. Imagine analyzing the performance of a Java method without knowing the volume of data it's processing. A method might seem inefficient, but perhaps it's processing a significantly larger dataset than usual. A single trace might show that a Java method, processOrders(), took 5 seconds to execute. However, without context, you wouldn't know if it was processing 50 orders or 5,000 orders in that time frame. Another trace might reveal that a related method, fetchOrdersFromDatabase(), is retrieving an unusually large batch of orders due to a backlog, thus providing context to the initial trace. Strength in Numbers Think of traces as chapters in a book and metrics as the book's summary. While each chapter (trace) provides detailed insights, the summary (metrics) gives an overarching view. Reading chapters in isolation might lead to missing the plot, but when read in sequence and in tandem with the summary, the story becomes clear. We need this holistic view. If individual traces show that certain Java methods like processTransaction() are occasionally slow, grouped traces might reveal that these slowdowns happen concurrently, pointing to a systemic issue.
Metrics, on the other hand, might show a spike in CPU usage during these times, indicating that the system might be CPU-bound during high transaction loads. This helps us distinguish between correlation and causation. Grouped traces might show that every time the fetchFromDatabase() method is slow, the updateCache() method also lags. While this indicates a correlation, metrics might reveal that cache misses (a specific metric) increase during these times, suggesting that database slowdowns might be causing cache update delays, establishing causation. This is especially important in performance tuning. Grouped traces might show that the handleRequest() method's performance has been improving over several releases. Metrics can complement this by showing a decreasing trend in response times and error rates, confirming that recent code optimizations are having a positive impact. I wrote about this extensively in a previous post about the Tong motion needed to isolate an issue. This motion can be accomplished purely through the use of observability tools such as traces, metrics, and logs. Example Observability is somewhat resistant to examples. Everything I try to come up with feels a bit synthetic and unrealistic when I examine it after the fact. Having said that, I looked at my modified version of the venerable Spring Pet Clinic demo using digma.ai. Running it showed several interesting approaches taken by Digma. Probably the most interesting feature is the ability to look at what's going on in the server at this moment. This is an amazing exploratory tool that provides a holistic view of a moment in time. But the thing I want to focus on is the "Insights" column on the right. Digma tries to combine the separate traces into a coherent narrative. It's not bad at it, but it's still a machine. Some of that work should probably still be done manually since it can't understand the why, only the what. It seems it can detect the well-known Spring N+1 problem seamlessly. But this is only the start. One of my favorite things is the ability to look at tracing data next to a histogram and a list of errors in a single view. Is performance impacted because there are errors? How impactful is the performance on the rest of the application? These become questions with easy answers when we see all the different aspects laid out together. Magical APIs The N+1 problem I mentioned before is a common bug in the Java Persistence API (JPA). The great Vlad Mihalcea has an excellent explanation. The TL;DR is rather simple. We write a simple database query using ORM. But we accidentally split the transaction, causing the data to be fetched N+1 times, where N is the number of records we fetch. This is painfully easy to do since transactions are so seamless in JPA. This is the biggest problem in "magical" APIs like JPA. These are APIs that do so much that they feel like magic, but under the hood, they still run regular old code. When that code fails, it's very hard to see what goes on. Observability is one of the best ways to understand why these things fail. In the past, I used to reach for the profiler for such things, which would often entail a lot of work. Getting the right synthetic environment for running a profiling session is often very challenging. Observability lets us do that without the hassle. Final Word Relying on a single individual trace is akin to navigating a vast terrain with just a flashlight.
While these traces offer valuable insights, their true potential is only realized when viewed collectively. The limitations of a single trace, such as a narrow perspective, temporal blindness, and lack of context, can often lead developers astray, causing them to miss broader systemic issues. On the other hand, the combined power of grouped traces and metrics offers a panoramic view of system health. Together, they allow for a holistic understanding, precise correlation of issues, performance benchmarking, and enhanced troubleshooting. For Java developers, this tandem approach ensures a comprehensive and nuanced understanding of applications, optimizing both performance and user experience. In essence, while individual traces are the chapters of our software story, it's only when they're read in sequence and in tandem with metrics that the full narrative comes to life.
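Returning to the context problem described earlier: if processOrders() records how much work it actually performed, a single trace immediately tells you whether those 5 seconds covered 50 orders or 5,000. Here is a minimal sketch using the OpenTelemetry Java API; the tracer name, span name, and attribute key are illustrative assumptions rather than an established convention.
Java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import java.util.List;

public class OrderProcessor {
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("shop.orders");

    public void processOrders(List<Order> orders) {
        Span span = tracer.spanBuilder("processOrders").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // Attach the batch size so the trace carries the context needed
            // to judge whether a slow execution was actually unexpected.
            span.setAttribute("orders.count", orders.size());
            for (Order order : orders) {
                handle(order); // spans created inside handle() become children via the current context
            }
        } finally {
            span.end();
        }
    }

    private void handle(Order order) { /* business logic elided */ }

    public record Order(long id) {}
}
Grouping traces by such attributes later (for example, all spans whose orders.count is unusually high) is part of what turns isolated chapters into the collective story described above.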
If your environment is like many others, it can often seem like your systems produce logs filled with a bunch of excess data. Since you need to access multiple components (servers, databases, network infrastructure, applications, etc.) to read your logs — and they don’t typically have any specific purpose or focus holding them together — you may dread sifting through them. If you don’t have the right tools, it can feel like you’re stuck with a bunch of disparate, hard-to-parse data. In these situations, I picture myself as a cosmic collector, gathering space debris as it floats by my ship and sorting the occasional good material from the heaps of galactic material. Though it can feel like more trouble than it’s worth, sorting through logs is crucial. Logs hold many valuable insights into what’s happening in your applications and can indicate performance problems, security issues, and user behavior. In this article, we’re going to take a look at how logging can help you make sense of your log data without much effort. We'll talk about best practices and habits and use some of the Log Analytics tools from Sumo Logic as examples. Let’s blast off and turn that cosmic trash into treasure! The Truth Is Out There: Getting Value Just From the Things You’re Already Logging One massive benefit offered by a log analytics platform to any system engineer is the ability to utilize a single log interface. Rather than needing to SSH into countless machines or download logs and parse through them manually, viewing all your logs in a centralized aggregator can make it much easier to see simultaneous events across your infrastructure. You’ll also be able to clearly follow the flow of data and requests through your stack. Once you see all your logs in one place, you can tap into the latent value of all that data. Of course, you could make your own aggregation interface from scratch, but often, log aggregation tools provide a number of extra features that are worth the additional investment. Those extra features include capabilities such as powerful search and fast analytics. Searching Through the Void: Using Search Query Language To Find Things You’ve probably used grep or similar tools for searching through your logs, but for real power, you need the ability to search across all of your logs in one interface. You may have even investigated using the ELK stack on your own infrastructure to get going with log aggregation. If you have, you know how valuable putting logs all in the same place can be. Some tools provide even more functionality on top of this interface. For example, with Log Analytics, you can use a Search Query Language that allows for more complex searches. Because these searches are being executed across a vast amount of log data, you can use special operations to harness the power of your log aggregation service. Some of these operations can be achieved with grep, so long as you have all of the logs at your disposal. But others, such as aggregate operators, field expressions, or transaction analytics tools, can produce extremely powerful reports and monitoring triggers across your infrastructure. To choose just one tool as an example, let’s take a closer look at field expressions. Essentially, field expressions allow you to create variables in your queries based on what you find through your log data. 
For example, if you wanted to search across your logs, and you know your log lines follow the format "From: Jane To: John," you can parse out the "from" and "to" with the following query: * | parse "From: * To: *" as (from, to) This would store "Jane" in the "from" field and "John" in the "to" field. Another valuable language feature you could tap into would be keyword expressions. You could use this query to search across your logs for any instances where a command with root privileges failed: (su OR sudo) AND (fail* OR error) Here is a listing of General Search Examples that are drawn from parsing a single Apache log message. Light-Speed Analytics: Making Use of Real-Time Reports and Advanced Analytics One other aspect of searching is that it's typically looking into the past. Sometimes, you need to be looking at things as they happen. Let's take a look at Live Tail and LogReduce — two tools to improve simple searches. Versions of these features exist on many platforms, but I like the way they work on Sumo Logic's offering, so we'll dive into them. Live Tail At its simplest, Live Tail lets you see a live feed of your log messages. It's like running tail -f on any one of your servers to see the logs as they come in, but instead of being on a single machine, you're looking across all logs associated with a Sumo Logic Source or Collector. Your Live Tail can be modified to automatically filter for only specific things. Live Tail also supports highlighting keywords (up to eight of them) as the logs roll in. LogReduce LogReduce gives you more insight into, or a better understanding of, your search query's aggregate log results. When you run LogReduce on a query, it performs fuzzy logic analysis on messages meeting the search criteria you defined and then provides you with a set of "Signatures" that meet your criteria. It also gives you a count of the logs with that pattern and a rating of the relevance of the pattern when compared to your search. You then have tools at your disposal to rank the generated signatures and even perform further analysis on the log data. This is all pretty advanced and can be hard to understand without a demo, so you can dive deeper by watching this video. Integrated Log Aggregation Often, you'll need information from systems you aren't running directly mixed in with your other logs. That's why it's important to make sure you can integrate your log aggregator with other systems. Many log aggregators provide this functionality. Elastic, which underlies the ELK stack, provides a bunch of integrations that you can hook into your self-hosted or cloud-hosted stack. Of course, integrations aren't only available on the ELK stack. Sumo Logic provides a whole list of integrations as well. Regardless, the power of connecting your logs with the many systems you use outside of your monitoring and operational stack is phenomenal. Want to get logs sent from your company's 1Password account into the rest of your logs? Need more information from AWS than you are getting on your individual instances or services? ELK and Sumo Logic provide great options. The key to understanding this concept is that you don't need to be the one controlling the logs to make it valuable to aggregate them. Think through the full picture of what systems keep your business running, and consider putting all of the logs in your aggregator together. Conclusion This has been a brief tour through some of the features available with log aggregation.
There’s a lot more to it, which shouldn’t be surprising given the vast amount of data generated every second by our infrastructure. The really amazing part of these tools is that these insights are available to you without installing anything on your servers. You just need to have a way to export your log data to the aggregation service. Whether you need to track compliance or monitor the reliability of your services, log aggregation is an incredibly powerful tool that can let you unlock infinite value from your already existing log data. That way, you can become a better cosmic junk collector!
The World Has Changed, and We Need To Adapt The world has gone through a tremendous transformation in the last fifteen years. Cloud and microservices changed the world. Previously, our application was using one database; developers knew how it worked, and the deployment rarely happened. A single database administrator was capable of maintaining the database, optimizing the queries, and making sure things worked as expected. The database administrator could just step in and fix the performance issues we observed. Software engineers didn’t need to understand the database, and even if they owned it, it was just a single component of the system. Guaranteeing software quality was much easier because the deployment happened rarely, and things could be captured on time via automated tests. Fifteen years later, everything is different. Companies have hundreds of applications, each one with a dedicated database. Deployments happen every other hour, deployment pipelines work continuously, and keeping track of flowing changes is beyond one’s capabilities. The complexity of the software increased significantly. Applications don’t talk to databases directly but use complex libraries that generate and translate queries on the fly. Application monitoring is much harder because applications do not work in isolation, and each change may cause multiple other applications to fail. Reasoning about applications is now much harder. It’s not enough to just grab the logs to understand what happened. Things are scattered across various components, applications, queues, service buses, and databases. Databases changed as well. We have various SQL distributions, often incompatible despite having standards in place. We have NoSQL databases that provide different consistency guarantees and optimize their performance for various use cases. We developed multiple new techniques and patterns for structuring our data, processing it, and optimizing schemas and indexes. It’s not enough now to just learn one database; developers need to understand various systems and be proficient with their implementation details. We can’t rely on ACID anymore as it often harms the performance. However, other consistency levels require a deep understanding of the business. This increases the conceptual load significantly. Database administrators have a much harder time keeping up with the changes, and they don’t have enough time to improve every database. Developers are unable to analyze and get the full picture of all the moving parts, but they need to deploy changes faster than ever. And the monitoring tools still swamp us with metrics instead of answers. Given all the complexity, we need developers to own their databases and be responsible for their data storage. This “shift left” in responsibility is a must in today’s world for both small startups and big Fortune 500 enterprises. However, it’s not trivial. How do we prevent the bad code from reaching production? How to troubleshoot issues automatically? How do we move from monitoring to observability? Finally, how do we give developers the proper tools and processes so they will be able to own the databases? Read on to find answers. Measuring Application Performance Is Complex It’s crucial to measure to improve the performance. Performance indicators (PIs) help us evaluate the performance of the system on various dimensions. They can focus on infrastructure aspects such as the reliability of the hardware or networking. 
They can use application metrics to assess the performance and stability of the system. They can also include business metrics to measure the success from the company and user perspective, including user retention or revenue. Performance indicators are important tracking mechanisms to understand the state of the system and the business as a whole. However, in our day-to-day job, we need to track many more metrics. We need to understand contributors to the performance indicators to troubleshoot the issues earlier and understand whether the system is healthy or not. Let’s see how to build these elements in the modern world. We typically need to start with telemetry — the ability to collect the signals. There are multiple types of signals that we need to track: logs (especially application logs), metrics, and traces. Capturing these signals can be a matter of proper configuration (like enabling them in the hosting provider panel), or they need to be implemented by the developers. Recently, OpenTelemetry gained significant popularity. It’s a set of SDKs for popular programming languages that can be used to instrument applications to generate signals. This way, we have a standardized way of building telemetry within our applications. Odds are that most of the frameworks and libraries we use are already integrated with OpenTelemetry and can generate signals properly. Next, we need to build a solution for capturing the telemetry signals in one centralized place. This way, we can see “what happens” inside the system. We can browse the signals from the infrastructure (like hosts, CPUs, GPUs, and network), applications (number of requests, errors, exceptions, data distribution), databases (data cardinality, number of transactions, data distribution), and many other parts of the application (queues, notification services, service buses, etc.). This lets us troubleshoot more easily as we can see what happens in various parts of the ecosystem. Finally, we can build the Application Performance Management (APM). It’s the way of tracking metric indicators with telemetry and dashboards. APM focuses on providing end-to-end monitoring that goes across all the components of the system, including the web layer, mobile and desktop applications, databases, and the infrastructure connecting all the elements. It can be used to automate alarms and alerts to constantly assess whether the system is healthy. APM may seem like a silver bullet. It aggregates metrics, shows the performance, and can quickly alert when something goes wrong, and the fire begins. However, it’s not that simple. Let’s see why. Why Application Performance Monitoring Is Not Enough APM captures signals and presents them in a centralized application. While this may seem enough, it lacks multiple features that we would expect from a modern maintenance system. First, APM typically presents raw signals. While it has access to various metrics, it doesn’t connect the dots easily. Imagine that the CPU spikes. Should you migrate to a bigger machine? Should you optimize the operating system? Should you change the driver? Or maybe the CPU spike is caused by different traffic coming to the application? You can’t tell that easily just by looking at metrics. Second, APM doesn’t easily show where the problem is. We may observe metrics spiking in one part of the system, but it doesn’t necessarily mean that the part is broken. There may be other reasons and issues. 
Maybe it’s wrong input coming to the system, maybe some external dependency is slow, and maybe some scheduled task runs too often. APM doesn’t show that, as it cannot connect the dots and show the flow of changes throughout the system. You just see the state then, but you don’t see how you got to that point easily. Third, the resolution is unknown. Let’s say that the CPU spiked during the scheduled maintenance task. Should we upscale the machine? Should we disable the task? Should we run it some other time? Is there a bug in the task? Many things are not clear. We can easily imagine a situation when the scheduled task runs in the middle of the day just because it is more convenient for the system administrators; however, the task is now slow and competes with regular transactions for the resources. In that case, we probably should move the task to some time outside of peak hours. Another scenario is that the task was using an index that doesn’t work anymore. Therefore, it’s not about the task per se, but it’s about the configuration that has been changed with the last deployment. Therefore, we should fix the index. APM won’t show us all those details. Fourth, APM is not very readable. Dashboards with metrics look great, but they are too often just checked whether they’re green. It’s not enough to see that alarms are not ringing. We need to manually review the metrics, look for anomalies, understand how they change, and if we have all the alarms in place. This is tedious and time-consuming, and many developers don’t like doing that. Metrics, charts, graphs, and other visualizations swamp us with raw data that doesn’t show the big picture. Finally, one person can’t reason about the system. Even if we have a dedicated team for maintenance, the team won’t have an understanding of all the changes going through the system. In the fast-paced world with tens of deployments every day, we can’t look for issues manually. Every deployment may result in an outage due to invalid schema migration, bad code change, cache purge, lack of hardware, bad configuration, or many more issues. Even when we know something is wrong and we can even point to the place, the team may lack the understanding or knowledge needed to identify the root cause. Involving more teams is time-consuming and doesn’t scale. While APM looks great, it’s not the ultimate solution. We need something better. We need something that connects the dots and provides answers instead of data. We need true observability. What Makes the Observability Shine Observability turns alerts into root causes and raw data into understanding. Instead of charts, diagrams, and graphs, we want to have a full story of the changes going through pipelines and how they affect the system. This should understand the characteristics of the application, including the deployment scheme, data patterns, partitioning, sharding, regionalization, and other things specific to the application. Observability lets us reason about the internals of the system from the outside. For instance, we can reason that we deployed the wrong changes to the production environment because there is a metric spike in the database. We don’t focus on the database per se, but we analyze the difference between the current and the previous code. However, if there was no deployment recently, but we observe much higher traffic on the load balancer, then we can reason that it’s probably due to different traffic coming to the application. Observability makes the interconnections clear and visible. 
To build observability, we need to capture static signals and dynamic history. We need to include our deployments, configuration, extensions, connectivity, and characteristics of our application code. It’s not enough just to see that “something is red now.” We need to understand how we got there and what could be the possible reason. To achieve that, a good observability solution needs to go through multiple steps. First, we need to be able to pinpoint the problem. In the modern world of microservices and bounded contexts, it’s not trivial. If the CPU spikes, we need to be able to answer which service or application caused that, which tenant is responsible, or whether this is for all the traffic or some specific requests in the case of a web application. We can do that by carefully observing metrics with multiple dimensions, possibly with dashboards and alarms. Second, we need to include multiple signals. CPU spikes can be caused by a lack of hardware, wrong configuration, broken code, unexpected traffic, or simply things that shouldn’t be running at that time. What’s more, maybe something unexpected happened around the time of the issue. This could be related to a deployment, an ongoing sports game, a specific time of week or time of year, some promotional campaign we just started, or some outage in the cloud infrastructure. All these inputs must be provided to the observability system to understand the bigger picture. Third, we need to look for anomalies. It may seem counterintuitive, but digital applications rot over time. Things change, traffic changes, updates are installed, security fixes are deployed, and every single change can break our application. However, the outage may not be quick and easy. The application may get slower and slower over time, and we won’t notice that easily because alarms do not go off or they become red only for a short period. Therefore, we need to have anomaly detection built-in. We need to be able to look for traffic patterns, weekly trends, and known peaks during the year. A proper observability solution needs to be aware of these and automatically find the situations in which the metrics don’t align. Fourth, we need to be able to automatically root cause the issue and suggest a solution. We can’t push the developers to own the databases and the systems without proper tooling. The observability systems need to be able to automatically suggest improvements. We need to unblock the developers so they can finally be responsible for the performance and own the systems end to end. Databases and Observability We Need Today Let’s now see what we need in the domain of databases. Many things can break, and it’s worth exploring the challenges we may face when working with SQL or NoSQL databases. We are going to see the three big areas where things may go wrong. These are code changes, schema changes, and execution changes. Code Changes Many database issues come from the code changes. Developers modify the application code, and that results in different SQL statements being sent to the database. These queries may be inherently slow, but these won’t be captured by the testing processes we have in place now. Imagine that we have the following application code that extracts the user aggregate root. 
The user may have multiple additional pieces of information associated with them, like details, pages, or texts:
JavaScript
const user = repository.get("user")
    .where("user.id = 123")
    .leftJoin("user.details", "user_details_table")
    .leftJoin("user.pages", "pages_table")
    .leftJoin("user.texts", "texts_table")
    .leftJoin("user.questions", "questions_table")
    .leftJoin("user.reports", "reports_table")
    .leftJoin("user.location", "location_table")
    .leftJoin("user.peers", "peers_table")
    .getOne();

return user;
The code generates the following SQL statement:
SQL
SELECT *
FROM users AS user
LEFT JOIN user_details_table AS detail ON detail.user_id = user.id
LEFT JOIN pages_table AS page ON page.user_id = user.id
LEFT JOIN texts_table AS text ON text.user_id = user.id
LEFT JOIN questions_table AS question ON question.user_id = user.id
LEFT JOIN reports_table AS report ON report.user_id = user.id
LEFT JOIN locations_table AS location ON location.user_id = user.id
LEFT JOIN peers_table AS peer ON peer.user_id = user.id
WHERE user.id = '123'
Because of the multiple joins, the query returns nearly 300 thousand rows to the application that are later processed by the mapper library. This takes 25 seconds in total. Just to get one user entity. The problem with such a query is that we don't see the performance implications when we write the code. If we have a small developer database with only a hundred rows, then we won't get any performance issues when running the code above locally. Unit tests won't catch that either because the code is "correct" — it returns the expected result. We won't see the issue until we deploy to production and see that the query is just too slow. Another problem is the well-known N+1 query problem with Object Relational Mapper (ORM) libraries. Imagine that we have a table flights that is in a 1-to-many relation with a table tickets. If we write code to get all the flights and count all the tickets, we may end up with the following:
C#
var totalTickets = 0;
var flights = dao.getFlights();
foreach (var flight in flights)
{
    // Each call to getTickets() lazily loads the tickets for this flight, issuing another query.
    totalTickets += flight.getTickets().Count;
}
This may result in N+1 queries being sent in total. One query to get all the flights, and then n queries to get tickets for every flight:
SQL
SELECT * FROM tickets WHERE ticket.flight_id = 1;
SELECT * FROM tickets WHERE ticket.flight_id = 2;
SELECT * FROM tickets WHERE ticket.flight_id = 3;
...
SELECT * FROM tickets WHERE ticket.flight_id = n;
SELECT * FROM flights;
Just as before, we don't see the problem when running things locally, and our tests won't catch that. We'll find the problem only when we deploy to an environment with a sufficiently big data set. Yet another thing is about rewriting queries to make them more readable. Let's say that we have a table boarding_passes. We want to write the following query (just for exemplary purposes):
SQL
SELECT COUNT(*)
FROM boarding_passes AS C1
JOIN boarding_passes AS C2
    ON C2.ticket_no = C1.ticket_no
    AND C2.flight_id = C1.flight_id
    AND C2.boarding_no = C1.boarding_no
JOIN boarding_passes AS C3
    ON C3.ticket_no = C1.ticket_no
    AND C3.flight_id = C1.flight_id
    AND C3.boarding_no = C1.boarding_no
WHERE MD5(MD5(C1.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'
    AND MD5(MD5(C2.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'
    AND MD5(MD5(C3.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'
This query joins the table with itself three times, calculates the MD5 hash of the ticket number twice, and then filters rows based on the condition.
Yet another source of trouble is rewriting queries to make them more readable. Let's say that we have a table boarding_passes and we want to write the following query (purely for demonstration purposes):

SQL
SELECT COUNT(*)
FROM boarding_passes AS C1
JOIN boarding_passes AS C2
  ON C2.ticket_no = C1.ticket_no
  AND C2.flight_id = C1.flight_id
  AND C2.boarding_no = C1.boarding_no
JOIN boarding_passes AS C3
  ON C3.ticket_no = C1.ticket_no
  AND C3.flight_id = C1.flight_id
  AND C3.boarding_no = C1.boarding_no
WHERE MD5(MD5(C1.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'
  AND MD5(MD5(C2.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'
  AND MD5(MD5(C3.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'

This query joins the table with itself three times, calculates the MD5 hash of the ticket number twice, and then filters rows based on that condition. It runs for 8 seconds on my machine with the demo database. A programmer may now want to avoid the repetition and rewrite the query as follows:

SQL
WITH cte AS (
  SELECT *, MD5(MD5(ticket_no)) AS double_hash
  FROM boarding_passes
)
SELECT COUNT(*)
FROM cte AS C1
JOIN cte AS C2
  ON C2.ticket_no = C1.ticket_no
  AND C2.flight_id = C1.flight_id
  AND C2.boarding_no = C1.boarding_no
JOIN cte AS C3
  ON C3.ticket_no = C1.ticket_no
  AND C3.flight_id = C1.flight_id
  AND C3.boarding_no = C1.boarding_no
WHERE C1.double_hash = '525ac610982920ef37b34aa56a45cd06'
  AND C2.double_hash = '525ac610982920ef37b34aa56a45cd06'
  AND C3.double_hash = '525ac610982920ef37b34aa56a45cd06'

The query is now more readable because it avoids the repetition. However, the performance dropped, and the query now executes in 13 seconds. When we deploy changes like these to production, we may conclude that we need to scale up the database: seemingly nothing has changed, yet the database is now much slower. With good observability tools, we would see that the query executed behind the scenes is now different, which explains the performance drop.

Schema Changes

Another problem area is schema management. There are generally three ways of modifying a schema: we can add something (a table, a column, an index, etc.), remove something, or modify something. Each schema modification is dangerous because the database engine may need to rewrite the table: copy the data aside, modify the table schema, and then copy the data back. This may lead to a very long deployment (minutes, hours, even months) that we can't optimize or stop in the middle. Additionally, we typically won't see these problems when running things locally because we run our tests against the latest schema. A good observability solution needs to capture such changes before they run in production.

Indexes pose another interesting challenge. Adding an index seems safe. However, like every index, it needs to be maintained over time. Indexes generally improve read performance because they help us find rows much faster, but they decrease modification performance because every data change must be applied to the table and to all of its indexes. On top of that, an index may stop being useful after some time. It's often the case that we configure an index and, a couple of months later, change the application code so the index isn't used anymore. Without good observability, we won't notice that the index no longer helps and only slows the database down.
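As a concrete illustration (my addition, assuming PostgreSQL, which is also the engine used for the EXPLAIN example later in this article), unused indexes can be spotted from the statistics the engine already collects:

SQL
-- List indexes that have not been used for reads since the statistics were
-- last reset, skipping unique indexes because they enforce constraints.
SELECT s.schemaname,
       s.relname      AS table_name,
       s.indexrelname AS index_name,
       s.idx_scan     AS index_scans
FROM pg_stat_user_indexes AS s
JOIN pg_index AS i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
ORDER BY s.relname;

A snapshot like this only tells us what happened since the counters were last reset, so an observability system should track these numbers over time rather than rely on a single query.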
Execution Changes

Yet another area of issues is related to the way queries are executed. Databases prepare a so-called execution plan for every query. Whenever a statement is sent to the database, the engine analyzes indexes, data distribution, and statistics about the tables' content to figure out the fastest way of running the query. Such an execution plan heavily depends on the content of our database and the running configuration. The plan dictates which join strategy to use when joining tables (nested loop join, merge join, hash join, or something else), which indexes to scan (or whole tables instead), and when to sort and materialize results. We can influence the execution plan by providing query hints: inside the SQL statements, we can specify which join strategy to use or which locks to acquire. The database may use these hints to improve performance, but it may also disregard them and execute things differently. However, we don't know whether the database used them or not.

Things get worse over time. Indexes may change after a deployment, data distribution may depend on the day of the week, and the database load may differ significantly between countries once we regionalize our application. Query hints that we provided half a year ago may not be relevant anymore, but our tests won't catch that. Unit tests verify the correctness of our queries, and the queries still return the same results. We simply have no way of identifying these changes automatically.

Database Guardrails Are the New Standard

Based on what we said above, we need a new approach. No matter if we run a small product or a big Fortune 500 company, we need a novel way of dealing with databases. Developers need to own their databases and have all the means to do it well. We need good observability and database guardrails, a novel approach that:

- Prevents bad code from reaching production
- Monitors all moving pieces to build a meaningful context for the developer
- Significantly reduces the time to identify the root cause and troubleshoot issues, so the developer gets direct and actionable insights

We can't let ourselves go blind anymore. We need tools and systems that help us change the way we interact with databases, avoid performance issues, and troubleshoot problems as soon as they appear in production. Let's see how we can build such a system. There are four things we need to capture to build successful database guardrails. Let's walk through them.

Database Internals

Each database provides plenty of detail about the way it executes a query. These details are typically captured in the execution plan, which explains which join strategies were used, which tables and indexes were scanned, and what data was sorted. To get the execution plan, we can typically use the EXPLAIN keyword. For instance, take the following PostgreSQL query:

SQL
SELECT TB.*
FROM name_basics AS NB
JOIN title_principals AS TP ON TP.nconst = NB.nconst
JOIN title_basics AS TB ON TB.tconst = TP.tconst
WHERE NB.nconst = 'nm00001'

We can prepend EXPLAIN to get the following query:

SQL
EXPLAIN
SELECT TB.*
FROM name_basics AS NB
JOIN title_principals AS TP ON TP.nconst = NB.nconst
JOIN title_basics AS TB ON TB.tconst = TP.tconst
WHERE NB.nconst = 'nm00001'

The query returns the following output:

Nested Loop  (cost=1.44..4075.42 rows=480 width=89)
  ->  Nested Loop  (cost=1.00..30.22 rows=480 width=10)
        ->  Index Only Scan using name_basics_pkey on name_basics nb  (cost=0.43..4.45 rows=1 width=10)
              Index Cond: (nconst = 'nm00001'::text)
        ->  Index Only Scan using title_principals_nconst_idx on title_principals tp  (cost=0.56..20.96 rows=480 width=20)
              Index Cond: (nconst = 'nm00001'::text)
  ->  Index Scan using title_basics_pkey on title_basics tb  (cost=0.43..8.43 rows=1 width=89)
        Index Cond: (tconst = tp.tconst)

This is a textual representation of the query and how it will be executed. We can see important information about the join strategy (Nested Loop in this case), the tables and indexes used (an Index Only Scan on name_basics_pkey, an Index Scan on title_basics_pkey), and the cost of each operation. Cost is an arbitrary number indicating how hard it is to execute the operation. We shouldn't draw conclusions from the numbers themselves, but we can compare plans based on their cost and choose the cheapest one. Having plans at hand, we can easily tell what's going on.
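As a side note not covered in the walkthrough above, the plan shown here contains only the planner's estimates. In PostgreSQL, EXPLAIN also accepts options such as ANALYZE and BUFFERS, which actually execute the statement and report real timings, row counts, and buffer usage next to those estimates:

SQL
-- ANALYZE executes the query for real, so wrap data-modifying statements
-- in a transaction you can roll back before using it on them.
EXPLAIN (ANALYZE, BUFFERS)
SELECT TB.*
FROM name_basics AS NB
JOIN title_principals AS TP ON TP.nconst = NB.nconst
JOIN title_basics AS TB ON TB.tconst = TP.tconst
WHERE NB.nconst = 'nm00001';

Comparing estimated row counts with actual ones is one of the quickest ways to spot stale statistics or a plan that will stop scaling once production data grows.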
We can see whether we have an N+1 query issue, whether we use indexes efficiently, and whether the operation runs fast. We can get insights into how to improve the queries. We can often tell whether a query is going to scale well in production just by looking at how it reads the data. Once we have these plans, we can move on to the next part of successful database guardrails.

Integration With Applications

We need to extract plans somehow and correlate them with what our application does. To do that, we can use OpenTelemetry (OTel). OpenTelemetry is an open standard for instrumenting applications. It provides SDKs for many programming languages and is now commonly used in frameworks and libraries for HTTP, SQL, ORMs, and other application layers. OpenTelemetry captures signals: logs, traces, and metrics. These are organized into spans and traces that represent the communication between services and the timing of operations. Each span represents one operation performed by some server, such as a file access, a database query, or the handling of a request.

We can now extend OpenTelemetry signals with details from databases. We can extract execution plans, correlate them with signals from other layers, and build a full understanding of what happened behind the scenes. For instance, we would clearly see the N+1 problem just by looking at the number of spans. We could immediately identify schema migrations that are too slow or operations that would take the database down. Now we need the last piece to capture the full picture.

Semantic Monitoring of All Databases

Observing just the local database may not be enough. The same query may execute differently depending on the configuration or the freshness of the statistics. Therefore, we need to integrate monitoring with all the databases we have, especially the production ones. By extracting statistics, row counts, the running configuration, and the installed extensions, we can understand how each database performs. Next, we can combine that with the queries we run locally: we take a query captured in the local environment and reason about how it would execute in production. We can compare the execution plans and see which tables are accessed or how many rows would be read. This way, we can immediately tell the developer that the query is not going to scale well in production. Even if the developer has a different database locally, or only a handful of rows, we can still take the query or the execution plan, enrich it with production statistics, and reason about the performance after deployment. We don't need to wait for the deployment or for load tests; we can provide feedback nearly immediately.

The most important part is that we move from raw signals to reasoning. We don't swamp the user with plots or metrics that are hard to understand or that can't be acted on without setting the right thresholds. Instead, we provide meaningful suggestions. Instead of saying, "CPU spiked to 80%," we can say, "The query scanned the whole table, and you should add an index on this and that column." We give developers answers, not just data points to reason about.

Automated Troubleshooting

That's just the beginning. Once we understand what is actually happening in the database, the sky's the limit. We can run anomaly detection on the queries to see how they change over time, whether they still use the same indexes as before, or whether they changed the join strategy.
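To make that less abstract, here is one hedged example of where such per-query history can come from. It assumes PostgreSQL with the pg_stat_statements extension enabled and uses the column names of PostgreSQL 13 and later (older versions expose total_time and mean_time instead):

SQL
-- Sample the slowest statements by average execution time; snapshots taken
-- periodically give the time series that anomaly detection can work on.
SELECT queryid,
       calls,
       mean_exec_time,
       rows,
       shared_blks_read
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

Comparing snapshots like this over time is what lets a guardrails system notice that a query's timing or read volume has drifted even though the application code looks unchanged.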
We can also catch ORM configuration changes that lead to multiple SQL queries being sent for a single REST API call. We can submit automated pull requests to tune the configuration. We can even correlate the application code with the SQL queries and rewrite the code on the fly with machine-learning solutions.

Summary

In recent years, we have observed a big evolution in the software industry. We run many applications, deploy many times a day, scale out to hundreds of servers, and use more and more components. Application Performance Monitoring is not enough to keep track of all the moving parts of our applications. Here at Metis, we believe we need something better. We need true observability that can finally show us the full story, and we can use it to build database guardrails that provide actual answers and actionable insights: not a set of metrics the developer needs to track and understand, but automated reasoning connecting all the dots. That's the new approach we need and the new age we deserve as developers owning our databases.