In enterprises, SREs, DevOps engineers, and cloud architects often debate which observability platform to choose for faster troubleshooting and a better understanding of the performance of their production systems. To get maximum value for their team, they need to answer questions such as: Will an observability tool support all kinds of workloads and heterogeneous systems? Will it support all kinds of data aggregation, such as logs, metrics, traces, and topology? Will the investment in the (ongoing or new) observability tool be justified?

In this article, we will show the best way to get started with unified observability of your entire infrastructure using open-source Apache Skywalking and the Istio service mesh.

Istio Service Mesh of a Multi-Cloud Application

Let us take a multi-cloud example where multiple services are hosted on on-prem or managed Kubernetes clusters. The first step toward unified observability is to form a service mesh using Istio. The idea is that every service or workload in the Kubernetes clusters (or VMs) is accompanied by an Envoy proxy, abstracting security and networking out of the business logic. As you can see in the image below, a service mesh is formed, and the network communication from the edge to workloads, among workloads, and between clusters is controlled by the Istio control plane. The Istio service mesh emits logs, metrics, and traces for each Envoy proxy, which helps achieve unified observability. We then need a visualization tool like Skywalking to collect that data and populate it for granular observability.

Why Skywalking for Observability

SREs from large companies such as Alibaba, Lenovo, AB InBev, and Baidu use Apache Skywalking, and the common reasons are: Skywalking aggregates logs, metrics, traces, and topology. It natively supports popular service mesh software like Istio.
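As described above, every workload should be paired with an Envoy sidecar. One common way to arrange this (a minimal sketch; the namespace name `demo` is only an example, not from the demo repo) is to label the namespace so that Istio's mutating webhook injects the sidecar into every new pod automatically:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo                     # example namespace; replace with your own
  labels:
    istio-injection: enabled     # Istio's webhook injects an Envoy sidecar into new pods here
```

Equivalently, you can run `kubectl label namespace demo istio-injection=enabled` on an existing namespace; already-running pods must be restarted to receive the sidecar.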
While some tools cannot ingest data from Envoy sidecars, Skywalking supports sidecar integration out of the box. It supports OpenTelemetry (OTel) standards for observability; these days, OTel standards and instrumentation are popular for metrics, traces, and logs (MTL). Skywalking supports observability-data collection from almost all the elements of the full stack: database, OS, network, storage, and other infrastructure. It is open source and free (with an affordable enterprise version).

Now, let us see how to integrate Istio and Apache Skywalking into your enterprise.

Steps To Integrate Istio and Apache Skywalking

We have created a demo that establishes the connection between the Istio data plane and Skywalking, where Skywalking collects data from the Envoy sidecars and populates it in the observability dashboards.

Note: By default, Skywalking comes with predefined dashboards for Apache APISIX and AWS Gateways. Since we are using the Istio Gateway, it will not get a dedicated dashboard out of the box, but we'll get metrics for it in other locations.

If you want to watch the video, check out my latest Istio-Skywalking configuration video. You can refer to the GitHub link here.

Step 1: Add Kube-State-Metrics To Collect Metrics From the Kubernetes API Server

We have installed the kube-state-metrics service to listen to the Kubernetes API server and send those metrics to Apache Skywalking. First, add the Prometheus community repo:

Shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

Then update the repo to fetch the latest charts:

Shell
helm repo update

And now you can install kube-state-metrics:

Shell
helm install kube-state-metrics prometheus-community/kube-state-metrics

Step 2: Install Skywalking Using Helm Charts

We will install Skywalking version 9.2.0 for this observability demo. You can run the following command to install Skywalking into a namespace (my namespace is skywalking). You can refer to the values.yaml.
Shell
helm install skywalking oci://registry-1.docker.io/apache/skywalking-helm -f values.yaml -n skywalking

(Optional reading) In the Helm chart values.yaml, you will notice that:

We have set the oap (Observability Analysis Platform, i.e., the back end) and ui flags to true. Similarly, for databases, we have enabled postgresql as true.

For tracking metrics from Envoy access logs, we have configured the following environment variables:
SW_ENVOY_METRIC: default
SW_ENVOY_METRIC_SERVICE: true
SW_ENVOY_METRIC_ALS_HTTP_ANALYSIS: k8s-mesh,mx-mesh,persistence
SW_ENVOY_METRIC_ALS_TCP_ANALYSIS: k8s-mesh,mx-mesh,persistence
These select the logs and metrics from Envoy based on the Istio configuration (the ALS settings are the rules for analyzing Envoy access logs).

We enable the OpenTelemetry receiver and configure it to receive data in otlp format, and we enable OTel rules according to the data we will send to Skywalking. In a few moments (in Step 3), we will configure the OTel collector to scrape istiod, k8s, kube-state-metrics, and the Skywalking OAP itself, so we have enabled the appropriate rules:
SW_OTEL_RECEIVER: default
SW_OTEL_RECEIVER_ENABLED_HANDLERS: "otlp"
SW_OTEL_RECEIVER_ENABLED_OTEL_RULES: "istio-controlplane,k8s-cluster,k8s-node,k8s-service,oap"
SW_TELEMETRY: prometheus
SW_TELEMETRY_PROMETHEUS_HOST: 0.0.0.0
SW_TELEMETRY_PROMETHEUS_PORT: 1234
SW_TELEMETRY_PROMETHEUS_SSL_ENABLED: false
SW_TELEMETRY_PROMETHEUS_SSL_KEY_PATH: ""
SW_TELEMETRY_PROMETHEUS_SSL_CERT_CHAIN_PATH: ""

With this, we have instructed Skywalking to collect data from the Istio control plane, the Kubernetes cluster, nodes, and services, and also from the OAP itself. The telemetry settings enable Skywalking OAP's self-observability, meaning it exposes Prometheus-compatible metrics on port 1234 with SSL disabled; again, in Step 3, we will configure the OTel collector to scrape this endpoint.
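Taken together, the settings described above correspond to a values.yaml fragment roughly like the following. This is a sketch against the 9.2.x chart layout; exact key names may vary between chart versions, so treat it as illustrative rather than the demo's exact file:

```yaml
oap:
  # Observability Analysis Platform (the Skywalking back end)
  env:
    SW_ENVOY_METRIC: default
    SW_ENVOY_METRIC_SERVICE: "true"
    SW_ENVOY_METRIC_ALS_HTTP_ANALYSIS: k8s-mesh,mx-mesh,persistence
    SW_ENVOY_METRIC_ALS_TCP_ANALYSIS: k8s-mesh,mx-mesh,persistence
    SW_OTEL_RECEIVER: default
    SW_OTEL_RECEIVER_ENABLED_HANDLERS: otlp
    SW_OTEL_RECEIVER_ENABLED_OTEL_RULES: istio-controlplane,k8s-cluster,k8s-node,k8s-service,oap
    SW_TELEMETRY: prometheus
    SW_TELEMETRY_PROMETHEUS_HOST: 0.0.0.0
    SW_TELEMETRY_PROMETHEUS_PORT: "1234"
    SW_TELEMETRY_PROMETHEUS_SSL_ENABLED: "false"
ui:
  # Skywalking web UI
  enabled: true
postgresql:
  # storage back end for the OAP
  enabled: true
```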
In the Helm chart, we have also enabled the creation of a service account for the Skywalking OAP.

Step 3: Setting Up the Istio + Skywalking Configuration

After that, we can install Istio using this IstioOperator configuration. In the IstioOperator configuration, we have set up meshConfig so that every sidecar has the Envoy access log service enabled, with the addresses of both the access log service and the metrics service pointing to Skywalking. Additionally, with proxyStatsMatcher, we are configuring all metrics to be sent via the metrics service.

YAML
meshConfig:
  defaultConfig:
    envoyAccessLogService:
      address: "skywalking-skywalking-helm-oap.skywalking.svc:11800"
    envoyMetricsService:
      address: "skywalking-skywalking-helm-oap.skywalking.svc:11800"
    proxyStatsMatcher:
      inclusionRegexps:
        - .*
  enableEnvoyAccessLogService: true

Step 4: OpenTelemetry Collector

Once the Istio and Skywalking configuration is done, we need to feed metrics from applications, gateways, nodes, etc., to Skywalking. We have used opentelemetry-collector.yaml to scrape the Prometheus-compatible endpoints. In the collector, we have specified that OpenTelemetry will scrape metrics from istiod, the Kubernetes cluster, kube-state-metrics, and Skywalking itself.

We have also created a service account for OpenTelemetry. Using opentelemetry-serviceaccount.yaml, we have set up a service account and declared a ClusterRole and ClusterRoleBinding to define which actions the OpenTelemetry service account can take on various resources in our Kubernetes cluster. Once you deploy opentelemetry-collector.yaml and opentelemetry-serviceaccount.yaml, data will flow into Skywalking from Envoy, the Kubernetes cluster, kube-state-metrics, and Skywalking (OAP) itself.

Step 5: Observability of Kubernetes Resources and Istio Resources in Skywalking

To check the Skywalking UI, port-forward the Skywalking UI service to a local port (say 8080).
Run the following command:

Shell
kubectl port-forward svc/skywalking-skywalking-helm-ui -n skywalking 8080:80

You can then open the Skywalking UI at localhost:8080. (Note: For generating load against the services and seeing the behavior and performance of the apps, cluster, and Envoy proxies, check out the full video.)

Once you are in the Skywalking UI (refer to the image below), you can select Service Mesh in the left-side menu and then select Control Plane or Data Plane. Skywalking provides all the resource-consumption and observability data of the Istio control and data planes, respectively. The data-plane view provides information about all the Envoy proxies attached to services, with metrics, logs, and traces for each proxy. Refer to the image below, where all the observability details are displayed for just one service proxy. Skywalking also shows the resource consumption of Envoy proxies across namespaces.

Similarly, Skywalking provides all the observability data of the Istio control plane. Note: if you have multiple control planes in different namespaces (or in multiple clusters), you just point them at the Skywalking OAP service. Skywalking provides control-plane metrics such as the number of pilot pushes, ADS monitoring, and more.

Apart from the Istio service mesh, we also configured Skywalking to fetch information about the Kubernetes cluster. As you can see in the image below, Skywalking provides a Kubernetes dashboard with the number of nodes, pods, K8s deployments, services, and containers, along with the resource-utilization metrics of each K8s resource. Similarly, you can drill further down into a service in the Kubernetes cluster and get granular information about its behavior and performance (refer to the images below).
For generating load against the services and seeing the behavior and performance of the apps, cluster, and Envoy proxies, check out the full video.

Benefits of the Istio-Skywalking Integration

There are several benefits of integrating Istio and Apache Skywalking for unified observability:
- Ensure 100% visibility of the technology stack, including apps, sidecars, network, database, OS, etc.
- Reduce the time to find the root cause of issues or anomalies in production (MTTR) by up to 90% with faster troubleshooting.
- Save approximately $2M of lifetime spend on closed-source solutions, complex pricing, and custom integrations.
The ability to measure the internal states of a system by examining its outputs is called observability. A system becomes "observable" when it is possible to estimate its current state using only information from outputs, namely sensor data. You can use observability data to identify and troubleshoot problems, optimize performance, and improve security. In the next few sections, we'll take a closer look at the three pillars of observability: metrics, logs, and traces.

What Is the Difference Between Observability and Monitoring?

"Observability wouldn't be possible without monitoring." Monitoring is a term that closely relates to observability. The major difference is that observability refers to the ability to gain insight into the internal workings of a system, while monitoring refers to the act of collecting data on system performance and behavior. In addition, monitoring doesn't really consider the end goal: it focuses on predefined metrics and thresholds to detect deviations from expected behavior, whereas observability aims to provide a deep understanding of system behavior, allowing exploration and discovery of unexpected issues. In terms of perspective and mindset, monitoring adopts a "top-down" approach with predefined alerts based on known criteria; observability takes a "bottom-up" approach, encouraging open-ended exploration and adaptability to changing requirements.

Observability vs. monitoring at a glance:
- Observability tells you why a system is at fault; monitoring notifies you that a system is at fault.
- Observability acts as a knowledge base to define what needs monitoring; monitoring focuses only on watching systems and detecting faults across them.
- Observability focuses on giving context to data; monitoring is focused on data collection.
- Observability gives a more complete assessment of the overall environment; monitoring keeps track of monitoring KPIs.
- Observability is a traversable map; monitoring is a single plane.
- Observability gives you complete information; monitoring gives you limited information.
Observability creates the potential to monitor different events; monitoring is the process of putting observability to use. Monitoring detects anomalies and alerts you to potential problems. Observability, however, detects issues and helps you understand their root causes and underlying dynamics.

Three Pillars of Observability

Observability, built on the three pillars (metrics, logs, and traces), revolves around the core concept of "events." Events are the fundamental units of monitoring and telemetry, each time-stamped and quantifiable. What distinguishes events is their context, especially in user interactions. For example, when a user clicks "Pay Now" on an e-commerce site, this action is an event expected to complete within seconds. In monitoring tools, "significant events" are key. They trigger:
- Automated alerts: notifying SREs or operations teams
- Diagnostic tools: enabling root-cause analysis
Imagine a server's disk nearing 99% capacity; that is significant, but understanding which applications and users caused it is vital for effective action.

1. Metrics

Metrics serve as numeric indicators, offering insights into a system's health. While some metrics like CPU, memory, and disk usage are obvious system health indicators, numerous other critical metrics can uncover underlying issues. For instance, a gradual increase in OS handles can lead to a system slowdown, eventually necessitating a reboot to restore accessibility. Similar valuable metrics exist throughout the various layers of modern IT infrastructure. Careful consideration is crucial when determining which metrics to continuously collect and how to analyze them effectively. This is where domain expertise plays a pivotal role. While most monitoring tools can detect evident issues, the best ones go further by detecting and alerting on complex problems. It's also essential to identify the subset of metrics that serve as proactive indicators of impending system problems.
For instance, an OS handle leak rarely occurs abruptly. Tracking the gradual increase in the number of handles in use over time makes it possible to predict when the system might become unresponsive, allowing for proactive intervention.

Advantages of metrics:
- Quantitative and intuitive for setting alert thresholds
- Lightweight and cost-effective to store
- Excellent for tracking trends and system changes
- Provide real-time component state data
- Constant overhead cost; not affected by data surges

Challenges of metrics:
- Limited insight into the "why" behind issues
- Lack the context of individual interactions or events
- Risk of data loss in case of collection/storage failure
- Fixed-interval collection may miss critical details
- Excessive sampling can impact performance and costs

2. Logs

Logs frequently contain intricate details about how an application processes requests. Unusual occurrences in these logs, such as exceptions, can signal potential issues within the application. Monitoring these errors and exceptions is a vital aspect of any observability solution. Parsing logs can also reveal valuable insights into the application's performance. Logs often hold insights that may remain elusive when using APIs (application programming interfaces) or querying application databases, and many independent software vendors (ISVs) don't offer alternative methods to access the data available in logs. Therefore, an effective observability solution should enable log analysis and facilitate the capture of log data and its correlation with metric and trace data.
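The handle-leak scenario discussed under Metrics amounts to a simple linear extrapolation: given periodic samples of handles in use, estimate when the count will cross a known limit. A minimal sketch in awk, where the sample data and the 1,000-handle limit are invented purely for illustration:

```shell
# Samples: "<minutes-elapsed> <handles-in-use>", one per line.
# Fit a slope from the first and last samples, then project the
# time at which usage reaches the limit.
printf '0 200\n10 260\n20 320\n30 380\n' |
awk 'NR == 1 { t0 = $1; h0 = $2 }       # remember first sample
             { t1 = $1; h1 = $2 }       # keep updating last sample
     END {
       slope = (h1 - h0) / (t1 - t0)    # handles per minute
       limit = 1000                     # assumed OS handle limit
       eta = t1 + (limit - h1) / slope
       print "ETA (minutes): " eta
     }'
# prints: ETA (minutes): 133.333
```

In practice, the samples would come from your metrics store rather than printf, and a robust fit (least squares over many points) would replace the two-point slope; the point is only that trend data turns a reactive alert into a predictive one.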
Advantages of logs:
- Easy to generate, typically a timestamp plus plain text
- Often require minimal integration effort from developers
- Most platforms offer standardized logging frameworks
- Human-readable, making them accessible
- Provide granular insights for retrospective analysis

Challenges of logs:
- Can generate large data volumes, leading to costs
- Can impact application performance, especially without asynchronous logging
- Retrospective by nature, not proactive
- Persistence challenges in modern architectures
- Risk of log loss in containers and auto-scaling environments

3. Traces

Tracing is a relatively recent development, especially suited to the complex nature of contemporary applications. It works by collecting information from different parts of the application and putting it together to show how a request moves through the system. A trace is represented as spans; for example, span A is the root span, and span B is a child of span A. The primary advantage of tracing lies in its ability to deconstruct end-to-end latency and attribute it to specific tiers or components. While it can't tell you exactly why there's a problem, it's great for figuring out where to look.

Advantages of traces:
- Ideal for pinpointing issues within a service
- Offer end-to-end visibility across multiple services
- Identify performance bottlenecks effectively
- Aid debugging by recording request/response flows
- Provide contextual insights into system behavior

Challenges of traces:
- Limited ability to reveal long-term trends
- Complex systems may yield diverse trace paths
- Don't explain the cause of slow or failing spans
- Add overhead, potentially impacting system performance

Integrating tracing used to be difficult, but with service meshes, it's now much easier. Service meshes handle tracing and stats collection at the proxy level, providing seamless observability across the entire mesh without requiring extra instrumentation from the applications within it.
Each of the components discussed above has its pros and cons, even though one might want to use them all.

Observability Tools

Observability tools gather and analyze data related to user experience, infrastructure, and network telemetry to proactively address potential issues, preventing any negative impact on critical business key performance indicators (KPIs).

Observability Survey Report 2023 - key findings

Some popular observability tooling options include:
- Prometheus: A leading open-source monitoring and alerting toolkit known for its scalability and support for multi-dimensional data collection
- Grafana: A visualization and dashboarding platform often used with Prometheus, providing rich insights into system performance
- Jaeger: An open-source distributed tracing system for monitoring and troubleshooting microservices-based architectures
- Elasticsearch: A search and analytics engine that, when paired with Kibana and Beats, forms the ELK Stack for log management and analysis
- Honeycomb: An event-driven observability tool that offers real-time insights into application behavior and performance
- Datadog: A cloud-based observability platform that integrates logs, metrics, and traces, providing end-to-end visibility
- New Relic: Offers application performance monitoring (APM) and infrastructure monitoring solutions to track and optimize application performance
- Sysdig: Focused on container monitoring and security, Sysdig provides deep visibility into containerized applications
- Zipkin: An open-source distributed tracing system for monitoring request flows and identifying latency bottlenecks

Conclusion

Logs, metrics, and traces are essential observability pillars that work together to provide a complete view of distributed systems. Incorporating them strategically, such as placing counters and logs at entry and exit points and using traces at decision junctures, enables effective debugging.
Correlating these signals enhances our ability to navigate metrics, inspect request flows, and troubleshoot complex issues in distributed systems.
This is an article from DZone's 2023 Database Systems Trend Report. For more: Read the Report.

Hearing the vague statement, "We have a problem with the database," is a nightmare for any database manager or administrator. Sometimes it's true, sometimes it's not; so what exactly is the issue? Is there really a database problem? Or is it a problem with networking, an application, a user, or another possible scenario? If it is the database, what is wrong with it?

Figure 1: DBMS usage

Databases are a crucial part of modern businesses, and there are a variety of vendors and types to consider. Databases can be hosted in a data center, in the cloud, or in both for hybrid deployments. The data stored in a database can be used in various ways, including websites, applications, analytical platforms, etc. As a database administrator or manager, you want to be aware of the health and trends of your databases. Database monitoring is as crucial as the databases themselves: how good is your data if you can't guarantee its availability and accuracy?

Database Monitoring Considerations

Database engines and databases are systems hosted on a complex IT infrastructure consisting of a variety of components: servers, networking, storage, cables, etc. Database monitoring should be approached holistically, with consideration of all infrastructure components as well as database monitoring itself.

Figure 2: Database monitoring clover

Let's talk more about database monitoring. As seen in Figure 2, I'd combine monitoring into four pillars: availability, performance, activity, and compliance. These are broad but interconnected pillars with overlap. You could add a fifth "clover leaf" for security monitoring, but I include that aspect of monitoring in activity and compliance, for the same reason capacity planning falls into availability monitoring. Let's look deeper into monitoring concepts.
While availability monitoring seems like a good starting topic, I will deliberately start with performance, since performance issues may render a database unavailable, and because availability monitoring is "monitoring 101" for any system.

Performance Monitoring

Performance monitoring is the process of capturing, analyzing, and alerting on the performance metrics of the hardware, OS, network, and database layers. It can help avoid unplanned downtime, improve user experience, and help administrators manage their environments efficiently.

Native Database Monitoring

Most, if not all, enterprise-grade database systems come with a set of tools that allow database professionals to examine internal and/or external database conditions and operational status. These are system-specific, technical tools that require SME-level knowledge. In most cases, they provide point-in-time performance data with limited or nonexistent historical value. Some vendors provide additional tools to simplify performance data collection and analysis. With the expansion of cloud-based offerings (PaaS or IaaS), I've noticed some improvements in monitoring data collection and the available analytics and reporting options. However, native performance monitoring is still a set of tools for a database SME.

Enterprise Monitoring Systems

Enterprise monitoring systems (EMSs) offer a centralized approach to keeping IT systems under systematic review. Such systems allow monitoring of most IT infrastructure components, thus consolidating supervised systems into a set of dashboards. There are several vendors offering comprehensive database monitoring systems to cover some or all of your monitoring needs. Such solutions can cover multiple database engines or be specific to a particular database engine or monitoring aspect. For instance, if you only need to monitor SQL Server and are interested in the performance of your queries, then you need a monitoring system that identifies bottlenecks and contention.
Let's discuss environments with thousands of database instances (on-premises and in the cloud) scattered across multiple data centers around the globe. Monitoring complexity grows with the number of monitored devices, the diversity of database types, and the geographical distribution of your data centers and the actual data that you monitor. It is imperative to have a global view of all database systems under one management umbrella and an ability to identify issues, preferably before they impact your users. EMSs are designed to help organizations align database monitoring with IT infrastructure monitoring, and most solutions include an out-of-the-box set of dashboards, reports, graphs, alerts, useful tips, and health history and trend analytics. They also come with preset, industry-outlined thresholds for performance counters/metrics that should be adjusted to your specific conditions.

Manageability and Administrative Overhead

Native database monitoring is usually handled by a database administrator (DBA) team. If it needs to be automated, expanded, or otherwise modified, the DBA/development teams would handle that. In a large enterprise environment, this can be managed efficiently by DBAs at a rudimentary level for internal, DBA-specific use cases. Bringing in a third-party system (like an EMS) requires management. Hypothetically, a vendor has installed and configured monitoring for your company; that partnership can continue, or internal personnel can take over EMS management (with appropriate training). There is no "wrong" approach; it depends solely on your company's operating model and should be assessed accordingly.

Data Access and Audit Compliance Monitoring

Your databases must be secure! Unauthorized access to sensitive data can be as harmful as data loss. Data breaches and malicious activities, intentional or not: no company would be happy with such publicity. That brings us to audit compliance and data access monitoring.
There are many laws and regulations around data compliance. Some are common between industries, some are industry-specific, and some are country-specific. For instance, SOX compliance is required for all public companies in numerous countries, and US healthcare must follow HIPAA regulations. Database management teams must implement a set of policies, procedures, and processes to enforce the laws and regulations applicable to their company. Audit reporting can be a tedious and cumbersome process, but it can and should be automated. While implementing audit compliance and data access monitoring, you can improve your database audit reporting as well; it's virtually the same data set.

What do we need to monitor to comply with various laws and regulations? These are normally mandatory:
- Access changes and access attempts
- Settings and/or object modifications
- Data modifications/access
- Database backups

Who should be monitored? Usually, access to make changes to a database or data is strictly controlled:
- Privileged accounts: usually DBAs; ideally, they shouldn't be able to access data, but that is not always possible in their job, so their activity must be monitored
- Service accounts: either database or application service accounts with rights to modify objects or data
- "Power" accounts: users with rights to modify database objects or data
- "Lower" accounts: accounts with read-only activity

As with performance monitoring, most database engines provide a set of auditing tools and mechanisms. Another option is third-party compliance software, which uses database-native auditing, logs, and tracing to capture compliance-related data. It provides audit data storage capabilities and, most importantly, a set of compliance reports and dashboards to adhere to a variety of compliance policies. Compliance complexity directly depends on the regulations that apply to your company and the diversity and size of your database ecosystem.
While we monitor access and compliance, we want to ensure that our data is not being misused. Adequate measures should be in place for when unauthorized access or abnormal data usage is detected. Some audit compliance monitoring systems provide means to block abnormal activities.

Data Corruption and Threats

Database data corruption is a serious issue that can lead to permanent loss of valuable data. Commonly, data corruption occurs due to hardware failures, but it can also be due to database bugs or even bad coding. Modern database engines have built-in capabilities to detect, and sometimes prevent, data corruption. Data corruption generates an appropriate error code that should be monitored and highlighted, and checking database integrity should be part of the periodic maintenance process. Other threats include intentional or unintentional data modification and ransomware. While data corruption and malicious data modification can be detected by DBAs, ransomware threats fall outside of the monitoring scope for database professionals. It is imperative to have a bulletproof backup to recover from those threats.

Key Database Performance Metrics

Database performance metrics are extremely important data points that measure the health of database systems and help database professionals maintain efficient support. Some of the metrics are specific to a database type or vendor, and I will generalize them as "internal counters."

Availability

The first step in monitoring is to determine whether a device or resource is available. There is a thin line between system and database availability: a database could be up and running, but clients may not be able to access it. With that said, we need to monitor the following metrics:
- Network status: Can you reach the database over the network? If yes, what is the latency?
While network status may not commonly fall into the direct responsibility of a DBA, database components have configuration parameters that might be responsible for a loss of connectivity.
- Server up/down
- Storage availability
- Service up/down: another shared area between database and OS support teams
- Whether the database is online or offline

CPU, Memory, Storage, and Database Internal Metrics

The next important set of server components, which could in essence escalate into an availability issue, is CPU, memory, and storage. The following four performance areas are tightly interconnected and affect each other:
- Lack of available memory
- High CPU utilization
- Storage latency or throughput bottlenecks
- A set of database internal counters that can provide more context to utilization issues

For instance, a lack of memory may force a database engine to read and write data more frequently, creating contention on the I/O system, and 100% CPU utilization can often cause an entire database server to stop responding. Numerous database internal counters can help database professionals analyze usage trends and identify appropriate actions to mitigate potential impact.

Observability

Database observability is based on the metrics, traces, and logs we collected per the discussion above. There are a plethora of factors that may affect system and application availability and customer experience; database performance metrics are just one set of possible failure points. The infrastructure underneath a database engine is complex. To successfully monitor a database, we need a clear picture of the entire ecosystem and the state of its components. Relevant performance data collected from the various components can be a tremendous help in identifying and addressing issues before they impact users. The entire database monitoring concept is data-driven, and it is our responsibility to make it work for us.
Monitoring data needs to tell us a story that every consumer can understand. With database observability, this story can be transparent and provide a clear view of your database estate. Balanced Monitoring As you can gather from this article, there are many points of failure in any database environment. While database monitoring is the responsibility of database professionals, it is a collaborative effort of multiple teams to ensure that your entire IT ecosystem is operational. So what's considered "too much" monitoring, and when is it not enough? I will use DBAs' favorite phrase: it depends. Assess your environment – It would be helpful to have a configuration management database. If you don't, create a full inventory of your databases and corresponding applications: database sizes, number of users, maintenance schedules, utilization times — as many details as possible. Assess your critical systems – Outline your critical systems and relevant databases. Most likely those will fall into a category of maximum monitoring: availability, performance, activity, and compliance. Assess your budget – It's not uncommon to have a tight cash flow allocated to IT operations. You may or may not have funds to purchase a "we-monitor-everything" system, and certain monitoring aspects would have to be developed internally. Find a middle ground – Your approach to database monitoring is unique to your company's requirements. Collecting monitoring data that has no practical or actionable applications is not efficient. Defining actionable KPIs for your database monitoring is key to finding a balance — monitor what your team can use to ensure systems availability, stability, and satisfied customers. Remember: Successful database monitoring is data-driven, proactive, continuous, actionable, and collaborative. This is an article from DZone's 2023 Database Systems Trend Report. For more: Read the Report
This is an article from DZone's 2023 Data Pipelines Trend Report. For more: Read the Report Organizations today rely on data to make decisions, innovate, and stay competitive. That data must be reliable and trustworthy to be useful. Many organizations are adopting a data observability culture that safeguards their data accuracy and health throughout its lifecycle. This culture involves putting in motion a series of practices that enable you and your organization to proactively identify and address issues, prevent potential disruptions, and optimize your data ecosystems. When you embrace data observability, you protect your valuable data assets and maximize their effectiveness. Understanding Data Observability "In a world deluged by irrelevant information, clarity is power." – Yuval Noah Harari, 21 Lessons for the 21st Century, 2018 As Yuval Noah Harari puts it, data is an incredibly valuable asset today. As such, organizations must ensure that their data is accurate and dependable. This is where data observability comes in — but what is data observability, exactly? Data observability is the means to ensure our data's health and accuracy, which means understanding how data is collected, stored, processed, and used, plus being able to discover and fix issues in real time. By doing so, we can optimize our system's effectiveness and reliability by identifying and addressing discrepancies while ensuring compliance with regulations like GDPR or CCPA. We can gather valuable insights that prevent errors from recurring in the future by taking such proactive measures. Why Is Data Observability Critical? Data reliability is vital. We live in an era where data underpins crucial decision-making processes, so we must safeguard it against inaccuracies and inconsistencies to ensure our information is trustworthy and precise.
Data observability allows organizations to proactively identify and address issues before they can spread downstream, preventing potential disruptions and costly errors. One of the advantages of practicing data observability is that it'll ensure your data is reliable and trustworthy. This means continuously monitoring your data to avoid making decisions based on incomplete or incorrect information, giving you more confidence.
Figure 1: The benefits of companies using analytics (Data source: The Global State of Enterprise Analytics, 2020, MicroStrategy)
Analyzing your technology stack can also help you find inefficiencies and areas where resources are underutilized, saving you money. But incorporating automation tools into your data observability process is the cherry on top of the proverbial cake, making everything more efficient and streamlined. Data observability is a long-term approach to safeguarding the integrity of your data so that you can confidently harness its power, whether it's for informed decision-making, regulatory compliance, or operational efficiency. Advantages and Disadvantages of Data Observability When making decisions based on data, it's essential to be quick. But what if the data isn't dependable? That's where data observability comes in. However, like any tool, it has its advantages and disadvantages.
IMPLEMENTING DATA OBSERVABILITY: ADVANTAGES AND DISADVANTAGES
Advantage – Trustworthy insights for intelligent decisions: Data observability provides decision-makers with reliable insights, ensuring well-informed choices in business strategy, product development, and resource allocation. | Disadvantage – Resource-intensive setup: Implementing data observability demands time and resources to set up tools and processes, but the long-term benefits justify the initial costs.
Advantage – Real-time issue prevention: Data observability acts as a vigilant guardian for your data, instantly detecting issues and averting potential emergencies, thus saving time and resources while maintaining data reliability. | Disadvantage – Computational overhead from continuous monitoring: Balancing real-time monitoring with computational resources is essential to optimize observability.
Advantage – Enhanced team alignment through shared insights: Data observability fosters collaboration by offering a unified platform for teams to gather, analyze, and act on data insights, facilitating effective communication and problem-solving. | Disadvantage – Training requirements for effective tool usage: Data observability tools require skill, necessitating ongoing training investments to harness their full potential.
Advantage – Accurate data for sustainable planning: Data observability establishes the foundation for sustainable growth by providing dependable data that's essential for long-term planning, including forecasting and risk assessment. | Disadvantage – Privacy compliance challenges: Maintaining data observability while adhering to strict privacy regulations like GDPR and CCPA can be intricate, requiring a delicate balance between data visibility and privacy compliance.
Advantage – Resource savings: Data observability allows you to improve how resources are allocated by identifying areas where your technology stack is inefficient or underutilized. As a result, you can save costs and prevent over-provisioning resources, leading to a more efficient and cost-effective data ecosystem. | Disadvantage – Integration complexities: Integrating data observability into existing data infrastructure may pose challenges due to compatibility issues and legacy systems, potentially necessitating investments in specific technologies and external expertise for seamless integration.
Table 1
To sum up, data observability has both advantages and disadvantages, such as providing reliable data, detecting real-time problems, and enhancing teamwork.
However, it requires significant time, resources, and training while respecting data privacy. Despite these challenges, organizations that adopt data observability are better prepared to succeed in today's data-driven world and beyond. Cultivating a Data-First Culture Data plays a crucial role in today's fast-paced and competitive business environment. It enables informed decision-making and drives innovation. To achieve this, it's essential to cultivate an environment that values data. This culture should prioritize accuracy, dependability, and consistent monitoring throughout the data's lifecycle. To ensure effective data observability, strong leadership is essential. Leaders should prioritize data from the top down, allocate necessary resources, and set a clear vision for a data-driven culture. This leadership fosters team collaboration and alignment, encouraging teams to work together towards the same objectives. When teams collaborate in a supportive work environment, critical data is properly managed and utilized for the organization's benefit. Technical teams and business users must work together to create a culture that values data. Technical teams build the foundation of data infrastructure, while business users access data to make decisions. Collaboration between these teams leads to valuable insights that drive business growth.
Figure 2: Data generated, gathered, copied, and consumed (Data source: Data and Analytics Leadership Annual Executive Survey 2023, NewVantage Partners)
By leveraging data observability, organizations can make informed decisions, address issues quickly, and optimize their data ecosystem for the benefit of all stakeholders. Nurturing Data Literacy and Accountability Promoting data literacy and accountability is not only a matter of efficiency but also an ethical consideration.
Assigning both ownership and accountability for data management empowers people to make informed decisions based on data insights, strengthens transparency, and upholds principles of responsibility and integrity, ensuring accuracy, security, and compliance with privacy regulations. A data-literate workforce is a safeguard, identifying instances where data may be misused or manipulated for unethical purposes.
Figure 3: The state of data responsibility and data ethics (Data source: Amount of data created, consumed, and stored 2010–2020, with forecasts to 2025, 2023, Statista)
Overcoming Resistance To Change Incorporating observability practices is often a considerable challenge, and facing resistance from team members is not uncommon. However, you should confront these concerns and communicate clearly to promote a smooth transition. You can encourage the adoption of data-driven practices by highlighting the long-term advantages of better data quality and observability, which might inspire your coworkers to welcome changes. Showcasing real-life cases of positive outcomes, like higher revenue and customer satisfaction, can also help make a case. Implementing Data Observability Techniques You can keep your data pipelines reliable and at a high quality by implementing data observability. This implementation involves using different techniques and features that will allow you to monitor and analyze your data. Those processes include data profiling, anomaly detection, lineage, and quality checks. These tools will give you a holistic view of your data pipelines, allowing you to monitor their health and quickly identify any issues or inconsistencies that could affect their performance. Essential Techniques for Successful Implementation To ensure the smooth operation of pipelines, you must establish a proper system for monitoring, troubleshooting, and maintaining data. Employing effective strategies can help achieve this goal. Let's review some key techniques to consider.
Connectivity and Integration For optimal data observability, your tools must integrate smoothly with your existing data stack. This integration should not require major modifications to your pipelines, data warehouses, or processing frameworks. This approach allows for an easy deployment of the tools without disrupting your current workflows. Data Monitoring at Rest Observability tools should be able to monitor data while it's at rest without needing to extract it from the current storage location. This method ensures that the monitoring process doesn't affect the speed of your data pipelines and is cost-effective. Moreover, this approach makes your data safer as it doesn't require extraction. Automated Anomaly Detection Automated anomaly detection is an important component of data observability. Machine learning models identify patterns and behaviors in the data, enabling alerts to be sent when unexpected deviations occur, reducing the number of false positives and alleviating the workload of data engineers who would otherwise have to manage complex monitoring rules. Dynamic Resource Identification Data observability tools give you complete visibility into your data ecosystem. These tools should automatically detect important resources, dependencies, and invariants. They should be flexible enough to adapt to changes in your data environment, giving you insights into vital components without constant manual updates and making data observability extensive and easy to configure. Comprehensive Contextual Information For effective troubleshooting and communication, data observability needs to provide comprehensive contextual information. This information should cover data assets, dependencies, and reasons behind any data gaps or issues. Having the full context will allow data teams to identify and resolve any reliability concerns quickly.
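To make the anomaly-detection idea concrete, here is a deliberately simple baseline-deviation check in Python. Real observability platforms use far more sophisticated models (seasonality-aware, multivariate), but the core pattern — learn a baseline, flag values that deviate too far from it — can be sketched with a z-score. The threshold of 3 standard deviations is an illustrative assumption.

```python
from statistics import mean, stdev

def detect_anomaly(history, value, z_threshold: float = 3.0) -> bool:
    """Flag `value` as anomalous if it deviates from the historical baseline
    by more than `z_threshold` standard deviations."""
    if len(history) < 2:
        return False  # not enough history to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # Constant history: anything different at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > z_threshold
```

In practice the alerting layer would also suppress repeats and attach context (which table, which pipeline run) so engineers are not flooded with raw flags — this is where the false-positive reduction mentioned above comes in.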
Preventative Measures Data observability involves monitoring data assets and offers preventive measures to avoid potential issues. With insights into your data and suggestions for responsible alterations or revisions, you can proactively address problems before they affect data pipelines. This approach leads to greater efficiency and time savings in the long run. If you need to keep tabs on data, it can be tough to ensure everything is covered. Only using batch and stream processing frameworks isn't enough. That's why it's often best to use a tool specifically made for this purpose. You could use a data platform, add it to your existing data warehouse, or opt for open-source tools. Each of these options has its own advantages and disadvantages: Use a data platform – Data platforms are designed to manage all of your organization's data in one place and grant access to that data through APIs instead of via the platform itself. There are many benefits to using a data platform, including speed, easy access to all your organization's information, flexible deployment options, and increased security. Additionally, many platforms include built-in capabilities for data observability, so you can ensure your databases perform well without having to implement an additional solution. Build data observability into your existing platform – If your organization only uses one application or tool to manage its data, this approach is probably the best for you, provided it includes an observability function. Incorporating data observability into your current setup is a must-have if you manage complex data stored in multiple sources, thus improving the reliability of your data flow cycle. Balancing Automation and Human Oversight Figure 4: Balancing automation and human oversight While automation is a key component of data observability, it's important to strike a balance between automation and human oversight.
While automation can help with routine tasks, human expertise is necessary for critical decisions and ensuring data quality. Implementing data observability techniques involves seamless integration, automated anomaly detection, dynamic resource identification, and comprehensive contextual information. Balancing automation and human oversight is important for efficient and effective data observability, resulting in more reliable data pipelines and improved decision-making capabilities. Conclusion In conclusion, data observability empowers organizations to thrive in a world where data fuels decision-making by ensuring data's accuracy, reliability, and trustworthiness. We can start by cultivating a culture that values data integrity, collaboration between technical and business teams, and a commitment to nurturing data literacy and accountability. You will also need a strong data observability framework to monitor your data pipelines effectively. This includes a set of techniques that will help identify issues early and optimize your data ecosystems. But automated processes aren't enough, and we must balance our reliance on automation with human oversight, recognizing that while automation streamlines routine tasks, human expertise remains invaluable for critical decisions and maintaining data quality. With data observability, data integrity is safeguarded, and its full potential is unlocked — leading to innovation, efficiency, and success.
As previously mentioned, last week I was on-site at the PromCon EU 2023 event for two days in Berlin, Germany. This is a community-organized event focused on the technology and implementations around the open-source Prometheus project, including, for example, PromQL and PromLens. Below you'll find an overview covering insights into the talks given, often with a short recap if you don't want to browse the details. Along with the talks, it was invaluable to have the discussions and chats that happen in the breaks between talks, where you can connect with core maintainers of various aspects of the Prometheus project. Be sure to keep an eye on the event video playlist, as all sessions were recorded and will appear there. Let's dive right in and see what the event had to offer this year in Berlin. This overview will be my impressions of each day of the event, but not all the sessions will be covered. Let's start with a short overview of the insights taken from sessions, chats, and the social event: OpenTelemetry interoperability (in all flavors) is the hot topic of the year. Native Histograms were a big topic the last two years; this year they showed a lot of promise here and there, but were not a major focus of the talks. The Perses dashboard and visualization project presented its Alpha release as a truly open-source project based on the Apache 2.0 license. By my count, there were ~150 attendees, and they also live-streamed all talks/lightning talks, which will also be made available on their YouTube channel post-event. Day 1 The day started with a lovely walk through the center of Berlin to the venue located on the Spree River. The event opened and jumped right into the following series of talks (insights provided inline): What's New in Prometheus and Its Ecosystem Native Histograms - Efficiency and more details Documentation note on prometheus.io: "...Native histograms (added as an experimental feature in Prometheus v2.40).
Once native histograms are closer to becoming a stable feature, this document will be thoroughly updated." stringlabels - Storing labels differently for a significant memory reduction keep_firing_for field added to alerting rules - How long an alert will continue firing after the condition has cleared scrape_config_files - Split Prometheus scrape configs into multiple files, avoiding big monolithic config files OTLP receiver (v2.47) - Experimental support for receiving OTLP metrics SNMP Exporter (v0.24) - Breaking changes: new configuration format; splits connection settings from metrics details, simpler to change. Also added the ability to query multiple modules in just one scrape. MySQLd Exporter (v0.15) - Multi-target support; use a single exporter to monitor multiple MySQL-like servers Java client (v1.0.0) - client_java with OpenTelemetry metrics and tracing support, Native Histograms Alertmanager - New receivers: MS Teams, Discord, Webex Windows Exporter - Now an official exporter; was delayed due to licensing but is in the final stages now Every Tuesday, Prometheus meets for Bug Scrub at 11:00 UTC. Calendar: https://prometheus.io/community. What's Coming New Alertmanager UI Metadata Improvements Exemplar Improvements Remote Write v2 Perses: The CNCF Candidate for Observability Visualization Summary An announcement was given of the Alpha launch of the Perses dashboard and visualization project with GitOps compatibility - purpose-built for observability data; a truly open-source alternative with the Apache 2.0 license. Perses was born from the CNCF landscape's lack of visualization tooling projects: Perses - An exploration of a standard dashboard format Chronosphere, Red Hat, and Amadeus are displayed as founding members GitOps friendly, static validation, Kubernetes support; you can use the Perses binary in your development environment Chronosphere supported its development, and Red Hat is integrating the Perses package into the OpenShift Console.
There is an exploration of its usage with Prometheus/PromLens. Currently it only displays metrics, but Red Hat is working on integrating tracing with OpenTelemetry; logs are on the future wishlist. Feature details were presented for the development of dashboards Includes Grafana migration tooling I was chatting with core maintainer Augustin Husson after the talk, and they are interested in submitting Perses as an applicant for CNCF Sandbox status. Towards Making Prometheus OpenTelemetry Native Summary OpenTelemetry protocol (OTLP) support in Prometheus for metrics ingestion is experimental. Details on the Effort OTLP ingestion is there experimentally. The experience with target_info is a big pain point at the moment. Takes about half the bandwidth of remote write, 30-40% more CPU due to gzip The new Arrow-based OTLP protocol promises half the bandwidth again at half the CPU cost; may inspire Prometheus remote write 2.0. There is a GitHub milestone to track this. Thinking about using collector remote config to solve "split configuration" between the Prometheus server and OpenTelemetry clients Planet Scale Monitoring: Handling Billions of Active Series With Prometheus and Thanos Summary Shopify states they are running "highly scalable globally distributed and highly dynamic" cloud infrastructure, so they are on "Planet Scale" with Prometheus.
Details on the Effort Huge Ruby shop, latency-sensitive, with large scaling events around the retail cycle and flash sales HPA struggles with scaling up quickly enough Using StatsD to get around Ruby/Python/PHP-specific limitations on shared counters Backend is Thanos-based, but they have added a lot on top of it (custom work) Have a custom operator to scale Prometheus agents by scraping the targets and seeing how many time series they have (including redistribution) Have a router layer on top of Thanos to decouple ingestion and storage; sounds like they're evolving into a Mimir-like setup Split the query layer into two deployments: one for short-term queries and one for longer-term queries Team- and service-centric UI for alerting, integrated with SLO tracking Native histograms solved cardinality challenges and, combined with Thanos' distributed querier, made very high cardinality queries work; as they stated, "This changed the game for us." When migrating from the previous observability vendor, they decided not to convert dashboards; instead, they worked with developers to build new, cleaner ones. Developers are not scoping queries well, so most fan out to all regional stores, but performance on empty responses is satisfactory, so it's not a big issue. Lightning Talks Summary It's always fun to end the day with a quick series of talks that are ad-hoc collected from the attendees. Below is a list of ones I thought were interesting as well as a short summary, should you want to find them in the recordings: AlertManager UI: Alertmanager will get a new UI in React. Elm didn't get traction as a common language; considering alternatives to Bootstrap Implementing integrals with Prometheus and Grafana: Integrals in PromQL – the inverse of rates; a pure-PromQL version of the delta counter, using sum_over_time and Grafana variables to simplify getting all the right factors.
Metrics have a DX Problem: Looking at how to do developer-focused metrics from the IDE using the autometrics-dev project on GitHub; a framework for instrumenting by function, with IDE integration to explore prod metrics; an interesting idea to integrate this deeply Day 2 After the morning walk through the center of Berlin, day two provided us with some interesting material (insights provided inline): Taming the Tsunami: Low Latency Ingestion of Push-Based Metrics in Prometheus Summary Overview of the metrics story at Shopify, with over 1k teams running it: Originally forwarding metrics "from observability vendor agent" Issues because that was multiplying the cardinality across exporter instances; same with the sidecar model Built a StatsD protocol-aware load balancer Running as a sidecar also had ownership issues, stating, "We would be on call for every application" DaemonSet deployment meant resource usage and hot-spotting concerns; also cardinality, but at a lower level Didn't want per-instance metrics because of cardinality, and metrics are more domain-level Roughly one exporter per 50-100 nodes Load balancer sanitizes label values and drops labels Pre-aggregation on short time scales to deal with "hot loop instrumentation"; this resulted in roughly a 20x reduction in bandwidth use Compensating for the lack of per-instance metrics by looking at infrastructure metrics (KSM, cAdvisor) "We have close to a thousand teams right now" Prometheus Java Client 1.0.0 Summary V1.0.0 was released last week. This talk was an overview of some of their updates featuring native histograms and OpenTelemetry support. Rewrote the underlying model, so there are breaking changes, with a migration module for Prometheus simpleclient metrics. JavaDoc can be found here.
Almost as simple as updating imports in your Java app to use; I'm going to update my workshop Java example for instrumentation to the new API Includes good examples in the project Exposes native + classic histograms by default, scraper's choice A lot more configurations available as Java properties Callback metrics (this is great for writing exporters) OTel push support (on a configurable interval) Allows standard OTel names (with dots), automatically replaces dots with underscores for the Prometheus format Integrates with the OTel tracing client to make exemplars work - picks exemplars from the tracing context, extends the tracing context to mark that trace so it does not get sampled away Despite supporting OTel, this is still a performance-minded client library All metric types support concurrent updates Dropped Pushgateway support for now, but will port it forward The JMX exporter will pick up these changes as a side effect once it is updated Not aiming to become a full OTel library, only future-proofing your instrumentation; more lightweight and performance-focused Lightning Talks Summary Again, here is a list of lightning talks I thought were interesting from the final day and a short summary, should you want to find them in the recordings: Tracking object storage costs Trying to measure object storage costs, as they are the number 2 cost in their cloud bills; built a Prometheus Price Exporter Object storage cost is ~half of Grafana's cloud bill; varies by customer (can be as low as 2%) Trick for extending sparse metrics with zeroes: or on() vector(0) They have a prices exporter in the works; promised to open source it Prom operator - what's next?
A tour of some more features coming in the Prometheus operator: shard autoscaling, scrape classes, support for Kubernetes events, and Prometheus agent deployment as a DaemonSet Prometheus adoption stats 868k users in 2023 (up from 774k last year), based on Grafana instances which have at least one Prometheus data source enabled Final Impressions For the second straight year, this event left me with the feeling that the attendees were both passionate and knowledgeable about the metrics monitoring tooling around the Prometheus ecosystem. This event did not really have "getting started" sessions. Most of it assumes you are coming for in-depth dives into the various elements of the Prometheus project, almost giving you glimpses into the research progress behind features being improved in the coming versions of Prometheus. It remains well worth your time if you are active in the monitoring world, even if you are not using open source or Prometheus: you will gain insights into the status of features in the monitoring world.
This data warehousing use case is about scale. The user is China Unicom, one of the world's biggest telecommunication service providers. Using Apache Doris, they deploy multiple petabyte-scale clusters on dozens of machines to support the 15 billion daily log additions from their over 30 business lines. Such a gigantic log analysis system is part of their cybersecurity management. To meet their needs for real-time monitoring, threat tracing, and alerting, they require a log analytic system that can automatically collect, store, analyze, and visualize logs and event records. From an architectural perspective, the system should be able to undertake real-time analysis of various formats of logs, and of course, be scalable to support the huge and ever-enlarging data size. The rest of this article is about what their log processing architecture looks like and how they realize stable data ingestion, low-cost storage, and quick queries with it. System Architecture This is an overview of their data pipeline. The logs are collected into the data warehouse and go through several layers of processing. ODS: Original logs and alerts from all sources are gathered into Apache Kafka. Meanwhile, a copy of them is stored in HDFS for data verification or replay. DWD: This is where the fact tables are. Apache Flink cleans, standardizes, backfills, and de-identifies the data, and writes it back to Kafka. These fact tables are also put into Apache Doris, so that Doris can trace a certain item or use them for dashboarding and reporting. As logs are tolerant of duplication, the fact tables are arranged in the Duplicate Key model of Apache Doris. DWS: This layer aggregates data from DWD and lays the foundation for queries and analysis. ADS: In this layer, Apache Doris auto-aggregates data with its Aggregate Key model and auto-updates data with its Unique Key model. Architecture 2.0 evolved from Architecture 1.0, which was supported by ClickHouse and Apache Hive.
The transition arose from the user's needs for real-time data processing and multi-table join queries. In their experience with ClickHouse, they found inadequate support for concurrency and multi-table joins, manifested by frequent timeouts in dashboarding and OOM errors in distributed joins. Now let's take a look at their practice in data ingestion, storage, and queries with Architecture 2.0. Real-Case Practice Stable Ingestion of 15 Billion Logs Per Day In the user's case, their business churns out 15 billion logs every day. Ingesting such a data volume quickly and stably is a real problem. With Apache Doris, the recommended way is to use the Flink-Doris-Connector. It is developed by the Apache Doris community for large-scale data writing. The component requires simple configuration. It implements Stream Load and can reach a writing speed of 200,000~300,000 logs per second without interrupting the data analytic workloads. A lesson learned is that when using Flink for high-frequency writing, you need to find the right parameter configuration for your case to avoid data version accumulation. In this case, the user made the following optimizations: Flink Checkpoint: They increased the checkpoint interval from 15s to 60s to reduce writing frequency and the number of transactions processed by Doris per unit of time. This can relieve data writing pressure and avoid generating too many data versions. Data Pre-Aggregation: For data that has the same ID but comes from various tables, Flink pre-aggregates it based on the primary key ID and creates a flat table, in order to avoid excessive resource consumption caused by multi-source data writing.
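The pre-aggregation step above can be sketched in plain Python to show the idea: records that share a primary-key ID are merged into one flat record before writing, so the warehouse sees a single write per ID instead of one per source table. In the user's pipeline this happens inside Flink; the field names below (`id`, `src_ip`, `event`) are hypothetical placeholders.

```python
from collections import defaultdict

def preaggregate(records):
    """Merge records sharing a primary-key 'id' (e.g., rows arriving from
    several source tables) into one flat record per id, so only a single
    write per id reaches the warehouse."""
    merged = defaultdict(dict)
    for rec in records:
        rid = rec["id"]
        merged[rid]["id"] = rid
        for key, value in rec.items():
            if key != "id":
                # Later records overwrite earlier fields of the same name,
                # mirroring a last-write-wins flattening policy.
                merged[rid][key] = value
    return list(merged.values())
```

The design choice here is to trade a small amount of buffering in the stream processor for far fewer (and larger) writes downstream, which is exactly what keeps Doris's transaction count and data-version count under control.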
Doris Compaction: The trick here includes finding the right Doris backend (BE) parameters to allocate the right amount of CPU resources for data compaction, setting the appropriate number of data partitions, buckets, and replicas (too many data tablets bring huge overheads), and dialing up max_tablet_version_num to avoid version accumulation. These measures together ensure daily ingestion stability. The user has witnessed stable performance and a low compaction score in the Doris backend. In addition, the combination of data pre-processing in Flink and the Unique Key model in Doris can ensure quicker data updates. Storage Strategies to Reduce Costs by 50% The size and generation rate of logs also impose pressure on storage. Among the immense log data, only a part of it is of high informational value, so storage should be differentiated. The user has three storage strategies to reduce costs. ZSTD (ZStandard) compression algorithm: For tables larger than 1TB, specify the compression method as "ZSTD" upon table creation; this realizes a compression ratio of 10:1. Tiered storage of hot and cold data: This is supported by a new feature of Doris. The user sets a data "cooldown" period of 7 days. That means data from the past 7 days (namely, hot data) is stored on SSD. As hot data "cools down" (gets older than 7 days), it is automatically moved to HDD, which is less expensive. As data gets even "colder," it is moved to object storage for much lower storage costs. Plus, in object storage, data is stored with only one copy instead of three. This further cuts down costs and the overheads brought by redundant storage. Differentiated replica numbers for different data partitions: The user has partitioned their data by time range. The principle is to have more replicas for newer data partitions and fewer for the older ones. In their case, data from the past 3 months is frequently accessed, so they have 2 replicas for this partition.
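The tiering policy described above is essentially an age-based classification, which can be sketched as follows. The 7-day cooldown comes from the article; the 90-day cutoff for moving from HDD to object storage is an assumed value for illustration only, since the article does not state when data becomes "even colder."

```python
def storage_tier(age_days: int, cooldown_days: int = 7, cold_days: int = 90) -> str:
    """Classify data by age: hot data on SSD, cooled data on HDD, and the
    oldest data in object storage. cold_days is an assumed cutoff."""
    if age_days <= cooldown_days:
        return "SSD"            # hot: frequently queried, lowest latency
    if age_days <= cold_days:
        return "HDD"            # warm: cheaper, still locally attached
    return "object_storage"     # cold: cheapest, single copy instead of three
```

In Doris itself this movement is automatic once the cooldown policy is configured; the sketch just makes the decision boundaries explicit.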
Data that is 3~6 months old has two replicas, and data older than 6 months has a single copy. With these three strategies, the user has reduced their storage costs by 50%.

Differentiated Query Strategies Based on Data Size

Some logs must be immediately traced and located, such as those of abnormal events or failures. To ensure real-time response to these queries, the user applies different query strategies for different data sizes:

Less than 100G: The user utilizes the dynamic partitioning feature of Doris. Small tables are partitioned by date and large tables by hour, which avoids data skew. To further ensure the balance of data within a partition, they use the snowflake ID as the bucketing field. They also set a starting offset of 20 days, which means data of the most recent 20 days is kept. In this way, they find the balance point between data backlog and analytic needs.

100G~1T: These tables have materialized views, which are pre-computed result sets stored in Doris. Queries on these tables are thus much faster and less resource-consuming. The DDL syntax of materialized views in Doris is the same as that in PostgreSQL and Oracle.

More than 1T: These tables are put into the Aggregate Key model of Apache Doris and pre-aggregated. In this way, queries over 2 billion log records can be done in 1~2s.

These strategies have shortened the response time of queries. For example, a query for a specific data item used to take minutes; now it can be finished in milliseconds. In addition, for big tables containing 10 billion data records, queries on different dimensions can all be done in a few seconds.

Ongoing Plans

The user is now testing the newly added inverted index in Apache Doris. It is designed to speed up full-text search of strings as well as equivalence and range queries of numeric and datetime values.
They have also provided valuable feedback about the auto-bucketing logic in Doris. Currently, Doris decides the number of buckets for a partition based on the data size of the previous partition. The problem for the user is that most of their new data comes in during the daytime, but little at night. So in their case, Doris creates too many buckets for night data but too few for daytime data, which is the opposite of what they need. They hope for a new auto-bucketing logic where the reference for deciding the number of buckets is the data size and distribution of the previous day. They've brought this to the Apache Doris community, and the community is now working on this optimization.
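As a side note, the snowflake IDs the user relies on as a bucketing field combine a timestamp with worker and sequence bits, which is why they spread evenly across buckets while remaining time-ordered. Here is a minimal sketch, assuming the common 41/10/12-bit layout and a custom epoch (the user's exact layout is not documented here):

```java
// A minimal sketch of a snowflake-style ID: timestamp bits, then worker
// and sequence bits. High bits follow time, low bits spread evenly, which
// is what makes such IDs a reasonable bucketing field. The bit widths and
// epoch below are the common convention, not the user's exact scheme.
public class SnowflakeSketch {
    private static final long EPOCH = 1_600_000_000_000L; // custom epoch (assumption)
    private final long workerId;   // 10 bits
    private long sequence = 0L;    // 12 bits
    private long lastMillis = -1L;

    public SnowflakeSketch(long workerId) {
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastMillis) {
            sequence = (sequence + 1) & 0xFFF; // stay within 12 bits in the same millisecond
        } else {
            sequence = 0L;
            lastMillis = now;
        }
        // 41 bits of timestamp | 10 bits of worker | 12 bits of sequence
        return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
    }

    public static void main(String[] args) {
        SnowflakeSketch gen = new SnowflakeSketch(1);
        long first = gen.nextId();
        long second = gen.nextId();
        System.out.println(second > first); // IDs are strictly time-ordered here
    }
}
```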
Your team celebrates a success story: a trace identified a pesky latency issue in your application's authentication service. A fix was swiftly implemented, and everyone celebrated a quick win in the next team meeting. But the celebrations are short-lived. Just days later, user complaints surge about a related payment gateway timeout. It turns out that the fix did improve performance at one point but created a situation in which key information was never cached. Other parts of the software reacted badly to the change, and the whole thing had to be reverted. While the initial trace provided valuable insights into the authentication service, it didn't explain why the system was built this way. Relying solely on a single trace gave us a partial view of a broader problem. This scenario underscores a crucial point: while individual traces are invaluable, their true potential is unlocked only when they are viewed collectively and in context. Let's delve deeper into why a single trace might not be the silver bullet we often hope for and how a more holistic approach to trace analysis can paint a clearer picture of our system's health and the way to combat problems.

The Limiting Factor

The first problem is the narrow perspective. Imagine debugging a multi-threaded Java application. If you were to focus only on the behavior of one thread, you might miss how it interacts with others, potentially overlooking deadlocks or race conditions. Let's say a trace reveals that a particular method, fetchUserData(), is taking longer than expected. By optimizing only this method, you might miss that the real issue is a synchronized block in another related method, causing thread contention and slowing down the entire system.

Temporal blindness is the second problem. Think of a Java garbage collection (GC) log.
A single GC event might show a minor pause, but without observing it over time, you won't notice if there's a pattern of increasing pause times indicating a potential memory leak. A trace might show that a Java application's response time spiked at 2 PM. However, without looking at traces over a longer period, you might miss that this spike happens daily, possibly due to a scheduled task or a cron job that's putting undue stress on the system.

The last problem is related to the previous two: missing context. Imagine analyzing the performance of a Java method without knowing the volume of data it's processing. A method might seem inefficient, but perhaps it's processing a significantly larger dataset than usual. A single trace might show that a Java method, processOrders(), took 5 seconds to execute. However, without context, you wouldn't know if it was processing 50 orders or 5,000 orders in that time frame. Another trace might reveal that a related method, fetchOrdersFromDatabase(), is retrieving an unusually large batch of orders due to a backlog, thus providing context to the initial trace.

Strength in Numbers

Think of traces as chapters in a book and metrics as the book's summary. While each chapter (trace) provides detailed insights, the summary (metrics) gives an overarching view. Reading chapters in isolation might lead to missing the plot, but when read in sequence and in tandem with the summary, the story becomes clear. We need this holistic view. If individual traces show that certain Java methods like processTransaction() are occasionally slow, grouped traces might reveal that these slowdowns happen concurrently, pointing to a systemic issue. Metrics, on the other hand, might show a spike in CPU usage during these times, indicating that the system might be CPU-bound under high transaction loads. This helps us distinguish between correlation and causation.
Grouped traces might show that every time the fetchFromDatabase() method is slow, the updateCache() method also lags. While this indicates a correlation, metrics might reveal that cache misses (a specific metric) increase during these times, suggesting that database slowdowns might be causing cache update delays, establishing causation. This is especially important in performance tuning. Grouped traces might show that the handleRequest() method's performance has been improving over several releases. Metrics can complement this by showing a decreasing trend in response times and error rates, confirming that recent code optimizations are having a positive impact. I wrote about this extensively in a previous post about the "tongs" motion needed to isolate an issue. This motion can be accomplished purely through the use of observability tools such as traces, metrics, and logs.

Example

Observability is somewhat resistant to examples. Everything I try to come up with feels a bit synthetic and unrealistic when I examine it after the fact. Having said that, I looked at my modified version of the venerable Spring Pet Clinic demo using digma.ai. Running it showed several interesting concepts taken by Digma. Probably the most interesting feature is the ability to look at what's going on in the server at this moment. This is an amazing exploratory tool that provides a holistic view of a moment in time. But the thing I want to focus on is the "Insights" column on the right. Digma tries to combine the separate traces into a coherent narrative. It's not bad at it, but it's still a machine. Some of that work should probably still be done manually, since a machine can't understand the why, only the what. It seems it can detect the venerable Spring N+1 problem seamlessly. But this is only the start. One of my favorite things is the ability to look at tracing data next to a histogram and a list of errors in a single view. Is performance impacted because there are errors?
How much does the performance issue impact the rest of the application? These become questions with easy answers once we see all the different aspects laid out together.

Magical APIs

The N+1 problem I mentioned before is a common bug in the Java Persistence API (JPA); the great Vlad Mihalcea has an excellent explanation of it. The TL;DR is rather simple: we write a simple database query using the ORM, but we accidentally split the transaction, causing the data to be fetched N+1 times, where N is the number of records we fetch. This is painfully easy to do since transactions are so seamless in JPA. This is the biggest problem with "magical" APIs like JPA. These are APIs that do so much that they feel like magic, but under the hood, they still run regular old code. When that code fails, it's very hard to see what is going on. Observability is one of the best ways to understand why these things fail. In the past, I would reach for the profiler for such things, which would often entail a lot of work. Getting the right synthetic environment for running a profiling session is often very challenging. Observability lets us do that without the hassle.

Final Word

Relying on a single individual trace is akin to navigating a vast terrain with just a flashlight. While these traces offer valuable insights, their true potential is only realized when they are viewed collectively. The limitations of a single trace, such as a narrow perspective, temporal blindness, and lack of context, can often lead developers astray, causing them to miss broader systemic issues. On the other hand, the combined power of grouped traces and metrics offers a panoramic view of system health. Together, they allow for a holistic understanding, precise correlation of issues, performance benchmarking, and enhanced troubleshooting. For Java developers, this tandem approach ensures a comprehensive and nuanced understanding of applications, optimizing both performance and user experience.
In essence, while individual traces are the chapters of our software story, it's only when they're read in sequence and in tandem with metrics that the full narrative comes to life.
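To make the "strength in numbers" argument concrete, here is a toy sketch that groups span durations by operation and reports a 95th-percentile duration per method. The span data and method names are invented for illustration, and this is not the API of any particular tracing backend; it only shows how a distribution surfaces what a single trace cannot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration: judge operations by the distribution of their span
// durations rather than by any single trace. Data and names are made up.
public class SpanStats {
    record Span(String operation, long durationMs) {}

    // Returns the 95th-percentile duration per operation (nearest-rank method).
    public static Map<String, Long> p95ByOperation(List<Span> spans) {
        Map<String, List<Long>> grouped = new HashMap<>();
        for (Span s : spans) {
            grouped.computeIfAbsent(s.operation(), k -> new ArrayList<>()).add(s.durationMs());
        }
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            List<Long> durations = e.getValue();
            Collections.sort(durations);
            int idx = Math.max(0, (int) Math.ceil(0.95 * durations.size()) - 1);
            result.put(e.getKey(), durations.get(idx));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span("processTransaction", 40),
                new Span("processTransaction", 45),
                new Span("processTransaction", 900), // the outlier one lone trace can't explain
                new Span("fetchUserData", 12));
        System.out.println(p95ByOperation(spans));
    }
}
```

A single 40ms trace of processTransaction() looks healthy; only the grouped view reveals the tail that users actually feel.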
If your environment is like many others, it can often seem like your systems produce logs filled with excess data. Since you need to access multiple components (servers, databases, network infrastructure, applications, etc.) to read your logs, and they don’t typically have any specific purpose or focus holding them together, you may dread sifting through them. Without the right tools, it can feel like you’re stuck with a bunch of disparate, hard-to-parse data. In these situations, I picture myself as a cosmic collector, gathering space debris as it floats by my ship and sorting the occasional good material from the heaps of galactic junk. Though it can feel like more trouble than it’s worth, sorting through logs is crucial. Logs hold many valuable insights into what’s happening in your applications and can indicate performance problems, security issues, and user behavior. In this article, we’re going to take a look at how log analytics can help you make sense of your log data without much effort. We’ll talk about best practices and habits and use some of the Log Analytics tools from Sumo Logic as examples. Let’s blast off and turn that cosmic trash into treasure!

The Truth Is Out There: Getting Value Just From the Things You’re Already Logging

One massive benefit a log analytics platform offers any systems engineer is a single log interface. Rather than needing to SSH into countless machines or download logs and parse through them manually, viewing all your logs in a centralized aggregator makes it much easier to see simultaneous events across your infrastructure. You’ll also be able to clearly follow the flow of data and requests through your stack. Once you see all your logs in one place, you can tap into the latent value of all that data.
Of course, you could make your own aggregation interface from scratch, but log aggregation tools often provide a number of extra features that are worth the additional investment, such as powerful search and fast analytics.

Searching Through the Void: Using Search Query Language To Find Things

You’ve probably used grep or similar tools for searching through your logs, but for real power, you need the ability to search across all of your logs in one interface. You may have even investigated using the ELK stack on your own infrastructure to get going with log aggregation. If you have, you know how valuable putting all your logs in the same place can be. Some tools provide even more functionality on top of this interface. For example, with Log Analytics, you can use a Search Query Language that allows for more complex searches. Because these searches are executed across a vast amount of log data, you can use special operations to harness the power of your log aggregation service. Some of these operations can be achieved with grep, so long as you have all of the logs at your disposal. But others, such as aggregate operators, field expressions, or transaction analytics tools, can produce extremely powerful reports and monitoring triggers across your infrastructure. To choose just one tool as an example, let’s take a closer look at field expressions. Essentially, field expressions allow you to create variables in your queries based on what you find in your log data. For example, if you know your log lines follow the format “From: Jane To: John,” you can parse out the “from” and “to” with the following query:

* | parse "From: * To: *" as (from, to)

This would store “Jane” in the “from” field and “John” in the “to” field. Another valuable language feature you could tap into would be keyword expressions.
You could use this query to search across your logs for any instances where a command with root privileges failed:

(su OR sudo) AND (fail* OR error)

Here is a listing of General Search Examples that are drawn from parsing a single Apache log message.

Light-Speed Analytics: Making Use of Real-Time Reports and Advanced Analytics

One other aspect of searching is that it typically looks into the past. Sometimes, you need to see things as they happen. Let’s take a look at Live Tail and LogReduce, two tools that improve on simple searches. Versions of these features exist on many platforms, but I like the way they work on Sumo Logic’s offering, so we’ll dive into them.

Live Tail

At its simplest, Live Tail lets you see a live feed of your log messages. It’s like running tail -f on any one of your servers to see the logs as they come in, but instead of being on a single machine, you’re looking across all logs associated with a Sumo Logic Source or Collector. Your Live Tail can be modified to automatically filter for only specific things. Live Tail also supports highlighting keywords (up to eight of them) as the logs roll in.

LogReduce

LogReduce gives you more insight into your search query’s aggregate log results. When you run LogReduce on a query, it performs fuzzy logic analysis on messages meeting the search criteria you defined and then provides you with a set of “Signatures” that meet your criteria. It also gives you a count of the logs with that pattern and a rating of the pattern’s relevance to your search. You then have tools at your disposal to rank the generated signatures and even perform further analysis on the log data. This is all fairly advanced and can be hard to grasp without a demo, so you can dive deeper by watching this video.

Integrated Log Aggregation

Often, you’ll need information from systems you aren’t running directly mixed in with your other logs.
That’s why it’s important to make sure you can integrate your log aggregator with other systems, and many log aggregators provide this functionality. Elastic, which underlies the ELK stack, provides a bunch of integrations that you can hook into your self-hosted or cloud-hosted stack. Of course, integrations aren’t only available on the ELK stack; Sumo Logic provides a whole list of integrations as well. Regardless, the power of connecting your logs with the many systems you use outside of your monitoring and operational stack is phenomenal. Want to get logs sent from your company’s 1Password account into the rest of your logs? Need more information from AWS than you are getting on your individual instances or services? ELK and Sumo Logic provide great options. The key to understanding this concept is that you don’t need to be the one controlling the logs to make it valuable to aggregate them. Think through the full picture of what systems keep your business running, and consider putting all of those logs in your aggregator together.

Conclusion

This has been a brief tour through some of the features available with log aggregation. There’s a lot more to it, which shouldn’t be surprising given the vast amount of data generated every second by our infrastructure. The really amazing part of these tools is that these insights are available to you without installing anything on your servers. You just need a way to export your log data to the aggregation service. Whether you need to track compliance or monitor the reliability of your services, log aggregation is an incredibly powerful tool that can unlock immense value from your already existing log data. That way, you can become a better cosmic junk collector!
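To see what a field expression does conceptually, here is a plain-Java sketch that mimics the parse "From: * To: *" as (from, to) query shown earlier with a regular expression, binding each wildcard to a named field. This imitates the idea only, not Sumo Logic's actual query engine.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Conceptual sketch of a field expression: each wildcard in the parse
// pattern becomes a capture group whose value is bound to a named field.
// This mimics the idea of the query language, not its implementation.
public class FieldExpression {
    private static final Pattern FROM_TO = Pattern.compile("From: (\\S+) To: (\\S+)");

    public static Map<String, String> parse(String logLine) {
        Matcher m = FROM_TO.matcher(logLine);
        if (!m.find()) {
            return Map.of(); // no match: no fields extracted
        }
        return Map.of("from", m.group(1), "to", m.group(2));
    }

    public static void main(String[] args) {
        // Extracts from=Jane and to=John from a line in the documented format.
        System.out.println(parse("2024-01-01 mail: From: Jane To: John"));
    }
}
```

Once fields like "from" and "to" are bound, an aggregation service can count, group, and alert on them, which is exactly where these tools pull ahead of grep.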