Modern systems span numerous architectures and technologies and are becoming increasingly modular, dynamic, and distributed in nature. These complexities pose new challenges for the developers and SRE teams charged with ensuring the availability, reliability, and performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.
Building performant services and systems is at the core of every business. New technologies emerge daily, promising capabilities that help you surpass your performance benchmarks. However, production environments are chaotic landscapes that exact a heavy performance toll when not maintained and monitored. Although Kubernetes is the de facto choice for container orchestration, many organizations struggle to implement it well. Growing organizations, in the process of scaling up their services, unintentionally introduce complexities into the system. Knowing how the infrastructure is set up and how clusters operate and communicate is crucial. Most infrastructure is split into a network of systems that communicate and share workloads, so it helps enormously to be able to see, visually, how those systems are connected and what underlies them. Mapping the network using an efficient tool for visualization and assessment is essential for monitoring and maintaining services.

Introduction To Visual Network Mapping

Network mapping is the process of identifying and cataloging all the devices and connections within a network. A visual network map is a graphical representation of the network that displays the devices and the links between them. Visual network maps can provide a comprehensive understanding of a network's topology and identify potential problems or bottlenecks, allowing for modifications and expansion plans that can significantly improve troubleshooting, planning, analysis, and monitoring.

Open-source security tools, such as OpenVAS, Nmap, and Nessus, can be used to conduct network mapping and generate visual network maps. These tools are freely available, making them a cost-effective option for organizations looking to improve their network security. Furthermore, many open-source security tools also offer active community support, enabling users to share knowledge, tips, and best practices for using the tool to its full potential.

Benefits of Using Visual Network Maps

A visual network map is an effective tool for planning and developing new networks, expanding or modernizing existing networks, and analyzing network problems or issues. A properly set up visual network map can greatly augment your monitoring, tracking, and remediation capabilities. It can give you a clear and complete picture of the network, enabling you to pinpoint an issue's potential source and resolve it then and there, or it can assist you in real-time network monitoring and notify you of any changes or problems before they escalate.

Introduction to Caretta and Grafana

Caretta is an open-source network visualization and monitoring tool that enables real-time network viewing and monitoring. Grafana is an open-source data visualization and monitoring platform that enables you to create customized dashboards and alerts as well as examine and analyze data. Combining Caretta and Grafana creates an effective solution for understanding and managing your network.

How Caretta Uses eBPF and Grafana

Caretta's reason for existence is to help you understand the topology and the relationships between devices in distributed environments. It offers various capabilities such as device discovery, real-time monitoring, alerts, notifications, and reporting. It uses Victoria Metrics to gather and publish its metrics, and any Prometheus-compatible dashboard can use the results. By enabling tolerations, Caretta can also be scheduled onto nodes that carry typical control-plane taints.
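Since Caretta publishes its metrics through Victoria Metrics in a Prometheus-compatible format, any dashboarding tool can be pointed directly at that metrics store. As a hedged sketch only (the service name, namespace, and port below are assumptions that depend on your Helm release), a Grafana data source provisioned against Caretta's bundled Victoria Metrics instance might look like this:

```yaml
# Hypothetical Grafana data source provisioning file; adjust the URL to your cluster.
apiVersion: 1
datasources:
  - name: caretta-metrics
    type: prometheus   # Victoria Metrics exposes a Prometheus-compatible query API
    access: proxy
    # Assumed in-cluster service name for the Victoria Metrics instance bundled with Caretta
    url: http://caretta-victoria-metrics-single-server.caretta.svc.cluster.local:8428
    isDefault: false
```

Provisioning the data source this way simply saves a manual step in the Grafana UI; the same connection can be configured by hand instead.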
Caretta gathers network information, such as device and connection details, using the eBPF (extended Berkeley Packet Filter) kernel functionality and then uses the Grafana platform to present the information in a visual map.

Grafana's Role in Visualizing Caretta's Network Maps

Grafana is designed to be a modular and flexible tool that integrates and onboards a wide range of data sources and custom applications with simplicity. Due to its customizable capabilities, you can modify how the network map is presented using the Grafana dashboard. Additionally, you can pick from several visualization options to present the gathered data in an understandable and helpful way. Grafana is crucial both for showing the network data that Caretta has gathered and for giving users a complete picture of the network.

Using Caretta and Grafana To Create a Visual Network Map

To use Caretta and Grafana for creating a visual network map, you must set up, incorporate, and configure them. The main configuration item is the Caretta daemonset. You must deploy the Caretta daemonset to the cluster of choice, which will collect the network metrics into a database, and set up the Grafana data source to point to the Caretta database to see the network map.

Prerequisites and Requirements for Using Caretta and Grafana

Caretta is a modern tool integrated with advanced features. It requires a Linux kernel version >= 4.16 on an x86-64 system and is installed via a Helm chart. Let's dive in and see how to install and configure this brilliant tool combination.

Steps for Installing and Configuring Caretta and Grafana

With an already pre-configured Helm chart, installing Caretta is just a few calls away. The recommendation is to install Caretta in a new, unique namespace.

helm repo add groundcover https://helm.groundcover.com/
helm repo update
helm install caretta --namespace caretta --create-namespace groundcover/caretta

The same can be applied to installing Grafana.

helm install my-grafana --set "adminPassword=secret" \
  --namespace monitoring -f custom-values.yaml stable/grafana

Our custom-values.yaml will look something like below:

## Grafana configuration
grafana.ini:
  ## server
  server:
    protocol: http
    http_addr: 0.0.0.0
    http_port: 3000
    domain: grafana.local
  ## security
  security:
    admin_user: admin
    admin_password: password
    login_remember_days: 1
    cookie_username: grafana_admin
    cookie_remember_name: grafana_admin
    secret_key: hidden
  ## database
  database:
    type: rds
    host: mydb.us-west-2.rds.amazonaws.com
  ## session
  session:
    provider: memory
    provider_config: ""
    cookie_name: grafana_session
    cookie_secure: true
    session_life_time: 600

## Grafana data
persistence:
  enabled: true
  storageClass: "-"
  accessModes:
    - ReadWriteOnce
  size: 1Gi

Configuration

You can configure Caretta using Helm values. Values in Helm are a chart's setup choices. When the chart is installed, you can change the values listed in a file called values.yaml, which is part of the chart package, and customize the configurations based on the requirement at hand.
An example of configuration overwriting default values is shown below:

pollIntervalSeconds: 15  # set metrics polling interval

tolerations:  # set any desired tolerations
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule

config:
  customSetting1: custom-value1
  customSetting2: custom-value2

victoria-metrics-single:
  server:
    persistentVolume:
      enabled: true  # set to true to use persistent volume

ebpf:
  enabled: true  # set to true to enable eBPF
  config:
    someOption: ebpf_options

The pollIntervalSeconds value sets the interval at which metrics are polled. In our case, we have set it to poll every 15 seconds. The tolerations section allows specifying tolerations for the pods. In the example shown, the pods tolerate the node-role.kubernetes.io/control-plane taint with the effect NoSchedule, so they can also be scheduled onto control-plane nodes. The config section allows us to specify custom configuration options for the application. The victoria-metrics-single section allows us to configure the Victoria Metrics single server; here, it enables the persistent volume. The ebpf section allows us to enable eBPF and configure its options.

Creating a Visual Network Map With Caretta and Grafana

Caretta consists of two parts: the "Caretta Agent" and the "Caretta Server." The Caretta Agent, a Kubernetes DaemonSet, runs on every node in the cluster and collects information about the cluster's status. To view that data as a network map and generate a visual network map, you will need to bring the data gathered by Caretta into Grafana.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: caretta-deploy-test
  namespace: caretta-deploy-test
spec:
  selector:
    matchLabels:
      app: caretta-deploy-test
  template:
    metadata:
      labels:
        app: caretta-deploy-test
    spec:
      containers:
        - name: caretta-deploy-test
          image: groundcover/caretta:latest
          command: ["/caretta"]
          args: ["-c", "/caretta/caretta.yaml"]
          volumeMounts:
            - name: config-volume
              mountPath: /caretta
      volumes:
        - name: config-volume
          configMap:
            name: caretta-config

Data from the Caretta Agent is received by the Caretta Server, a Kubernetes StatefulSet, which then saves it in a database.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: caretta-deploy-test
  labels:
    app: caretta-deploy-test
spec:
  serviceName: caretta-deploy-test
  replicas: 1
  selector:
    matchLabels:
      app: caretta-deploy-test
  template:
    metadata:
      labels:
        app: caretta-deploy-test
    spec:
      containers:
        - name: caretta-deploy-test
          image: groundcover/caretta:latest
          env:
            - name: DATABASE_URL
              value: mydb.us-west-2.rds.amazonaws.com
          ports:
            - containerPort: 80
              name: http

To accomplish this, you will need to create a custom data source plugin in Grafana to connect to Caretta's data and then develop visualizations in Grafana to show that data.

[datasources]
  [datasources.caretta]
  name = caretta-deploy-test
  type = rds
  url = mydb.us-west-2.rds.amazonaws.com
  access = proxy
  isDefault = true

Customization Options for the Network Map and How to Access Them

The network map that Caretta and Grafana produce can be customized in a variety of ways. We can customize the following:

Display options: With display customization options, you have control over the layout of the map and the thickness and color of the connections and devices.

Data options: With data options, you may select which information, including warnings, performance metrics, and details about your devices and connections, is shown on the map.
Alerting options: With alerting options, you can be informed of any network changes or problems, such as heavy traffic, sluggish performance, or connectivity problems.

Visualization options: With visualization options, you can present the gathered data in an understandable and useful way.

Usually, you'll need to use the Grafana dashboard to access these and other customization options. Depending on the version of Caretta and Grafana you are running and your particular setup and needs, you will have access to different options and settings.

Interpreting and Using the Visual Network Map

The primary goals of a visual network map made with Caretta and Grafana are aiding in network topology comprehension, the identification of possible bottlenecks or problems, and the planning and troubleshooting of network problems. You must understand the various components of the map and what they stand for in order to interpret and use the visual network map. Some of the types of information that may be displayed on the map are:

Devices: The network's endpoints, including servers, switches, and routers, are presented on the map.

Connections: The connections between devices, such as network cables, wireless connectivity, or virtual connections, and sometimes the connectivity type, may be depicted on the map.

Data: Performance indicators, alarms, and configuration information will be displayed on the maps.

Tips for Using the Network Map To Assess Performance in Your K8s Cluster

Creating a curated, informative, and scalable network map is more challenging than it sounds, but with a proper tool set, it becomes manageable. We have seen what we can accomplish using Caretta and Grafana together. Now, let's see what we need to consider when using network maps that showcase the performance metrics of your Kubernetes clusters. First and foremost, understand the network topology of the cluster, including the physical and virtual networks that your services run on. Next, ensure that the network plugin that you are using is compatible with your application. Then, define network policies to secure communication between pods, control ingress and egress traffic, and make the cluster easier to monitor and troubleshoot (a minimal policy sketch follows at the end of this article). Finally, understand how pod-to-pod communication and pod networking happen in your cluster.

Conclusion

Breaking down large systems into microservices, making systems distributed, and orchestrating them is the most widely followed approach to boosting performance and uptime, and Kubernetes and Docker are the market leaders here. As performant as this approach is, observability is a concern in large-scale distributed systems. We need to account for outliers and anomalies in order to monitor and enhance the overall system with optimal performance in mind. New technologies make innovations and advancements easy but introduce unknown impediments to the system. You need an observability tool that can track all the network operations and present them in an efficient and informative way. Grafana is the leading tool in the monitoring space. By combining Caretta, an open-source network visualization and monitoring tool, with Grafana, we can unlock the true value of our infrastructure.
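To make the network policy tip concrete, here is a minimal sketch of a Kubernetes NetworkPolicy. The namespace, labels, and port are hypothetical; the point is simply that restricting which pods may talk to a workload keeps the traffic you later see on the network map intentional and easier to reason about:

```yaml
# Hypothetical policy: only pods labeled app=frontend may reach pods labeled app=api on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: demo            # assumed namespace
spec:
  podSelector:
    matchLabels:
      app: api               # the workload being protected
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend  # the only allowed client
      ports:
        - protocol: TCP
          port: 8080
```

Note that a NetworkPolicy only takes effect if the cluster's network plugin enforces it, which is one more reason to check plugin compatibility as suggested above.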
BindPlane OP is a powerful open-source tool that makes it easy to build and manage telemetry pipelines to ship data from IT environments of any kind and size to any analysis tool or storage destination. BindPlane OP installs and configures OpenTelemetry agents, which support a wide variety of sources and can be configured to ship data to multiple destinations while enriching or reducing data simultaneously. The vendor-agnostic toolset is excellent for reducing data costs and getting the most out of your data.

OpenTelemetry agent configurations are formatted as YAML files that are always editable without the need to uninstall and reinstall the agent. With BindPlane OP, you can create and edit configurations with a codeless visual interface and step-by-step options for collecting and managing your data. This article covers how to create and edit a configuration in BindPlane OP, add sources and destinations, add processors to reduce or enrich data, and apply that configuration to one or many agents with a few clicks.

What Are Configs in BindPlane OP?

Configurations in BindPlane OP are instructions for directing OpenTelemetry agents. The Configs tab is where you create, save, and apply agent configurations in BindPlane OP. Existing configurations appear in a sortable, searchable list, and new configurations can be created by clicking the "New Configuration" button at the top. Clicking on a configuration will open a page where you can see details for that configuration, including the sources, destinations, and a topographical map representing the real-time data flow and processing of agents using that configuration. You can edit the sources and destinations by clicking them, and add new ones by clicking the respective buttons.

New configurations created on the Configs page exist as templates that do not affect your agents unless explicitly applied, so don't worry about affecting your environment while creating or editing new configurations. However, if you edit an existing configuration that is already applied to agents in your environment, those changes will be pushed to the appropriate agents when you click "Save" after an edit is made. At the bottom of any configuration's page, you can see a searchable list of agents using that configuration, their operating system, and their connection status. You can also apply the configuration to new agents by clicking the "+" button at the top of this section.

Creating an OpenTelemetry Agent Configuration in BindPlane OP

Creating a configuration is a three-step process:

1. Give your configuration a name, select your intended operating system, and write a brief description. It's good practice to give enough detail in the description so that other users can quickly identify the intended use of the configuration.

2. Add your intended source(s). Sources appear in a searchable list with "log," "metrics," and "traces" labels. You can adjust basic settings for the source, like the hostname and port. You can also adjust advanced settings, such as the collection interval. Advanced settings vary by source.

3. Add your destination(s). Destinations appear in a searchable list. Click a destination to select it. Give your destination a name for use in BindPlane OP (the name does not have to match anything outside of BindPlane OP, but it can make organization easier). Add the necessary routing and authentication information for the destination. The specific requirements vary by destination, but typically include a region and an access key or password.
Tooltips in BindPlane OP can help you find your authentication information if you don't have it. You can add as many destinations as you wish by repeating the previous steps; OpenTelemetry agents can ship data to any number of destinations. Once you add a destination to a configuration, it is saved as a component, so you can skip the routing and authentication when you use it in another configuration.

Adding Processors to Enrich Data and Reduce Data Costs

Processors are components in configurations that manipulate data at the source before it's sent to your destinations. There are many ways data can be manipulated, but the two general categories of data manipulation are enrichment and reduction. Enrichment means you add information to data at the source that helps you use the data more effectively at your destination. Reduction means you filter or eliminate data to reduce data flow and save on data costs. Below are the steps for adding a processor, followed by a list of processors and their uses.

1. On the Configs page, click on the configuration you want to add a processor to, then click on the source you want the processor to affect data from.

2. Click "Add Processor" at the bottom of the source details.

3. Select the processor you want from the list, or create a custom processor.

4. Input the details for how you want the processor to manipulate your data, then click "Save." The processor will be applied to any agents using the configuration automatically.

Add Attribute: The Add Log Record Attribute processor can be used to enrich logs by adding attributes to all log records in the pipeline.

Add Resource: The Add Resource Attribute processor can be used to enrich telemetry by adding resources to all metrics, traces, and logs in the pipeline.

Custom Processor: The Custom processor can be used to inject a custom processor configuration into a pipeline. It is useful for solving use cases not covered by BindPlane OP's other processor types.

Delete Attribute: The Delete Log Record Attribute processor can be used to remove attributes from all log records in the pipeline.

Delete Resource: The Delete Resource processor can be used to remove resources from metrics, traces, and logs.

Filter Severity: The Severity Filter processor can be used to filter out logs that do not meet a given severity threshold.

Filter Log Record Attribute: The Log Record Attribute Filter processor can be used to include or exclude logs based on matched attributes.

Filter Resource Attribute: The Resource Attribute Filter processor can be used to include or exclude logs based on matched resources.

Filter Metric Name: The Metric Name Filter processor can be used to include or exclude metrics based on their name.

Log Sampling: The Log Sampling processor can be used to filter out logs with a configured "drop ratio."

Extract Metric: The Extract Metric processor can look at all logs matching a filter, extract a numerical value from a field, and then create a metric with that value. Both the name and units of the created metric can be configured. Additionally, fields from matching logs can be preserved as metric attributes.

Count Logs: The Count Logs processor can count the number of logs matching some filter and create a metric with that value. Both the name and units of the created metric can be configured. Additionally, fields from matching logs can be preserved as metric attributes.
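As noted earlier, the agent configurations BindPlane OP manages are plain OpenTelemetry YAML. As a hand-written, collector-style illustration only (this is not BindPlane's generated output, and the file path and attribute value are assumptions), a pipeline that reads a log file and enriches every record with an added attribute, roughly the idea behind the Add Attribute processor, might look like this:

```yaml
receivers:
  filelog:
    include:
      - /var/log/myapp/*.log        # hypothetical log source

processors:
  attributes/add_environment:
    actions:
      - key: environment            # enrich each log record with an attribute
        value: production
        action: insert

exporters:
  logging: {}                       # stand-in destination for the example

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [attributes/add_environment]
      exporters: [logging]
```

In BindPlane OP you build this visually rather than writing the YAML, but knowing the shape of the underlying configuration can help when reading agent logs or debugging a pipeline.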
DevOps, as initially conceived, was more of a philosophy than a set of practices—and it certainly wasn't intended to be a job title or a role spec. Yet today, DevOps engineers, site reliability engineers, cloud engineers, and platform engineers are all in high demand—with overlapping skillsets and with recruiters peppering role descriptions with liberal sprinklings of loosely related keywords such as "CI/CD pipeline," "deployment engineering," "cloud provisioning," and "Kubernetes."

When I co-founded Kubiya.ai, my investors pushed me to define my target market better. For example, was it just DevOps or also SREs, cloud and platform engineers, and other end users? More recently, I'm seeing lots of interest from job seekers and recruiters in defining these roles. From Reddit posts to webinars, this is a hotly debated topic. In this article, I offer my thoughts but recognize there's a great deal of room for interpretation. This is an inflammatory topic for many—so at the risk of provoking a conflagration, let's proceed!

The Proliferation in DevOps Job Specs

The practice of DevOps evolved in the 2000s to address the need to increase release velocity and reduce product time to market while maintaining system stability. In addition, service-oriented architectures allowed separate developer teams to work independently on individual services and applications, enabling faster prototyping and iteration than ever before. The traditional tension grew between a development team focused on software release and a separate, distinct operations team focused on system stability and security. This hindered the pace that many businesses aspired to. In addition, devs didn't always properly understand operational requirements, while ops couldn't head off performance problems before they arose.

The DevOps answer was to break down silos and encourage greater collaboration facilitated by tooling, cultural change, and shared metrics. Developers would own what they built—they would be able to deploy, monitor, and resolve issues end to end. Operations would better understand developer needs; get involved earlier in the product lifecycle; and provide the education, tools, and guardrails to facilitate dev self-service.

DevOps, as initially conceived, was more of a philosophy than a prescriptive set of practices—so much so that there isn't even common agreement on the number and nature of these practices. Some cite the "four pillars of DevOps," some the "five pillars," and some the six, seven, eight, or nine. You can take your pick. Different organizations have implemented DevOps differently (and many have not at all). And here, we can anticipate the job spec pickle we've found ourselves in. As Patrick Debois, founder of DevOpsDays, noted, "It was good and bad not to have a definition. People… are really struggling with what DevOps is right now. But, on the other hand, not writing everything down meant that it evolved in many directions."

The one thing that DevOps was not was a role specification. Fast forward to today, and numerous organizations are actively recruiting for "DevOps Engineers." Worse still, there is very little clarity on what one is—with widely differing skill sets sought from one role to the next. Related and overlapping roles such as "site reliability engineer," "platform engineer," and "cloud engineer" are muddying already dim waters. How did we get here, and what—if any—are the real differences between these roles?
DevOps and DevOps Anti-Types

In my experience, realizing DevOps as it was originally conceived—i.e., optimally balancing specialization with collaboration and sharing—has been challenging for many organizations. Puppet's 2021 State of DevOps report found that only 18% of respondents identify themselves as "highly evolved" practitioners of DevOps. And as the team at DevOps Topologies describe, some of these successes come from special circumstances. For example, organizations such as Netflix and Facebook arguably have a single web-based product, which reduces the variation between product streams that can force dev and ops further apart. Others have imposed strict collaboration conditions and criteria—such as the SRE teams of Google (more on that later!), who also wield power to reject software that endangers system performance.

Many of those at a lower level of DevOps evolution struggle to fully realize the promise of DevOps, owing to organizational resistance to change, skills shortages, lack of automation, or legacy architectures. As a result, a wide range of different DevOps implementation approaches will have been adopted across this group, including some of the DevOps "anti-types" described by DevOps Topologies. For many, dev and ops will still be siloed. For others, DevOps will be a tooling team sitting within development and working on deployment pipelines, configuration management, and such, but still in isolation from ops. And for others, DevOps will be a simple rebranding of SysAdmin, with DevOps engineers hired into ops teams with expanded skillset expectations but with no real cultural change taking place.

The rapid adoption of public cloud usage has also fueled belief in the promise of a self-service DevOps approach. But being able to provision and configure infrastructure on-demand is a far cry from enabling devs to deploy and run apps and services end to end. Unfortunately, not all organizations understand this, and so automation for many has stalled at the level of infrastructure automation and configuration management.

With so many different incarnations of DevOps, it's no wonder there's no clear definition of a DevOps role spec. For one organization, it might be synonymous only with the narrowest of deployment engineering—perhaps just creating CI/CD pipelines—while at the other end of the spectrum, it might essentially be a rebranding of ops, with additional skills in writing infrastructure as code, deployment automation, and internal tooling. For others, it can be any shade of gray in between, so here we are with a bewildering range of DevOps job listings.

SRE, Cloud Engineer, and Platform Engineer—Teasing Apart the Roles

So depending on the hiring organization, for better or worse, a DevOps Engineer can be anything from entirely deployment-focused to a more modern variation of a SysAdmin. What about the other related roles: SREs, cloud engineers, and platform engineers? Here's my take on each:

Site Reliability Engineer

The concept of SRE was developed at Google by Ben Treynor, who described it as "what you get when you treat operations as a software problem and you staff it with software engineers." The idea was to have people who combine operations skills and software development skills to design and run production systems. The definition of service reliability SLAs is central and ensures that dev teams provide evidence up front that software meets strict operational criteria before being accepted for deployment.
In addition, SREs strive to make infrastructure systems more scalable and maintainable, including—to that end—designing and running standardized CI/CD pipelines and cloud infrastructure platforms for developer use. As you can see, there's a strong overlap with how some would define a DevOps engineer. So perhaps one way of thinking about the difference is this: DevOps originated with the aim of increasing release velocity, while SRE evolved from the objective of building more reliable systems in the context of growing system scale and product complexity. So to some extent, the two have met in the middle.

Cloud Engineer

As the functionality of the cloud has grown, some organizations have created dedicated roles for cloud engineers. Again, although there are no hard and fast rules, cloud engineers are typically focused on deploying and managing cloud infrastructure and know how to build environments for cloud-native apps. They'll be experts in AWS/Azure/Google Cloud Platform. Depending on the degree of overlap with DevOps engineer responsibilities, they may also be fluent in Terraform, Kubernetes, etc. With the forward march of cloud adoption, cloud engineer roles are subsuming what formerly might have been called an infrastructure engineer, with its original emphasis on both cloud and on-premises infrastructure management.

Platform Engineer

Internal developer platforms (IDPs) have emerged as a more recent solution to cutting the Gordian knot of balancing developer productivity with system control and stability. Platform engineers design and maintain IDPs that aim to provide developers with self-service capabilities to independently manage the operational aspects of the entire application lifecycle—from CI/CD workflows; to infrastructure provisioning and container orchestration; to monitoring, alerting, and observability. Many devs simply don't want to do ops—at least not in the traditional sense. As a creative artist, the developer doesn't want to worry about how infrastructure works. So, crucially, the platform is conceived of as a product, achieving control by creating a compelling self-serve developer experience rather than by imposing mandated standards and processes.

Getting Comfortable With Dev and Ops Ambiguity

So, where does this leave candidates for all these various roles? Probably for now—and at least until there is a greater commonality of DevOps implementation approaches—the only realistic answer is to make sure you ask everything you need to during an interview, clarifying both the role expectations and the organizational context into which you will be hired. For recruiters, you may decide for various reasons to cast a wide net, stuffing job postings with trending keywords. But ultimately, the details about a candidate's experience and capabilities must come out in the interview process and conversations with references.

From my perspective, whether you are a DevOps engineer, platform engineer, cloud engineer, or even an SRE, ensuring you are supporting developers with all their operational needs will go a long way in helping them focus on creating the next best thing.
In this sixth installment of the series covering my journey into the world of cloud-native observability, I'm going to start diving into an open-source project called Perses. If you missed any of the previous articles, head on back to the introduction for a quick update.

After laying out the groundwork for this series in the initial article, I spent some time in the second article sharing who the observability players are. I also discussed the teams that these players are on in this world of cloud-native o11y. For the third article, I looked at the ongoing discussion around monitoring pillars versus phases. In the fourth article, I talked about keeping your options open with open-source standards. In my last installment, the fifth article in this series, I talked about bringing monolithic applications into the cloud native o11y world.

Being a developer from my early days in IT, it's been very interesting to explore the complexities of cloud-native o11y. Monitoring applications goes way beyond just writing and deploying code, especially in the cloud-native world. One thing remains the same: maintaining your organization's architecture always requires both a vigilant outlook and an understanding of available open standards.

In this sixth article, I'm going to provide you with an introduction to an up-and-coming open-source metrics dashboard project I'm getting involved in. Not only will this article provide an introduction to the project, but I'm going to get you started hands-on with a workshop I'm developing for getting started with dashboards and visualization.

This article is my start at getting practical hands-on experience in the cloud-native o11y world. I've chosen to start with the rather new, up-and-coming open-source project Perses. Not only am I exploring this project, but as I learn I am sharing this knowledge in a free online workshop that you can follow as it's developed here:

Now let's explore the origins of this new project.

The Origins of Perses

Perses is the first project under the CoreDash community umbrella, which is part of the Linux Foundation. It's a centralized effort to define a standard for visualization and dashboards. Perses is licensed under the Apache License 2.0, a big difference from the project that used to be the default dashboard choice before it opted to change to the Affero General Public License (AGPL) v3. That change means users that make any modifications have to share them back into the project, which is a bit more restrictive than most users want. Its main goal is to explore and define an open-source standard for visualization and dashboards for metrics monitoring. Its first code commit was made in January 2022, and the project has been quite active since then. There are clear project goals:

- Become an open standard dashboard visualization tool.
- Provide embeddable charts and dashboards in any user interface.
- Target Kubernetes (k8s) native mode.
- Provide complete static validation for CI/CD pipelines.
- Support an architecture that allows future plugins.

Time will tell if these goals can be met, but you can check out a more in-depth introduction in this free online workshop lab 1:

Next, we can look at installation options for this project.

Installing Perses

There are two options for installing Perses on your local machine. For the first option, you can build the source code project and run it from there, but there are a few software dependencies that you need to meet first to do that.
Second, if the bar is too high to build the project from its source, you can install and run Perses from a container image. I've put together a simple supporting project that you can use called the Perses Easy Install project. This project contains a simple installation script that allows you to choose either a container install using Podman, or to build the project from its source code. Both methods include sanity checks for the dependencies on your machine before allowing you to install the project.

Install in a Container (Podman)

This is an installation using the provided Perses container image. You will run this container on a virtual machine provided by Podman.

Prerequisites: Podman 4.x+ with your Podman machine started

1. Download and unzip the demo (see project README for links to this download).

2. Run 'init.sh' with the correct argument:

$ podman machine init
$ ./init.sh podman

3. The Perses container image is now running and pre-loaded with demo examples, connect in your browser: http://localhost:8080

For an installation from the source, the following process is needed.

Install on Your Local Machine

This is an installation from the source code of the Perses project. You will test, build, and deploy the Perses server locally on your machine.

Prerequisites: Go version 1.18+, NodeJS version 16+, npm version 8+

1. Download and unzip the demo as linked in step 1 above (see project README for links to this download).

2. Run 'init.sh' with the correct argument:

$ ./init.sh source

3. Perses is now running, connect to the Perses dashboards in your browser: http://localhost:8080

For step-by-step instructions on how to install Perses using a container image or from the source code in the project itself, see this free online workshop lab 2:

More To Come

Next up, I plan to continue working through the Perses project with more workshop materials to share. Stay tuned for more insights into a real, practical experience as my cloud native o11y journey continues.
Monitoring QuestDB in Kubernetes

As any experienced infrastructure operator will tell you, monitoring and observability tools are critical for supporting production cloud services. Real-time analytics and logs help to detect anomalies and aid in debugging, ultimately improving the ability of a team to recover from (and even prevent) incidents.

Since container technologies are drastically changing the infrastructure world, new tools are constantly emerging to help solve these problems. Kubernetes and its ecosystem have addressed the need for infrastructure monitoring with a variety of newly emerging solutions. Thanks to the orchestration benefits that Kubernetes provides, these tools are easy to install, maintain, and use.

Luckily, QuestDB is built with these concerns in mind. From the presence of core database features to the support for orchestration tooling, QuestDB is easy to deploy on containerized infrastructure. This tutorial will describe how to use today's most popular open-source tooling to monitor your QuestDB instance running in a Kubernetes cluster.

Components

Our goal is to deploy a QuestDB instance on a Kubernetes cluster while also connecting it to centralized metrics and logging systems. We will be installing the following components in our cluster:

- A QuestDB database server
- Prometheus to collect and store QuestDB metrics
- Loki to store logs from QuestDB
- Promtail to ship logs to Loki
- Grafana to build dashboards with data from Prometheus and Loki

These components work together as illustrated in the diagram below:

Prerequisites

To follow this tutorial, we will need the following tools. For our Kubernetes cluster, we will be using kind (Kubernetes In Docker) to test the installation and components in an isolated sandbox, although you are free to use any Kubernetes flavor to follow along.

- docker or podman
- kind
- kubectl
- jq
- curl

Getting Started

Once you've installed kind, you can create a Kubernetes cluster with the following command:

Shell
kind create cluster

This will spin up a single-node Kubernetes cluster inside a Docker container and also modify your current kubeconfig context to point kubectl to the cluster's API server.

QuestDB

QuestDB Endpoint

QuestDB exposes an HTTP metrics endpoint that can be scraped by Prometheus. This endpoint, on port 9003, will return a wide variety of QuestDB-specific metrics, including query, memory usage, and performance statistics. A full list of metrics can be found in the QuestDB docs.

Helm Installation

QuestDB can be installed using Helm. You can add the official Helm repo to your registry by running the following commands:

Shell
helm repo add questdb https://helm.questdb.io/
helm repo update

Note that this is only compatible with Helm chart version 0.25.0 and higher. To confirm your QuestDB chart version, run the following command:

Shell
helm search repo questdb

Before installing QuestDB, we need to enable the metrics endpoint. To do this, we can override the QuestDB server configuration in a values.yaml file:

Shell
cat <<EOF > questdb-values.yaml
---
metrics:
  enabled: true
EOF

Once you've added the repo, you can install QuestDB in the default namespace:

Shell
helm install -f questdb-values.yaml questdb questdb/questdb

To test the installation, you can make an HTTP request to the metrics endpoint.
First, you need to create a Kubernetes port forward from the QuestDB pod to your localhost:

Shell
export QUESTDB_POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=questdb,app.kubernetes.io/instance=questdb" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace default port-forward $QUESTDB_POD_NAME 9003:9003

Next, make a request to the metrics endpoint:

Shell
curl http://localhost:9003/metrics

You should see a variety of Prometheus metrics in the response:

# TYPE questdb_json_queries_total counter
questdb_json_queries_total 0
# TYPE questdb_json_queries_completed_total counter
questdb_json_queries_completed_total 0
...

Prometheus

Now that we've exposed our metrics HTTP endpoint, we can deploy a Prometheus instance to scrape the endpoint and store historical data for querying.

Helm Installation

Currently, the recommended way of installing Prometheus is using the official Helm chart. You can add the Prometheus chart to your local registry in the same way that we added the QuestDB registry above:

Shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

As of this writing, we are using the Prometheus chart version 19.0.1 and app version v2.40.5.

Configuration

Before installing the chart, we need to configure Prometheus to scrape the QuestDB metrics endpoint. To do this, we will need to add our additional scrape configs to a prom-values.yaml file:

Shell
cat <<EOF > prom-values.yaml
---
extraScrapeConfigs: |
  - job_name: questdb
    metrics_path: /metrics
    scrape_interval: 15s
    scrape_timeout: 5s
    static_configs:
      - targets:
          - questdb.default.svc.cluster.local:9003
EOF

This config will make Prometheus scrape our QuestDB metrics endpoint every 15 seconds. Note that we are using the internal service URL provided to us by Kubernetes, which is only available to resources inside the cluster.

We're now ready to install the Prometheus chart. To do so, you can run the following command:

Shell
helm install -f prom-values.yaml prometheus prometheus-community/prometheus

It may take around a minute for the application to become responsive as it sets itself up inside the cluster. To validate that the server is scraping the QuestDB metrics, we can query the Prometheus server for a metric. First, we need to open up another port forward:

Shell
export PROM_POD_NAME=$(kubectl get pods --namespace default -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace default port-forward $PROM_POD_NAME 9090

Now we can run a query for available metrics after waiting for a minute or so. We are using jq to filter the output to only the QuestDB metrics:

Shell
curl -s http://localhost:9090/api/v1/label/__name__/values | jq -r '.data[] | select( . | contains("questdb_"))'

You should see a list of QuestDB metrics returned:

questdb_commits_total
questdb_committed_rows_total
...

Loki

Metrics are only part of the application support story. We still need a way to aggregate and access application logs for better insight into QuestDB's performance and behavior. While kubectl logs is fine for local development and debugging, we will eventually need a production-ready solution that does not require the use of admin tooling. We will use Grafana's Loki, a scalable open-source solution that has tight Kubernetes integration.

Helm Installation

Like the other components we worked with, we will also be installing Loki using an official Helm chart, loki-stack.
The loki-stack Helm chart includes Loki, used as the log database, and Promtail, a log shipper that is used to populate the Loki database. First, let's add the chart to our registry:

Shell
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

Loki and Promtail are both enabled out of the box, so all we have to do is install the Helm chart without even supplying our own values.yaml.

Shell
helm install loki grafana/loki-stack

After around a minute or two, the application should be ready to go. To test that Promtail is shipping QuestDB logs to Loki, we first need to generate a few logs on our QuestDB instance. We can do this by curling the QuestDB HTTP frontend to generate a few INFO-level logs. This is exposed on a different port than the metrics endpoint, so we need to open up another port forward first.

Shell
export QUESTDB_POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=questdb,app.kubernetes.io/instance=questdb" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace default port-forward $QUESTDB_POD_NAME 9000:9000

Now navigate to http://localhost:9000, which should point to the QuestDB HTTP frontend. Your browser should make a request that causes QuestDB to emit a few INFO-level logs. You can query Loki to check if Promtail picked up and shipped those logs. Like the other components, we need to set up a port forward to access the Loki REST API before running the query.

Shell
export LOKI_POD=$(kubectl get pods --namespace default -l "name=loki,app=loki" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace default port-forward $LOKI_POD 3100:3100

Now, you can run the following LogQL query against the Loki server to return these logs. By default, Loki will look for logs at most an hour old. We will also be using jq to filter the response data.

Shell
curl -s -G --data-urlencode 'query={pod="questdb-0"}' http://localhost:3100/loki/api/v1/query_range | jq '.data.result[0].values'

You should see a list of logs with timestamps that correspond to the logs from the above sample:

[
  [
    "1670359425100049380",
    "2022-12-13T20:43:45.099494Z I http-server disconnected [ip=127.0.0.1, fd=23, src=queue]"
  ],
  [
    "1670359425099842047",
    "2022-12-13T20:43:45.099278Z I http-server scheduling disconnect [fd=23, reason=12]"
  ],
  ...

Grafana

Now that we have all of our observability components set up, we need an easy way to aggregate our metrics and logs into meaningful and actionable dashboards. We will install and configure Grafana inside your cluster to visualize your metrics and logs in one easy-to-use place.

Helm Installation

The loki-stack chart makes this very easy for us to do. We just need to enable Grafana by customizing the chart's values.yaml and upgrading it.

Shell
cat <<EOF > loki-values.yaml
---
grafana:
  enabled: true
EOF

With this setting enabled, not only are we installing Grafana, but we are also registering Loki as a data source in Grafana to save us the extra work. Now we can upgrade our Loki stack to include Grafana:

Shell
helm upgrade -f loki-values.yaml loki grafana/loki-stack

To get the admin password for Grafana, you can run the following command:

Shell
kubectl get secret --namespace default loki-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

And to access the Grafana front-end, you can use a port forward:

Shell
kubectl port-forward --namespace default service/loki-grafana 3000:80
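As an optional alternative to adding the Prometheus data source by hand in the Grafana UI (covered in the next section), the Grafana subchart can provision data sources declaratively. This is only a hedged sketch; the key names follow the upstream Grafana Helm chart and may differ between chart versions:

```yaml
# Possible addition to loki-values.yaml: provision the Prometheus data source at install time.
grafana:
  enabled: true
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          access: proxy
          # Internal cluster URL of the Prometheus server installed earlier
          url: http://prometheus-server.default.svc.cluster.local
          isDefault: false
```

If you go this route, re-run the helm upgrade command above with the updated values file; otherwise, the manual steps below achieve the same result.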
Configuration

First, navigate to http://localhost:3000 in your browser. You can log in using the username admin and the password that you obtained in the previous step. Once you've logged in, use the sidebar to navigate to the "data sources" tab. Here, you can see that the Loki data source is already registered for us.

We still need to add our Prometheus data source. Luckily, Grafana makes this easy for us. Click "Add Data Source" in the upper right and select "Prometheus". From here, the only thing you need to do is enter the internal cluster URL of your Prometheus server's Service: http://prometheus-server.default.svc.cluster.local. Scroll down to the bottom, click "Save & test", and wait for the green checkmark popup in the right corner. Now you're ready to create dashboards with QuestDB metrics and logs!

Conclusion

I have provided a step-by-step tutorial to install and deploy QuestDB with a monitoring infrastructure in a Kubernetes cluster. While there may be additional considerations to make if you want to improve the reliability of the monitoring components, you can get very far with a setup just like this one. Here are a few ideas:

- Add alerting to a number of targets using Alertmanager (a starter rule is sketched below)
- Build interactive dashboards that combine metrics and logs using Grafana variables
- Configure Loki to use alternative deployment modes to improve reliability and scalability
- Leverage Thanos to incorporate high availability into your Prometheus deployment

If you like this content, we'd love to know your thoughts! Feel free to share your feedback or just come and say hello in the QuestDB Community Slack.
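As a starting point for the alerting idea above, here is a hedged sketch of a Prometheus rule group that fires when the QuestDB scrape target stops responding. The threshold, labels, and annotation text are assumptions; with the chart used in this tutorial, rules like this are typically supplied through the Prometheus Helm chart's values (the exact key may vary by chart version), and Alertmanager then handles routing the notification:

```yaml
groups:
  - name: questdb.rules
    rules:
      - alert: QuestDBMetricsEndpointDown
        # "questdb" matches the job_name used in the scrape config earlier in this tutorial
        expr: up{job="questdb"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "QuestDB metrics endpoint has been unreachable for 5 minutes"
```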
Runbooks are documented procedures for the maintenance and upgrades of systems. Leverage runbooks during incident response and save your team's invaluable time.

Need for Runbooks

Imagine being an Ops engineer in a team just struck by tragedy. [sigh…] Alarms start ringing, and incident response is in full force. It may sound like the situation is in control. WRONG! There's panic everywhere. The on-call team is scrambling for the heavenly door to redemption. But the only thing that doesn't stop: stakeholder inquiries.

This situation is bad. But it could be worse. Now imagine being a less-experienced Ops engineer in a relatively small on-call team struck by tragedy. If you don't have sufficient guidance, let alone moral support, you're toast.

If being 'on-call during an incident' is a battle, then ensuring 'things don't go as bad the next time' is a war. And the formal term in IT for this 'war' is Site Reliability Engineering (SRE). The ability to do good SRE hinges on the ability of the response team to be 'prepared.' This is exactly what Runbooks can help SRE and response teams with. Runbooks can help teams be prepared…for just about anything.

What Are Runbooks?

GitLab has defined it best: Runbooks are a collection of documented procedures that explain how to carry out a particular process, be it starting, stopping, debugging, or troubleshooting a particular system.

Here's another apt definition specific to incident response: A Runbook is a compilation of routine procedures and operations that are documented for reference while working on a critical incident. It doesn't necessarily need to be a critical incident, though. A Runbook can also be used to document the standard procedures for maintenance and upgrades.

The schoolboy error that most teams commit while documenting such procedural actions is storing them in Google Docs, Notion, Confluence, or other tools. Or worse, storing them in physical books. Valuable time is lost looking for the right document instead of fixing the issue, and things go from bad to worse. The problem lies not in the document itself but in where it's documented. Hence, it is recommended to store such procedural documents in a centralized location accessible by the entire incident response team, within their incident response tool.

Incident response teams can thus leverage Squadcast's Runbooks for this purpose. Teams can store checklists of tasks that need to be performed manually for an incident. These checklists can be 'steps to be performed' in the event of a SEV-1 incident or during routine checks, maintenance, or upgrades, or they can be technical or functional instructions to debug and fix a certain issue, along with the code that needs to be manually executed. Leveraging Runbooks is thus a means to reduce MTTA and MTTR by avoiding delays scrambling for external documents. Instead, you can use Runbooks to store procedural steps, associate them with incidents, and assign tasks to relevant users.

Types of Runbooks

This is a good segue into understanding the different types of Runbooks.

1. Manual Runbooks (Interchangeably Referred to as Playbooks)

These are Runbooks that contain step-by-step instructions to be followed by an operator. They could include commands to be executed in the event of a certain condition.

2. Executable Runbooks

These are Runbooks that comprise a combination of manual and automated steps which can be executed directly from Squadcast.
While the steps and/or commands may be documented, operators might have to manually execute them.

3. Fully-Automated Runbooks

These are Runbooks that require no manual intervention and will be automatically executed based on preset conditions. They are part of workflows that get executed on the occurrence of a particular event, e.g., automatically restarting the server if the CPU spikes to 100% (a tool-agnostic sketch of such a runbook appears at the end of this article).

Runbooks: Use Cases

Now that we've established what a Runbook does, let's run through some use cases and understand how a Runbook comes to the rescue of response teams.

A Knowledge Base to Bounce Back From Incidents

When an incident occurs, the response team is expected to quickly spring into action and restore affected services. But how are they to know what steps to take, especially if they are inexperienced and new to the team? Even if the plan of action is as simple as having to restart a service, successful SRE teams document the necessary steps or prepare guidelines for the course of action. These could include the commands that need to be executed, names of senior personnel that need to be informed, best practices, etc. This is one such use case of leveraging Runbooks.

Documenting Standard Procedures During Maintenance/Upgrades

Performing maintenance and upgrades, daily backups, applying patches, updating the respective teams about scheduled downtimes, etc., are tasks to be performed routinely. Since such tasks require the same commands to be executed and the same course of action to be followed, response teams are better off documenting and templatizing these steps and executing them when necessary. This is a better alternative to researching the commands every single time, since unknown or unfamiliar commands can behave unexpectedly and potentially cause downtime. By using Runbooks to store commands and procedures to be followed during maintenance, response teams can complete the tasks quickly, as well as prevent any unexpected behavior from the system.

Preventing Unnecessary Escalations

At times when senior devs are on leave and not reachable for guidance, junior devs will most likely experience on-call stress. Instead of expecting junior devs to figure out remediation steps on their own, a better practice is to document response actions for common incidents and leverage that as a starting point for resolution. This way, stress levels for junior devs can be alleviated, and unnecessary escalations to senior devs for issues major and minor can be prevented.

Automation for Better SRE

One of the core tenets of SRE is automation. Some of the key objectives for an SRE are:

- Codifying actions into executable code.
- Setting up automated checklists that improve the speed of diagnosing and resolving incidents.
- Reducing toil while reviewing audit logs.

Be it a sequence of steps to be performed or a set of commands to be executed, Runbooks make all of this possible. Either by setting up Runbooks to execute manually or by enabling human-in-the-loop automation, SREs can gain a ton of value by leveraging Runbooks.

Conclusion

Runbooks help developers and SREs automate toil and give on-call teams access to templatized actions. The bigger the organization, the more important it becomes to implement Runbooks.
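To make the idea of codifying runbook actions concrete, here is a purely hypothetical, tool-agnostic sketch. This is not Squadcast's Runbook format, and every name, threshold, and command below is an assumption; the point is only to show what an automated "CPU spike" runbook might capture:

```yaml
# Hypothetical runbook definition, for illustration only (not tied to any specific tool).
runbook:
  name: api-high-cpu-restart
  trigger: cpu_utilization_percent > 90 for 10m   # preset condition for automatic execution
  steps:
    - name: capture-diagnostics
      run: kubectl top pods -n payments > /tmp/payments-top.txt
    - name: restart-affected-deployment
      run: kubectl rollout restart deployment/payments-api -n payments
    - name: notify-on-call-channel
      run: ./notify.sh "payments-api restarted automatically after a CPU spike"
```

Even when a runbook like this runs automatically, keeping the diagnostics and notification steps in it preserves the audit trail that the on-call team will want afterward.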
Recently, special emphasis has been placed on software supply chain security, as reflected in Google's recent 2022 Accelerate State of DevOps Report. With security holding center stage, we see a strong emergence of practices such as "SRE" and "DevSecOps." Looking at the State of DevOps reports from different companies such as Google, CircleCI, Puppet, and Dynatrace, we find some common findings, such as:

- Continued difficulty with delivering timely business value and innovation.
- A fragmented toolchain.
- Quality sacrifices.

To get more clarity behind these similar findings, let's understand what makes a DevOps pipeline and how we can strengthen it. Before our discussion begins, organizations that want to adopt DevOps should remember:

- What DevOps is
- What DevOps does
- What goals to set for successful DevOps adoption

Why Does DevOps Matter?

Today, software is no longer a mere support for business but has become an integral component of every part of a business. Just as companies have sped up the production of physical goods with industrial automation, we now need to transform how software is built and delivered. The goal of DevOps is to increase the velocity of software delivery whilst maintaining the stability and availability of your applications. To achieve this, it has two core principles:

- Elimination of silos among cross-functional teams.
- Bringing in automation.

What Makes DevOps?

First, let's understand what DevOps looks like in action in some instances. With microservices and continuous delivery, teams take ownership of services and then release their updates and patches quicker. Monitoring and logging practices help all teams stay informed of any deviations in performance in real time. This brings both the dev and ops teams together, breaking down silos.

DevOps is incomplete without automation. We need DevOps automation to reduce manual effort and also reduce errors. This also facilitates feedback loops between dev and ops teams so that iterative updates can be deployed faster to applications in production. Now, we are looking for that one practice, approach, or tool that brings all of this together with automation. It is the Continuous Integration and Continuous Delivery (CI/CD) pipeline. The CI/CD pipeline is how we remove the barriers and how the transition between dev-centric and ops-centric domains happens. The CI/CD pipeline has become the backbone of DevOps. It brings in DevOps automation in the form of automated tests and automated deployments.

Code deployed to production from a faulty CI/CD pipeline maximizes the potential for incidents and disruptions, since a faulty pipeline may have missed capturing errors. A failed CI/CD stage affects both velocity and stability in the DevOps pipeline. This automatically makes the CI/CD phase fragile. One commonly noted event is "flaky tests": the condition where certain automated tests in CI may randomly fail or succeed without any actual change in the code.

Differentiating the "CI" Part of the CI/CD Pipeline

Continuous integration consists of the version control system, automated code building, and automated testing. It becomes the single source of truth by providing a version control system and artifact repository. The prime motive of continuous integration is to ensure the changes in the code are continuously integrated and appropriately tested.
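As a minimal illustration of that CI loop, here is a hedged sketch of a GitHub Actions workflow; the language, commands, and branch name are assumptions, and any CI system with build and test stages follows the same basic shape:

```yaml
# Hypothetical CI workflow: build and test every change pushed to version control.
name: ci
on:
  push:
    branches: [main]
  pull_request:

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4     # pull the change from the version control system
      - uses: actions/setup-go@v5
        with:
          go-version: '1.21'
      - name: Build                   # automated code building
        run: go build ./...
      - name: Test                    # automated testing gates the change before delivery
        run: go test ./...
```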
To keep up with the rapid development pace, developers tend to override failing automated tests, ignoring the warning signs. This creates vulnerable gaps in the deployed code, which maximizes the chances of our environments getting corrupted and ultimately downgrades our trust in CI processes. How to Strengthen the DevOps Pipeline? The objective of the DevOps environment is to avoid the slippery slope of unknown failures. Considering the importance of the CI phase, we need to take appropriate measures to strengthen it. Furthermore, with an increased preference for cloud microservices, it has become increasingly difficult to debug CI tests in a distributed environment. Many of these issues have arisen due to the erosion of observability caused by the abstraction of the underlying layers of infrastructure when using the cloud. On the surface, using a microservices toolchain doesn't appear to demand the adoption of new social practices. Yet, as is commonly seen, by the time tech teams realize their old work habits are not helping them address the management costs introduced by this new technology, it is often too late. This makes it essential to understand why the successful adoption of cloud-native design patterns needs robust observable systems and the best DevOps and SRE practices. The Role of Observability in Strengthening the DevSecOps Pipeline The goal of observability is to enable people to reason about the internal state of their systems and applications by providing a level of self-examination. For example, we can employ a combination of logs, metrics, and traces as debugging signals. However, the goal of observability is agnostic to how it is accomplished. Looking back at monolithic systems, it was easier to anticipate the potential areas of failure, which made it easier to debug the entire system by ourselves. In addition, we were able to achieve the appropriate observability by using verbose application logging or coarse system-level metrics such as CPU utilization, combined with additional insights. Yet, these tools and techniques no longer work for the new management challenges created by cloud-native systems. If we draw a comparison between legacy technologies, like virtual machines and monolithic architectures, and present-day technology, like containerized microservices, we find it difficult to get data with old observability tactics. This is because of: growing complexity from interdependencies between components, transient state discarded after a container restart, and incompatible versioning between separately released components. To debug anomalous issues, engineers need a new set of capabilities to detect and understand issues from within their systems. Distributed tracing can help capture the state of system internals when specific events occur. This is done by adding context, in the form of key-value pairs, to each event, making it easier to capture all parts of our system. It is quite helpful in breaking down and visualizing each individual step, as it shows the impact of interdependent components while a specific request executes. The paradigm shift from cause-based monitoring toward symptom-based monitoring means that we need the ability to explain the failures we see in practice instead of enumerating a growing list of known failure modes.
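As a sketch of adding context as key-value pairs to events, this is roughly what attaching attributes to a span looks like with the OpenTelemetry Python API; the span name, attribute keys, and service name are illustrative, and a real setup would also configure an exporter so the spans leave the process.

```python
from opentelemetry import trace

# Assumes an SDK and exporter have been configured elsewhere; otherwise spans are no-ops.
tracer = trace.get_tracer("checkout-service")  # illustrative service name


def charge_order(order_id: str, amount_cents: int) -> None:
    # Each unit of work becomes a span; key-value attributes capture system state
    # at the moment the event occurred, so failures can be explained later.
    with tracer.start_as_current_span("charge_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.amount_cents", amount_cents)
        span.set_attribute("payment.provider", "example-psp")  # hypothetical value
        # ... call the payment provider here ...
```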
How Can Observability Empower the DevSecOps Pipeline? Chaos Engineering and Continuous Verification These practices detect the normal state of the system and how it deviates when a fault is introduced. The point is to understand your system's baseline state so you can explain deviations from expected behavior. Feature Flagging A feature flag is a software development technique for remotely enabling or disabling functionality without deploying code. New features are deployed without being made visible to users. This is preferred when certain code paths cannot be tested exhaustively in pre-prod environments and must be deployed to production to observe their collective impact. Progressive Release Patterns Deployment strategies like canary and blue/green deployments rely on precise observability to learn when to stop a release and to analyze any deviations it causes. Incident Analysis and Blameless Postmortems A postmortem is a very useful report that not only describes the technical fault but also supplements it with what the human operators thought the fault or error was. It facilitates the construction of clear models of your sociotechnical systems. Conclusion As DevOps and SRE practices continue to evolve, and with platform engineering growing as an umbrella discipline, it appears quite obvious that the future of tech looks more promising as innovative engineering practices emerge in our toolchains. However, all of this depends on having a better observability model as a core component for understanding modern complex systems.
Observability and monitoring are two different types of DevOps measurement contributing to continuous delivery. The two are often used interchangeably, but there are key differences that separate them. Observability is defined as the ability to understand a system's internal state by analyzing the data it generates, such as logs, metrics, and traces. Monitoring is closely related to observability. In general, monitoring is the collection and analysis of data pulled from IT systems. By analyzing information about the application's usage patterns, monitoring helps IT teams detect vulnerabilities beforehand and come up with an appropriate solution. The broader distinction between observability and monitoring use cases lies in the extent of coverage. While monitoring typically focuses on a singular aspect, observability provides end-to-end visibility over the area of interest. Let's have a quick look at the various observability and monitoring use cases. Observability Use Cases Application Modernization: To keep up with contemporary needs, businesses have rapidly updated their tech stacks. This covers the use of microservices, cloud infrastructure, SaaS delivery strategies, etc. Modern apps are very well matched to user demands, but managing them has become more and more challenging, particularly with specialized tools devoted to different domains. By offering a unified system and dashboard to manage the entire infrastructure, observability tackles this issue. As a result, you can find problems fast and fix them. Cloud Infrastructure Observability: The adoption of cloud-native technologies has become standard. In most circumstances, they are both far more dynamic and more economical. Cloud-native systems, however, are also extremely complex. There is always so much going on; one unfortunate side effect is frequently the emergence of undesirable operational silos. In the bid to manage these operational silos, teams often lose the understanding of how performance anomalies turn into a bad user experience down the line. Observability helps align efforts from various teams in cloud infrastructure by being a single source of insights to improve user experiences. Cost Optimization: Multiple factors affect revenue, including overprovisioning of resources, delays in identifying root causes, and loss of brand reputation due to bad user experiences. Full-stack observability into your application and IT infrastructure provides reliable and timely information. You can use it to avoid overprovisioning and board up all the money-leaking holes. You can also use it to bring synergy between the application and infrastructure teams and ensure they both work toward the same goal. Application Security: Modern businesses always run the risk of encountering cybercrime, especially because it's nearly impossible to oversee the security of complex cloud infrastructures manually. But observability can help make your application exponentially more secure. It can automatically detect issues and address them. Even better, it can identify many possible security loopholes and patch them up for you. Long-Term Trending: Monitoring an application or platform's performance over time is another beneficial application of observability. Changes can be detected, and patterns outside the target can be found, leading to a request for human intervention.
For instance, memory leaks in apps or services might be problematic even if they happen slowly. Resources can be adjusted to better suit the needs of apps that a larger number of people use. Monitoring Use Cases While observability takes care of broader business elements, monitoring is applied to specific aspects of the application and infrastructure. Let's have a quick look at some of the monitoring use cases. Availability Monitoring: Nothing is worse than the system going down. Availability monitoring is used to minimize site downtime and its negative consequences. It monitors the availability of infrastructure components such as servers and notifies the webmaster in case something goes wrong. Web Performance Monitoring: It keeps track of the online service's accessibility as well as other precise criteria. Factors like page load times, data transmission errors, and loading errors are tracked in this category. Application Performance Monitoring (APM): Mostly used for mobile apps, websites, and business applications, APM focuses on performance parameters that impact user experiences. It also covers the underlying infrastructure and dependencies of the application. API Monitoring: With most modern applications using APIs, it's critical to ensure that API resources are available, functional, and responsive. API monitoring gathers all the relevant metrics for an API to ensure unhindered operations. Real User Monitoring: Apps can perform differently in real-world scenarios compared to controlled environments. Real user monitoring gathers user interaction data and keeps you informed about whether the app performs the same way in both environments. It includes data around page load events, HTTP requests, application crashes, and more. Security Monitoring: Any stakeholder who uses the program will have a very miserable day at work if it is exposed to a security threat. Any security blunder can result in the demise of your application, whether it's caused by old hardware, negligent users, malware, or outside service providers. In order to spot and report any anomaly, security monitoring keeps an eye on the flow of information and events. Conclusion Both observability and monitoring have essential but slightly different roles to play in your DevOps strategy, and both help protect the security perimeter of your business from unauthorized users. Although observability and monitoring are frequently used to evaluate performance and health, they also help keep your IT systems secure. By increasing observability throughout your IT infrastructure, you can resolve problems swiftly and reduce the attack surface by removing exploitable routes. Additionally, logging can provide you with a more in-depth understanding of how your tech stack is used and can alert you to anomalies and unwanted access in advance. In this blog, we have highlighted the use cases for observability and monitoring. If you have any additional information, let us know in the comments below.
Software companies today often face two significant challenges: delivering at speed and innovating at scale. DevOps helps address these challenges by embedding automation throughout the software development lifecycle (SDLC) to develop and deliver high-quality software. Continuous Integration and Continuous Deployment (CI/CD) is the critical component of automation in a DevOps practice. It automates code builds, testing, and deployment so businesses can ship code changes faster and more reliably. However, one must continuously monitor the CI/CD pipeline to realize the DevOps promise. So, what is monitoring in DevOps, and how can businesses leverage it to tap DevOps' full potential? Let's dig deep. What Is DevOps Monitoring? At its core, the DevOps methodology is a data-driven approach. The ability to continuously improve software quality relies completely on understanding how the code performs, what issues it introduces, and where to find improvement opportunities. This is where DevOps monitoring comes into the picture. DevOps monitoring is the practice of tracking and measuring the performance and health of code across every phase of the DevOps lifecycle, from planning, development, integration, and testing to deployment and operations. It facilitates a real-time, easy-to-consume, single-pane-of-glass view of your application and infrastructure performance, so you can find significant threats early and fix them before they become a headache. DevOps monitoring gleans valuable data about everything from CPU utilization to storage space to application response times. Real-time streaming, visualizations, and historical replay are some key aspects of DevOps monitoring. What Is the Importance of DevOps Monitoring for Business Organizations? DevOps monitoring empowers business organizations to track, identify, and understand key metrics such as deployment frequency and failures, code error count, the cycle time of pull requests, change failure rate, mean time to detect (MTTD), mean time to mitigate (MTTM), and mean time to remediate (MTTR). These valuable insights enable you to proactively identify application or infrastructure issues and resolve them in real time. Monitoring also optimizes the DevOps toolchain by identifying opportunities for automation. Here are some of the key benefits that highlight the importance of DevOps monitoring for business organizations: 1. High Visibility The Continuous Integration/Continuous Deployment (CI/CD) facilitated by DevOps enables frequent code changes. However, the increased pace of code changes makes production environments increasingly complex. Moreover, introducing microservices and micro front-ends into the modern cloud-native ecosystem leads to various workloads operating in production, each with varying operational requirements of scale, redundancy, latency, and security. As a result, greater visibility into the DevOps ecosystem is crucial for teams to detect and respond to issues in real time. This is where continuous monitoring plays a key role. DevOps monitoring gives a real-time view of your application performance as you deploy new versions of code in various environments, so you can identify and remediate issues earlier in the process and continue to test and monitor subsequent code changes. Monitoring helps you validate new versions in real time to ensure they are performing as planned, so you can confidently release new deployments. 2.
Greater Collaboration The core principle of DevOps is to enable seamless collaboration between the development and operations teams. However, a lack of proper integration between tools can impede coordination between different teams. This is where DevOps monitoring comes in. You can leverage continuous monitoring to get a complete, unified view of the entire DevOps pipeline. You can even track commits and pull requests to update the status of related Jira issues and notify the team. 3. High Experimentation Ever-evolving customer needs demand that businesses constantly experiment in order to optimize their product lines through personalization and optimized conversion funnels. Teams often run hundreds of experiments and feature flags in production environments, making it difficult to identify the reason for any degraded experience. Moreover, the increasing customer demand for uninterrupted services and applications can add vulnerabilities to applications. Continuous monitoring can help you easily monitor the experiments and ensure they work as expected. 4. Manage Changes Typically, most production outages are triggered by frequent code changes. Therefore, it is imperative to implement change management, especially for mission-critical applications, such as banking and healthcare applications. One needs to determine the risks associated with changes and automate the approval flows based on the risk of the change. A comprehensive monitoring strategy can help you deal with these complexities; you only need a set of rich, flexible, and advanced monitoring tools. 5. Monitoring Distributed Systems Businesses often deal with distributed systems composed of many smaller, cross-company services. So, teams need to monitor and manage the performance of the systems they build and that of dependent systems. DevOps monitoring empowers you to deal with this dependent-system monitoring with ease. 6. Shift-Left Testing Testing, when shifted left, i.e., performed at the beginning of the software development lifecycle, can significantly improve code quality and reduce test cycles. However, shift-left testing can be implemented only when you streamline monitoring of the health of your pre-production environments and implement it early and frequently. Continuous monitoring also enables you to track user interactions and maintain application performance and availability before the application is deployed to production environments. Benefits of Unified Monitoring and Analytics Unified monitoring and analytics help your DevOps teams gain complete, unparalleled, end-to-end visibility across the entire software lifecycle. However, unifying monitoring data, analytics, and logs across your DevOps CI/CD ecosystem can be challenging and complex. Types of DevOps Monitoring Infrastructure Monitoring Every IT business must set up and maintain an IT infrastructure in order to deliver products and services in a seamless and efficient manner. Typically, IT infrastructure includes everything that relates to IT, such as servers, data centers, networks, storage systems, and computer hardware and software. DevOps monitoring helps in managing and monitoring this IT infrastructure, which is termed infrastructure monitoring. Infrastructure monitoring collects data from the IT infrastructure and analyzes it to derive deep insights that help in tracking the performance and availability of computer systems, networks, and other IT systems.
It also helps in gleaning real-time information on metrics such as CPU utilization, server availability, system memory, disk space, and network traffic. Infrastructure monitoring covers hardware monitoring, OS monitoring, network monitoring, and application monitoring. Some of the popular infrastructure monitoring tools are Nagios, Zabbix, ManageEngine OpManager, SolarWinds, and Prometheus. Application Monitoring Application monitoring helps DevOps teams track runtime metrics of application performance, like application uptime, security, and log monitoring details. Application Performance Monitoring (APM) tools are used to monitor a wide range of metrics, including transaction time and volume, API and system responses, and overall application health. These metrics are presented in the form of graphical figures and statistics, so that DevOps teams can easily evaluate application performance. Some of the popular application monitoring tools are AppDynamics, Dynatrace, Datadog, Uptime Robot, Uptrends, and Splunk. Network Monitoring Network monitoring tracks and monitors the performance and availability of the computer network and its components, such as firewalls, servers, routers, switches, and virtual machines (VMs). Typically, network monitoring systems provide five important functions, namely discover, map, monitor, alert, and report. Network monitoring helps identify network faults, measure performance, and optimize availability. This enables your DevOps teams to prevent network downtimes and failures. Some of the popular NMS tools are Cacti, Ntop, Nmap, Spiceworks, Wireshark, Traceroute, and Bandwidth Monitor. Difference Between DevOps Monitoring and Observability DevOps teams often use monitoring and observability interchangeably. While both concepts play a crucial role in ensuring the safety and security of your systems, data, and applications, monitoring and observability are complementary capabilities and are not the same. Let's understand how the two concepts differ. The differences between monitoring and observability depend on whether the data collected is predefined or not. While monitoring collects and analyzes predefined data gleaned from individual systems, observability collects all data produced by all IT systems. Monitoring tools often use dashboards to display performance metrics and other KPIs, so DevOps teams can easily identify and remediate any IT issues. However, metrics can only highlight the issues your team can anticipate, as they are the ones who create the dashboards. This makes it challenging for DevOps teams to monitor the security and performance posture of cloud-native environments and applications, as the issues are often multi-faceted and unpredictable. On the other hand, observability tools leverage logs, traces, and metrics collected from the entire IT infrastructure to identify issues and proactively notify the teams to mitigate them. While monitoring tools provide useful data, DevOps teams need to leverage observability tools to get actionable insights into the health of the entire IT infrastructure and detect bugs or vulnerable attack vectors at the first sign of abnormal performance. However, observability doesn't replace monitoring; rather, it facilitates better monitoring. The Best DevOps Monitoring Tools DevOps monitoring tools enable DevOps teams to implement continuous monitoring across the DevOps application development lifecycle and identify potential errors before releasing the code to production.
However, you need to select the monitoring tools that best suit your business objectives so that you can achieve quality products with minimal costs. Here are some of the best DevOps monitoring tools available in the market: The Top 10 DevOps Monitoring Tools: 1. Splunk Splunk is the most sought-after monitoring tool when it comes to machine-generated data. In addition to monitoring, this popular tool is also used for searching, analyzing, investigating, troubleshooting, alerting on, and reporting on machine-generated data. Splunk consolidates all the machine-generated data into a central index that enables DevOps teams to glean the required insights quickly. The enticing aspect of Splunk is that it does not leverage any database to store its data; instead, it uses indexes for data storage. The tool helps in creating graphs, dashboards, and interactive visualizations, so your team can easily access data and find solutions to complex problems. Some of the key features of Splunk are: Real-time data processing. The tool accepts input data in various formats, including CSV and JSON. The tool allows you to easily search and analyze a particular result. The tool allows you to troubleshoot any performance issue. You can monitor any business metric and make informed decisions. You can incorporate artificial intelligence into your data strategy with Splunk. 2. Datadog Datadog is a subscription-based SaaS platform that enables continuous monitoring of servers, applications, databases, tools, and services. This tool helps you foster a culture of observability, collaboration, and data sharing, so you can get quick feedback on operational changes and improve development velocity and agility. Some of the key features of Datadog are: Extensible instrumentation and open APIs. Autodiscovery for automatic configuration of monitoring checks. Monitoring-as-code integrations with configuration management and deployment tools. Easily customizable monitoring dashboards. 80+ turn-key integrations. Health and performance visibility into other DevOps tools. 3. Consul HashiCorp's Consul is an open-source monitoring tool to connect, configure, and secure services in dynamic infrastructure. The tool enables you to create a central registry that tracks applications, services, and health statuses in real time. Consul's built-in UI or APM integrations enable DevOps teams to monitor application performance and identify problem areas at the service level. The topology diagrams in the Consul UI help you visualize the communication flow between services registered in your mesh. Some of the key features of Consul are: The perfect tool for modern infrastructure. It provides a robust API. It is easy to find the services each application needs using DNS or HTTP. It supports multiple data centers. 4. Monit Monit is an open-source DevOps monitoring tool used for managing and monitoring Unix systems. Your team can leverage Monit for monitoring daemon processes such as those started at system boot time from /etc/init/, for instance, Sendmail, Apache, sshd, and MySQL. The tool can also be used for monitoring programs, files, directories, and filesystems on localhost and tracking changes, such as size changes, timestamp changes, and checksum changes. Moreover, you can also use Monit for monitoring general system resources on localhost, such as CPU usage, memory usage, and average load. Some key features of Monit are: The tool conducts automatic maintenance and repair.
It also executes insightful actions during any event. The tool has built-in network tests for key Internet protocols, such as HTTP and SMTP. It can be used to test programs or scripts at certain times. Monit is an autonomous system that does not rely on any plugins or special libraries to run. The tool easily compiles and runs on most flavors of Unix. 5. Nagios Nagios is one of the most popular DevOps monitoring tools. It is an open-source tool and is used for monitoring all mission-critical infrastructure components, including services, applications, operating systems, system metrics, network protocols, and network infrastructure. The tool facilitates both agent-based and agentless monitoring, making it easy to monitor Linux and Windows servers. With Nagios, your DevOps teams can monitor all sorts of applications, including Windows applications, UNIX applications, Linux applications, and web applications. Some key features of Nagios are: The tool supports hundreds of third-party add-ons so that you can monitor virtually anything: all in-house and external applications, services, and systems. It simplifies the log data sorting process. It offers high network visibility and scalability. It provides complete monitoring of Java Management Extensions. 6. Prometheus Prometheus is an open-source monitoring toolkit primarily developed for system monitoring and alerting. The tool collects and stores metrics information along with the timestamp at which it is recorded. Optional key-value pairs called labels are also stored with the metric information. The Prometheus tool ecosystem comprises multiple components, including the main Prometheus server for storing time series data, client libraries for instrumenting application code, a push gateway for handling short-lived jobs, and an alert manager for handling alerts. Some of the key features of the Prometheus tool are: The tool facilitates special-purpose exporters for services like StatsD, HAProxy, and Graphite. It supports Mac, Windows, and Linux. It facilitates monitoring of containerized environments such as Docker and Kubernetes. It easily integrates with configuration tools like Ansible, Puppet, Chef, and Salt. The tool does not rely on distributed storage. The Prometheus tool supports multiple modes of graphing and dashboarding. 7. Sensu Sensu by Sumo Logic is a monitoring-as-code solution for mission-critical systems. This end-to-end observability pipeline enables your DevOps and SRE teams to collect, filter, and transform monitoring events and send them to the database of their choice. With a single Sensu cluster, you can easily monitor tens of thousands of nodes and quickly process over 100M events per hour. The tool facilitates enterprise-grade monitoring of production workloads, providing true multi-tenancy and multi-cluster visibility into your entire infrastructure. Some of the key features of the Sensu tool are: The tool supports external PostgreSQL databases, allowing you to scale Sensu limitlessly. Sensu's built-in etcd handles 10K connected devices and 40K agents per cluster. The tool offers declarative configurations and a service-based approach to monitoring. It easily integrates with other DevOps monitoring solutions like Splunk, PagerDuty, ServiceNow, and Elasticsearch. 8. Sematext Sematext is a one-stop solution for all your DevOps monitoring needs.
Unlike other monitoring tools, which offer only performance monitoring, only logging, or only experience monitoring, Sematext offers all the monitoring solutions your DevOps team needs to troubleshoot production and performance issues and move faster. With Sematext, your DevOps teams can monitor application performance, logs, metrics, real users, processes, servers, containers, databases, networks, inventory, alerts, events, and APIs. You can also do log management, synthetic monitoring, and JVM monitoring, among many other operations. Some of the key features of the Sematext tool are: The tool empowers you to map and monitor your entire infrastructure in real time. Sematext provides better visibility for DevOps teams, system admins, SREs, and BizOps. The tool offers fully managed Elasticsearch and Kibana, so you don't need to spend on highly expensive Elasticsearch expert staff and infrastructure. The tool allows you to set up your free account in less than ten minutes. Sematext makes integration with external systems a breeze. 9. PagerDuty PagerDuty is an operations performance monitoring tool that enables your DevOps teams to assess the reliability and performance of their applications. The tool keeps your DevOps team connected with their code in production, leverages machine learning technology to identify issues, and alerts the team to address the errors as early as possible. That means your DevOps team spends less time responding to incidents and has more time for building and innovating. Some of the key features of the PagerDuty tool are: PagerDuty comes with an intuitive alerting API, making it an excellent, easy-to-use incident response and alerting system. If an alert is not acknowledged within the predefined time, the tool auto-escalates it according to the originally established SLA. The tool supports data collection through a pull model over HTTP. PagerDuty works as autonomous single-server nodes with no dependency on distributed storage. It is a robust GUI tool for scheduling and escalation policies. The tool also supports multiple modes for dashboards and graphs. 10. AppDynamics AppDynamics is one of the most popular application performance monitoring tools available in the market. As a continuous monitoring tool, AppDynamics helps monitor your end users, applications, SAP, network, database, and infrastructure in both cloud and on-premises computing environments. With this tool, your DevOps team can easily gain complete visibility across servers, networks, containers, infrastructure components, applications, end-user sessions, and database transactions, so they can swiftly respond to performance issues. Some of the key features of the AppDynamics tool are: The tool seamlessly integrates with the world's best technologies, such as AWS, Azure, Google Cloud, IBM, and Kubernetes. AppDynamics leverages machine learning to deliver instant root-cause diagnostics. The tool supports hybrid environment monitoring. Cisco offers full-stack observability with AppDynamics. The tool comes with a pay-per-use pricing model. DevOps Monitoring Use Cases: Real Examples of How Enterprises Use Monitoring Tools There's no question that DevOps monitoring tools enable your DevOps team to automate monitoring processes across the software development lifecycle. The monitoring tools enable your DevOps teams to identify code errors early, run code operations efficiently, and respond rapidly to changes in code usage. However, one must implement monitoring tools effectively to ensure complete success.
Here are some prominent DevOps monitoring use cases that you can leverage to achieve DevOps success: Git Workflow Monitoring DevOps teams often encounter recurring codebase conflicts as a result of multiple developers working on the same project functionality simultaneously. Git enables your DevOps teams to manage and resolve conflicts, including commits and rollbacks. So, when you monitor your Git workflows, you can easily keep code conflicts in check and ensure consistent progress in your project. Code Linting Code linting tools help your DevOps team analyze the code for style, syntax, and potential issues. With these tools, your DevOps team can ensure that they are adhering to coding best practices and standards. Code linting enables you to identify and address code issues before they trigger runtime errors and other potential performance issues. With linting tools, you can ensure that your code is clean and consistent. Distributed Tracing Your DevOps teams need distributed tracing to streamline the monitoring and debugging of microservices applications. Distributed tracing helps your team understand how applications interact with each other through APIs, making it easier to identify and address application performance issues. Continuous Integration/Continuous Deployment (CI/CD) Logs With CI/CD pipelines becoming a prominent element of the DevOps ecosystem, monitoring them is imperative for DevOps success. The continuous integration (CI) logs help ensure that your code builds are running smoothly; otherwise, the logs inform you about the errors or warnings in your code builds. So, monitoring the CI logs helps identify potential issues in your build pipeline and address them proactively. Likewise, the continuous deployment (CD) logs inform you about the overall pipeline health and status. So, monitoring the CD logs helps your DevOps teams easily troubleshoot any failed deployments and repair potential issues. Configuration Management Changelogs Configuration management changelogs help DevOps teams gain deep visibility into the system's health and important changes, both manual and automated. So, monitoring these logs empowers your team to track the changes made to the system, identify unauthorized changes, and rectify the issues. Code Instrumentation Code instrumentation is the process of adding measurement code to an application, enabling you to collect data about the application's performance and the route its operations take. This is crucial for tracing stack calls and knowing the contextual values. So, monitoring code instrumentation results empowers you to measure the efficiency of your DevOps practices and gain visibility into potential gaps, if any. It also helps you identify bugs and improve testing.
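As one hedged example of code instrumentation, the sketch below uses the Python prometheus_client library to expose a request counter and a latency histogram; the metric names, port, and simulated work are illustrative, and any backend compatible with your monitoring stack would collect the results in a similar way.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; a scraper such as Prometheus would collect these.
REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")


@LATENCY.time()            # records how long each call takes
def handle_request() -> None:
    REQUESTS.inc()         # counts every call
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work


if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        handle_request()
```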
Define Development Goals Your DevOps monitoring strategy must be anchored to fixed development goals. These objectives help you understand how well your DevOps monitoring strategy is performing. A commonly used method to ensure these objectives are met is to track each sprint's duration and measure the time taken to identify, document, and rectify issues. Leveraging machine learning technology to automate configuration processes helps you save significant time and avoid manual errors. Monitor User Activity Monitoring user activity is one of the most important monitoring types. It helps you track unusual requests, multiple login attempts, logins from unknown devices, and any suspicious user activity, like a developer trying to access the admin account. By monitoring user activity, you can ensure that the right user is accessing the right resources. This process helps in preventing potential threats to the system and mitigating cyberattacks. Choose the Right Monitoring Tools Selecting the right set of DevOps monitoring tools from the rich choice of tooling available in the DevOps ecosystem is an arduous task. Picking the precise tool that is most suitable for your SDLC and your application's infrastructure starts with an evaluation process. It primarily involves understanding the tool's features and functionality so you can easily assess whether it is best suited for application or infrastructure monitoring. So, here are some questions you need to ask to evaluate a DevOps monitoring tool: Does the tool integrate easily? Ensure that the monitoring tool easily integrates with your DevOps pipeline and your broader technology stack. This helps you automate actions and alerts with ease. Does the tool offer something new? The DevOps monitoring tools that glean a rich amount of data are a cut above the rest. However, more data demands more attention, uses more storage, and needs more management. So, select monitoring tools that pave the way for new avenues of monitoring rather than those that provide only routine benefits. Does the tool offer a unified dashboard? Your DevOps ecosystem comprises many services, libraries, and products working together. So, a DevOps monitoring tool that offers a unified dashboard helps you gain complete, real-time visibility across the DevOps lifecycle and makes it easier to identify issues and gaps. Does the tool integrate alerts with your existing tooling? Your DevOps monitoring tools must enable your DevOps teams to respond quickly to alerts and notifications. Check whether the tool supports alerting directly or integrates with your existing notification tools. Also, ensure that the tooling you're evaluating integrates with your organization's existing reporting and analytics tools. What type of audit logs does the tool provide? Understanding the current state of your system is important, especially when something goes south. The action-by-action record provided by audit logs enables you to understand what has happened, identify which process or person is responsible, analyze the root cause, and provide a basis for learning about gaps in the system. So, what type of audit logs does your tool provide, and how do they surface crucial information? What are the tool's data storage needs? DevOps monitoring tools generate massive amounts of data. So, it is important to understand the storage needs of the tool and the cloud storage costs, so you can keep useful history without storing data beyond its useful life. What types of diagnostics does the tool offer?
Check whether the monitoring tool only alerts you to symptoms or also helps you diagnose the underlying issue. Choose comprehensive tools, such as application performance monitoring platforms, to understand what's happening in complex scenarios, such as several asynchronous microservices working together.
This is an article from DZone's 2022 Performance and Site Reliability Trend Report. For more: Read the Report. Software testing is straightforward: every input should produce a known output. However, historically, a great deal of testing has been guesswork. We create user journeys, estimate load and think time, run tests, and compare the current result with the baseline. If we don't spot regressions, the build gets a thumbs up, and we move on. If there is a regression, back it goes. Most of the time, we already know the output, even though it needs to be better defined: less ambiguous, with clear boundaries for where a regression falls. Here is where machine learning (ML) systems and predictive analytics enter: to end ambiguity. After tests finish, performance engineers do more than look at result averages and means; they look at percentiles. For example, 10 percent of the slowest requests are caused by a system bug that creates a condition that always impacts speed. We could manually correlate the properties available in the data; nevertheless, ML will link data properties quicker than you probably would. After determining the conditions that caused the 10 percent of bad requests, performance engineers can build test scenarios to reproduce the behavior. Running the test before and after the fix will assert that it's corrected. Figure 1: Overall confidence in performance metrics (Source: Data from TechBeacon) Performance With Machine Learning and Data Science Machine learning helps software development evolve, making technology sturdier and better able to meet users' needs in different domains and industries. We can expose cause-effect patterns by feeding data from the pipeline and environments into deep learning algorithms. Predictive analytics algorithms, paired with performance engineering methodologies, allow more efficient and faster throughput, offering insight into how end users will use the software in the wild and helping you reduce the probability of defects reaching production. By identifying issues and their causes early, you can make course corrections early in the development lifecycle and prevent an impact on production. You can draw on predictive analytics to improve your application performance in the following ways. Identify root causes. You can focus on other areas needing attention by using machine learning techniques to identify root causes of availability or performance problems. Predictive analytics can then analyze each cluster's various features, providing insights into the changes we need to make to reach the ideal performance and avoid bottlenecks. Monitor application health. Performing real-time application monitoring using machine learning techniques allows organizations to catch and respond to degradation promptly. Most applications rely on multiple services to get the complete application's status; predictive analytics models will correlate and analyze the data when the application is healthy to identify whether incoming data is an outlier. Predict user load. We have relied on peak user traffic to size our infrastructure for the number of users accessing the application in the future. This approach has limitations, as it does not consider changes or other unknown factors. Predictive analytics can help indicate the user load and better prepare to handle it, helping teams plan their infrastructure requirements and capacity utilization. Predict outages before it's too late. Predicting application downtime or outages before they happen helps teams take preventive action.
The predictive analytics model will follow the breadcrumbs of previous outages and continue monitoring for similar circumstances to predict future failures. Stop looking at thresholds and start analyzing data. Observability and monitoring generate large amounts of data that can take up several hundred megabytes a week. Even with modern analytic tools, you must know what you're looking for in advance. This leads to teams not looking directly at the data but instead setting thresholds as triggers for action. Even mature teams look for exceptions rather than diving into their data. To mitigate this, we integrate models with the available data sources. The models will then sift through the data and calculate the thresholds over time. Using this technique, where models are fed historical data and aggregate it, provides thresholds based on seasonality rather than thresholds set by humans. Algorithm-set thresholds trigger fewer alerts; however, these are far more actionable and valuable (a minimal sketch of this idea appears below). Analyze and correlate across datasets. Your data is mostly time series, making it easier to look at a single variable over time. Many trends come from the interactions of multiple measures. For example, response time may drop only when various transactions are made simultaneously with the same target. For a human, that's almost impossible to spot, but properly trained algorithms will find these correlations. The Importance of Data in Predictive Analytics "Big data" often refers to datasets that are, well, big, come in at a fast pace, and are highly variable in content. Their analysis requires specialized methods so that we can extract patterns and information from them. Recently, improvements in storage, processors, parallelization of processes, and algorithm design have enabled the processing of large quantities of data in a reasonable time, allowing wider use of these methods. To get meaningful results, you must ensure the data is consistent. For example, each project must use the same ranking system, so if one project uses 1 as critical and another uses 5 (like when people say DEFCON 5 when they mean DEFCON 1), the values must be normalized before processing. Predictive systems are composed of the algorithm and the data it's fed, and software development generates immense amounts of data that, until recently, sat idle, waiting to be deleted. However, predictive analytics algorithms can process those files, find patterns we can't detect, and ask and answer questions based on that data, such as: Are we wasting time testing scenarios that aren't used? How do performance improvements correlate with user happiness? How long will it take to fix a specific defect? These questions and their answers are what predictive analytics is used for: to better understand what is likely to happen. The Algorithms The other main component in predictive analysis is the algorithm; you'll want to select or implement it carefully. Starting simple is vital, as models tend to grow in complexity, becoming more sensitive to changes in the input data and distorting predictions. Algorithms can solve two categories of problems: classification and regression (see Figure 2). Classification is used to forecast the outcome of a set by assigning it to categories, starting by inferring labels, such as "down" or "up," from the input data. Regression is used to forecast the result of a set when the output variable is a set of real values. It processes input data to predict, for example, the amount of memory used, the lines of code written by a developer, etc.
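As promised above, here is a minimal sketch of letting historical data set the threshold instead of a human: a per-hour baseline and tolerance band are derived from past measurements, so alerts reflect seasonality. The hourly data shape and the three-sigma band are assumptions made purely for illustration.

```python
import numpy as np


def seasonal_thresholds(history: np.ndarray, period: int = 24, sigmas: float = 3.0):
    """history: 1-D array of hourly measurements (e.g., response times in ms).
    Returns a per-hour-of-day baseline and upper threshold learned from the data."""
    trimmed = history[: len(history) // period * period].reshape(-1, period)
    baseline = trimmed.mean(axis=0)   # typical value for each hour of the day
    spread = trimmed.std(axis=0)      # how much that hour normally varies
    return baseline, baseline + sigmas * spread


def is_anomalous(value: float, hour_of_day: int, upper: np.ndarray) -> bool:
    # Alert only when a reading exceeds the learned band for that hour.
    return value > upper[hour_of_day]


# Example: two weeks of synthetic hourly latencies with a daily cycle.
rng = np.random.default_rng(0)
hours = np.arange(14 * 24)
history = 200 + 50 * np.sin(hours * 2 * np.pi / 24) + rng.normal(0, 10, hours.size)
baseline, upper = seasonal_thresholds(history)
print(is_anomalous(450.0, hour_of_day=3, upper=upper))
```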
The most used prediction models are neural networks, decision trees, and linear and logistic regression. Figure 2: Classification vs. regression Neural Networks Neural networks learn by example and solve problems using historical and present data to forecast future values. Their architecture allows them to identify intricate relations lurking in the data in a way that replicates how our brain detects patterns. They contain many layers that accept data, compute predictions, and provide the output as a single prediction. Decision Trees A decision tree is an analytics method that presents the results in a series of if/then choices to forecast the potential risks and benefits of specific options. It can solve all kinds of classification problems and answer complex questions. As shown in Figure 3, decision trees resemble an upside-down tree produced by algorithms identifying various ways of splitting data into branch-like segments that illustrate a future decision and help to identify the decision path. One branch in the tree might be users who abandoned the cart if it took more than three seconds to load. Below that one, another branch might indicate whether they identify as female. A "yes" answer would raise the risk, as analytics show that females are more prone to impulse buys, and the delay creates a pause for pondering. Figure 3: Decision tree example Linear and Logistic Regression Regression is one of the most popular statistical methods. It is crucial when estimating numerical values, such as how many resources per service we will need to add during Black Friday. Many regression algorithms are designed to estimate the relationships among variables, finding key patterns in big and mixed datasets and how they relate. It ranges from simple linear regression models, which calculate a straight-line function that fits the data, to logistic regression, which calculates a curve (Figure 4). Overview of linear and logistic regression: Linear regression is used to define a value on a continuous range, such as the risk of user traffic peaks in the following months; it's a statistical method where the parameters are predicted based on older data sets. It's expressed as y = a + bx, where x is an input set used to determine the output y; the coefficients a and b quantify the relation between x and y, where a is the intercept and b is the slope of the line. It uses training data to calculate the coefficients, minimizing the error between predicted and actual outcomes; the goal is to fit a line nearest to most points, reducing the distance, or error, between y and the line. Logistic regression, by contrast, best suits binary classification: datasets where y = 0 or 1, with 1 representing the default class; its name derives from its transformation function being a logistic function. It's expressed by the logistic function y = e^(β0 + β1x) / (1 + e^(β0 + β1x)), where β0 is the intercept and β1 is the rate. It forms an S-shaped curve where a threshold is applied to transform the probability into a binary classification. Figure 4: Linear regression vs. logistic regression Both are supervised learning methods, as the algorithm solves for a specific property. Unsupervised learning is used when you don't have a particular outcome in mind but want to identify possible patterns or trends. In this case, the model will analyze as many combinations of features as possible to find correlations from which humans can act. Figure 5: Supervised vs. unsupervised learning
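To ground the comparison of linear and logistic regression, here is a small sketch using scikit-learn with synthetic data: the linear model predicts a continuous value (memory used for a given number of users), while the logistic model outputs the probability of a binary outcome (whether a request fails at a given latency). The feature choices and numbers are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(42)

# Linear regression: predict memory used (MB) from concurrent users (continuous output).
users = rng.uniform(10, 1000, size=(200, 1))
memory_mb = 150 + 0.8 * users[:, 0] + rng.normal(0, 20, 200)
lin = LinearRegression().fit(users, memory_mb)            # learns y = a + b*x
print("predicted MB at 500 users:", lin.predict([[500]])[0])

# Logistic regression: predict whether a request fails (binary output, 0 or 1).
latency_ms = rng.uniform(50, 3000, size=(200, 1))
failed = (latency_ms[:, 0] + rng.normal(0, 300, 200) > 2000).astype(int)
log = LogisticRegression(max_iter=1000).fit(latency_ms, failed)  # learns an S-shaped curve
print("failure probability at 2500 ms:", log.predict_proba([[2500]])[0, 1])
```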
Shifting Left in Performance Engineering Using the previous algorithms to gauge consumer sentiment on products and applications makes performance engineering more consumer-centric. After all the information is collected, it must be stored and analyzed through appropriate tools and algorithms. This data can include error logs, test cases, test results, production incidents, application log files, project documentation, event logs, tracing, and more. We can then apply these algorithms to the data to get various insights: analyze defects in environments, estimate the impact on customer experience, identify issue patterns, create more accurate test scenarios, and much more. This technique supports the shift-left approach to quality, allowing you to predict how long it will take to do performance testing, how many defects you are likely to identify, and how many defects might make it to production, achieving better coverage from performance tests and creating realistic user journeys. Issues such as usability, compatibility, performance, and security are prevented and corrected without impacting users. Here are some examples of information that will improve quality: the type of defect, the phase in which the defect was identified, the root cause of the defect, and whether the defect is reproducible. Once you understand this, you can make changes and create tests to prevent similar issues sooner. Conclusion Software engineers have made hundreds of thousands of assumptions since the dawn of programming. But digital users are now more aware and have a lower tolerance for bugs and failures. Businesses are also competing to deliver a more engaging and flawless user experience through tailored services and complex software that is becoming more difficult to test. Today, everything needs to work seamlessly and support all popular browsers, mobile devices, and apps. A crash of even a few minutes can cause a loss of thousands or millions of dollars. To prevent issues, teams must incorporate observability solutions and user experience throughout the software lifecycle. Managing the quality and performance of complex systems requires more than simply executing test cases and running load tests. Trends help you tell whether a situation is under control, getting better, or getting worse, and how fast it is improving or worsening. Machine learning techniques can help predict performance problems, allowing teams to course correct. To quote Benjamin Franklin, "An ounce of prevention is worth a pound of cure." This is an article from DZone's 2022 Performance and Site Reliability Trend Report. For more: Read the Report
Joana Carvalho
Performance Engineer,
Postman
Eric D. Schabell
Director Technical Marketing & Evangelism,
Chronosphere
Chris Ward
Zone Leader,
DZone
Ted Young
Director of Open Source Development,
LightStep