Performance refers to how well an application conducts itself compared to an expected level of service. Today's environments are increasingly complex and typically involve loosely coupled architectures, making it difficult to pinpoint bottlenecks in your system. Whatever your performance troubles, this Zone has you covered with everything from root cause analysis, application monitoring, and log management to anomaly detection, observability, and performance testing.
Are you looking to get away from proprietary instrumentation? Are you interested in open-source observability but lack the knowledge to just dive right in? If so, this workshop is for you, designed to expand your knowledge and understanding of the open-source observability tooling that is available to you today. Dive right into a free, online, self-paced, hands-on workshop introducing you to Prometheus. Prometheus is an open-source systems monitoring and alerting toolkit that enables you to hit the ground running with discovering, collecting, and querying your observability data today. Over the course of this workshop, you will learn what Prometheus is and is not, install it, start collecting metrics, and learn all the things you need to know to become effective at running Prometheus in your observability stack.

In this article, you'll be introduced to some basic concepts and learn what Prometheus is and is not before you start getting hands-on with it in the rest of the workshop.

Introduction to Prometheus

I'm going to get you started on your learning path with this first lab, which provides a quick introduction to everything needed for metrics monitoring with Prometheus. Note that this article is only a short summary, so please see the complete lab found online here to work through it in its entirety yourself. The following is a short overview of what is in this specific lab of the workshop. Each lab starts with a goal. In this case, it is fairly simple: this lab introduces you to the Prometheus project and provides you with an understanding of its role in the cloud-native observability community.

The lab starts with background on the beginnings of the Prometheus project and how it came to be part of the Cloud Native Computing Foundation (CNCF) as a graduated project. This leads to a basic outline of what a data point is, how data points are gathered, and what makes them a metric, all using a high-level metaphor. You are then walked through what Prometheus is, why we are looking at this project as an open-source solution for your cloud-native observability needs, and, more importantly, what Prometheus cannot do for you. A basic architecture is presented, walking you through the most common usage and components of a Prometheus metrics deployment. Below you see the final overview of the Prometheus architecture.

You are then presented with an overview of all the powerful features and tools you'll find in your new Prometheus toolbox:

- Dimensional data model - For multi-faceted tracking of metrics
- Query language - PromQL provides a powerful syntax to gather flexible answers across your gathered metrics data
- Time series processing - Integration of metrics time series data processing and alerting
- Service discovery - Integrated discovery of systems and services in dynamic environments
- Simplicity and efficiency - Operational ease combined with implementation in the Go language

Finally, you'll touch on the fact that Prometheus has a very simple design and functioning principle, and that this has an impact on running it as a highly available (HA) component in your architecture. This aspect is only briefly touched upon, but don't worry: we cover it in more depth later in the workshop. At the end of each lab, including this one, you are presented with the end state (in this case, we have not yet done anything), a list of references for further reading, a list of ways to contact me with questions, and a link to the next lab.

Missed Previous Labs?

This is one lab in the more extensive free online workshop.
Feel free to start from the very beginning of this workshop here if you missed anything previously. You can always proceed at your own pace and return any time you like as you work your way through this workshop; just stop and later restart your lab environment to pick up where you left off.

Coming up Next

I'll be taking you through the next lab in this workshop, where you'll learn how to install and set up Prometheus on your own local machine. Stay tuned for more hands-on material to help you with your cloud-native observability journey.
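Before you head into the labs, here is a small taste of where the workshop is going. This is not part of the lab material; it is a hedged sketch that assumes a Prometheus server is already running on its default address (localhost:9090) and uses Prometheus's standard HTTP API to run a simple PromQL query.

```python
import json
import urllib.parse
import urllib.request

# Assumes a Prometheus server on its default port; adjust as needed.
PROMETHEUS_URL = "http://localhost:9090"

def instant_query(promql: str) -> dict:
    """Run an instant PromQL query against Prometheus's HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(f"{PROMETHEUS_URL}/api/v1/query?{params}") as resp:
        return json.load(resp)

if __name__ == "__main__":
    # 'up' reports 1 for every target Prometheus is successfully scraping.
    result = instant_query("up")
    for series in result["data"]["result"]:
        print(series["metric"].get("instance"), "=>", series["value"][1])
```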
Effective resource management is essential to ensure that no single client or task monopolizes resources and causes performance issues for others. Shuffle-sharding is a valuable technique to achieve this. By dividing resources into equal segments and periodically shuffling them, shuffle-sharding can distribute resources evenly and prevent any client or task from relying on a specific segment for too long. This technique is especially useful in scenarios with a risk of bad actors or misbehaving clients or tasks. In this article, we'll explore shuffle-sharding in depth, discussing how it balances resources and improves overall system performance.

Model

Before implementing shuffle-sharding, it's important to understand its key dimensions, parameters, trade-offs, and potential outcomes. Building a model and simulating different scenarios can help you develop a deeper understanding of how shuffle-sharding works and how it may impact your system's performance and availability. That's why we'll explore shuffle-sharding in more detail, using a Colab notebook as our playground. We'll discuss its benefits, limitations, and the factors to consider before implementing it. By the end of this post, you'll have a better idea of what shuffle-sharding can and can't do and whether it's a suitable technique for your specific use case.

In practical applications, shuffle-sharding is often used to distribute available resources evenly among different queries or tasks. This can involve mapping different clients or connections to subsets of nodes or containers, or assigning specific cores to different query types (or 'queries' for short). In our simulation, we linked queries to CPU cores. The goal is to ensure that the available CPU resources are shared fairly among all queries, preventing any query from taking over the resources and negatively impacting the performance of others. To achieve this, each query is limited to only 25% of the available cores, and no two queries have more than one core in common. This helps to minimize overlap between queries and prevent any one query from consuming more than its fair share of resources. Here is a visualization of how the cores (columns) are allocated to each query type (rows) and how overlap between them is minimized (each query has exactly three cores assigned).

The maximum overlap between rows is just one bit (i.e., 33% of the assigned cores), and the average overlap is ~0.5 bits (less than 20% of assigned cores). This means that even if one query type were to take over 100% of its allocated cores, the others would still have enough capacity to run, unlike uniform assignment, where a rogue query could monopolize the whole node's CPU.

To evaluate the impact of different factors on the performance of the system, we conducted four simulations, each with different dimensions:

- Uniform query assignment, where any query type can be assigned to any core, vs. shuffle-sharding assignment, where queries are assigned based on shuffle-sharding principles.
- Baseline, where all queries are well-behaved, vs. the presence of a bad query type that takes 100% of the CPU resources and never completes.

Let's take a look at the error rate (which doesn't include the bad query type, as it fails in 100% of cases). Looking at the error rate plot, we can observe that the Baseline Uniform scenario has a slightly higher saturation point than the Baseline Shuffle-Sharding scenario, reaching around a 5% higher query rate before the system starts to degrade.
This is expected, as shuffle-sharding partitions the CPU cores into smaller sections, which can reduce the efficiency of the resource allocation when the system is near its full capacity. However, when comparing the performance of Uniform vs. Shuffle-Sharding in the presence of a noisy neighbor that seizes all the available resources, we see that Shuffle-Sharding outperforms Uniform by approximately 25%. This demonstrates that the benefits of shuffle-sharding in preventing resource takeover and ensuring fair resource allocation outweigh the minor reduction in efficiency under normal operating conditions. In engineering, trade-offs are a fact of life, and shuffle-sharding is no exception. While it may decrease the saturation point during normal operations, it significantly reduces the risk of outages when things don't go as planned — which is inevitable sooner or later.

System Throughput

In addition to error rates, another key metric for evaluating the performance of a system is throughput, which measures the number of queries the system can handle depending on the QPS rate. To analyze the system's throughput, we looked at the same data from a different angle. In the plot below, we can see a slight difference between the Baseline Uniform and Baseline Shuffle-Sharding scenarios, where Uniform slightly outperforms Sharding at low QPS rates. However, the difference becomes much more significant when we introduce a faulty client/task/query that monopolizes all the available resources. In this scenario, Shuffle-Sharding outperforms Uniform by a considerable margin.

Latency

Now let's look at the latency graphs, which show the average, median (p50), and p90 latency of the different scenarios. In the Uniform scenario, we can see that the latency of all requests approaches the timeout threshold pretty quickly at all levels. This demonstrates that resource monopolization can have a significant impact on the performance of the entire system, even for well-behaved queries. In the Sharding scenario, we can observe that the system handles the situation much more effectively and keeps the latency of well-behaved queries steady, as if nothing had happened, until it reaches a saturation point, which is very close to the total system capacity. This is an impressive result, highlighting the benefits of shuffle-sharding in isolating the latency impact of a noisy or misbehaving neighbor.

CPU Utilization

At the heart of shuffle-sharding is the idea of distributing resources so that the whole ship doesn't sink; at worst, only one section becomes flooded. To illustrate this concept, let's look at the simulated CPU data. In the Uniform simulation, CPU saturation occurs almost instantly, even with low QPS rates. This highlights how resource monopolization can significantly impact system performance, even under minimal load. However, in the Sharding simulation, the system maintains consistent and reliable performance, even under challenging conditions. These simulation results align with the latency and error graphs we saw earlier — the bad actor was isolated and only impacted 25% of the system's capacity, leaving the remaining 75% available for well-behaved queries.

Closing Thoughts

In conclusion, shuffle-sharding is a valuable technique for balancing limited resources between multiple clients or tasks in distributed systems.
Its ability to prevent resource monopolization and ensure fair resource allocation can improve system stability and maintain consistent and reliable performance, even in the presence of faulty clients, tasks, or queries. Additionally, shuffle-sharding can help reduce the blast radius of faults and improve system isolation, highlighting its importance in designing more stable and reliable distributed systems. Of course, in the event of outages, other measures should be applied, such as rate-limiting the offending client/task or moving it to dedicated capacity to minimize system impact. Effective operational practices are critical to maximize the benefits of shuffle-sharding. For other techniques that can be used in conjunction with shuffle-sharding, check out the links below. Also, feel free to play around with the simulation and change the parameters such as the number of query types, cores, etc. to get a sense of the model and how different parameters may affect it. This post continues the theme of improving service performance/availability touched on in previous posts Ensuring Predictable Performance in Distributed Systems, Navigating the Benefits and Risks of Request Hedging for Network Services, and FIFO vs. LIFO: Which Queueing Strategy Is Better for Availability and Latency?.
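If you'd like a feel for how such an assignment can be generated before opening the notebook, here is a minimal Python sketch (not the notebook's actual code) of building shards with bounded overlap. The core count (12) and shard size (3) are assumptions chosen to match the "~25% of cores, exactly three cores per query type" setup described above; everything else is illustrative.

```python
import itertools
import random

NUM_CORES = 12   # total cores on the node (assumed for illustration)
SHARD_SIZE = 3   # cores per query type (~25% of the node)
MAX_OVERLAP = 1  # no two query types may share more than one core

def build_shards(num_query_types: int, seed: int = 42) -> list[set[int]]:
    """Pick a random core subset per query type, keeping pairwise overlap low."""
    rng = random.Random(seed)
    candidates = list(itertools.combinations(range(NUM_CORES), SHARD_SIZE))
    rng.shuffle(candidates)
    shards: list[set[int]] = []
    for combo in candidates:
        shard = set(combo)
        # Accept the candidate only if it overlaps every existing shard in at most one core.
        if all(len(shard & existing) <= MAX_OVERLAP for existing in shards):
            shards.append(shard)
        if len(shards) == num_query_types:
            return shards
    raise ValueError("Not enough low-overlap shards; relax the constraints.")

if __name__ == "__main__":
    for i, shard in enumerate(build_shards(num_query_types=8)):
        print(f"query type {i}: cores {sorted(shard)}")
```

A greedy rejection scheme like this is the simplest way to get the "at most one shared core" property; production systems typically derive the shard deterministically from a client identifier instead of a random seed so that assignments stay stable across restarts.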
Site Reliability Engineering (SRE) is a systematic and data-driven approach to improving the reliability, scalability, and efficiency of systems. It combines principles of software engineering, operations, and quality assurance to ensure that systems meet performance goals and business objectives. This article discusses the key elements of SRE, including reliability goals and objectives, reliability testing, workload modeling, chaos engineering, and infrastructure readiness testing. The importance of SRE in improving user experience, system efficiency, scalability, and reliability, and achieving better business outcomes is also discussed. Site Reliability Engineering (SRE) is an emerging field that seeks to address the challenge of delivering high-quality, highly available systems. It combines the principles of software engineering, operations, and quality assurance to ensure that systems meet performance goals and business objectives. SRE is a proactive and systematic approach to reliability optimization characterized by the use of data-driven models, continuous monitoring, and a focus on continuous improvement. SRE is a combination of software engineering and IT operations, combining the principles of DevOps with a focus on reliability. The goal of SRE is to automate repetitive tasks and to prioritize availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. The benefits of adopting SRE include increased reliability, faster resolution of incidents, reduced mean time to recovery, improved efficiency through automation, and increased collaboration between development and operations teams. In addition, organizations that adopt SRE principles can improve their overall system performance, increase the speed of innovation, and better meet the needs of their customers. SRE 5 Why's 1. Why Is SRE Important for Organizations? SRE is important for organizations because it ensures high availability, performance, and scalability of complex systems, leading to improved user experience and better business outcomes. 2. Why Is SRE Necessary in Today's Technology Landscape? SRE is necessary for today's technology landscape because systems and infrastructure have become increasingly complex and prone to failures, and organizations need a reliable and efficient approach to manage these systems. 3. Why Does SRE Involve Combining Software Engineering and Systems Administration? SRE involves combining software engineering and systems administration because both disciplines bring unique skills and expertise to the table. Software engineers have a deep understanding of how to design and build scalable and reliable systems, while systems administrators have a deep understanding of how to operate and manage these systems in production. 4. Why Is Infrastructure Readiness Testing a Critical Component of SRE? Infrastructure Readiness Testing is a critical component of SRE because it ensures that the infrastructure is prepared to support the desired system reliability goals. By testing the capacity and resilience of infrastructure before it is put into production, organizations can avoid critical failures and improve overall system performance. 5. Why Is Chaos Engineering an Important Aspect of SRE? Chaos Engineering is an important aspect of SRE because it tests the system's ability to handle and recover from failures in real-world conditions. 
By proactively identifying and fixing weaknesses, organizations can improve the resilience and reliability of their systems, reducing downtime and increasing confidence in their ability to respond to failures.

Key Elements of SRE

- Reliability Metrics, Goals, and Objectives: Defining the desired reliability characteristics of the system and setting reliability targets.
- Reliability Testing: Using reliability testing techniques to measure and evaluate system reliability, including disaster recovery testing, availability testing, and fault tolerance testing.
- Workload Modeling: Creating mathematical models to represent system reliability, including Little's Law and capacity planning.
- Chaos Engineering: Intentionally introducing controlled failures and disruptions into production systems to test their ability to recover and maintain reliability.
- Infrastructure Readiness Testing: Evaluating the readiness of an infrastructure to support the desired reliability goals of a system.

Reliability Metrics in SRE

Reliability metrics are used in SRE to measure the quality and stability of systems, as well as to guide continuous improvement efforts.

- Availability: This metric measures the proportion of time a system is available and functioning correctly. It is often expressed as a percentage and calculated as the total uptime divided by the total time the system is expected to be running.
- Response Time: This measures the time it takes for the infrastructure to respond to a user request.
- Throughput: This measures the number of requests that can be processed in a given time period.
- Resource Utilization: This measures the utilization of the infrastructure's resources, such as CPU, memory, network, heap, caching, and storage.
- Error Rate: This measures the number of errors or failures that occur during the testing process.
- Mean Time to Recovery (MTTR): This metric measures the average time it takes to recover from a system failure or disruption, which provides insight into how quickly the system can be restored after a failure occurs.
- Mean Time Between Failures (MTBF): This metric measures the average time between failures for a system. MTBF helps organizations understand how reliable a system is over time and can inform decision-making about when to perform maintenance or upgrades.

Reliability Testing in SRE

- Performance Testing: This involves evaluating the response time, processing time, and resource utilization of the infrastructure to identify any performance issues under a business-as-usual (BAU) 1X load scenario.
- Load Testing: This technique involves simulating real-world user traffic and measuring the performance of the infrastructure under a heavy 2X load.
- Stress Testing: This technique involves applying more load than the expected maximum to test the infrastructure's ability to handle unexpected traffic spikes (3X load).
- Chaos or Resilience Testing: This involves simulating different types of failures (e.g., network outages, hardware failures) to evaluate the infrastructure's ability to recover and continue operating.
- Security Testing: This involves evaluating the infrastructure's security posture and identifying any potential vulnerabilities or risks.
- Capacity Planning: This involves evaluating the current and future hardware, network, and storage requirements of the infrastructure to ensure it has the capacity to meet growing demand.

Workload Modeling in SRE

Workload modeling is a crucial part of SRE, which involves creating mathematical models to represent the expected behavior of systems.
Little's Law is a key principle in this area. It states that the average number of items in a system (W) equals the average arrival rate (λ) multiplied by the average time each item spends in the system (T): W = λ × T. This formula can be used to determine the expected number of requests a system can handle under different conditions.

Example: Consider a system that receives an average of 200 requests per minute, with an average response time of 2 seconds. To apply Little's Law, the units must be consistent, so first convert the arrival rate to requests per second:

W = λ × T = (200 requests/minute ÷ 60 seconds/minute) × 2 seconds ≈ 6.7 requests

This result indicates that, on average, about 7 requests are in flight in the system at any given moment. If the system can serve fewer concurrent requests than that, work starts to queue, latency grows, and reliability degrades. By using the right workload modeling, organizations can determine the maximum workload that their systems can handle and take proactive steps to scale their infrastructure and improve reliability, and it allows them to identify potential issues and design solutions to improve system performance before they become real problems.

Tools and techniques used for modeling and simulation:

- Performance Profiling: This technique involves monitoring the performance of an existing system under normal and peak loads to identify bottlenecks and determine the system's capacity limits.
- Load Testing: This is the process of simulating real-world user traffic to test the performance and stability of an IT system. Load testing helps organizations identify performance issues and ensure that the system can handle expected workloads.
- Traffic Modeling: This involves creating a mathematical model of the expected traffic patterns on a system. The model can be used to predict resource utilization and system behavior under different workload scenarios.
- Resource Utilization Modeling: This involves creating a mathematical model of the expected resource utilization of a system. The model can be used to predict resource utilization and system behavior under different workload scenarios.
- Capacity Planning Tools: There are various tools available that automate the process of capacity planning, including spreadsheet tools, predictive analytics tools, and cloud-based tools.
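To make the Little's Law arithmetic above easy to reproduce (and to avoid the unit mix-ups that often creep into these calculations), here is a small, hedged Python helper; the numbers mirror the example in the text but are otherwise illustrative.

```python
def littles_law_in_flight(arrival_rate_per_min: float, response_time_sec: float) -> float:
    """Average number of requests in the system: W = lambda * T, with consistent units."""
    arrivals_per_sec = arrival_rate_per_min / 60.0  # convert to requests/second
    return arrivals_per_sec * response_time_sec

if __name__ == "__main__":
    # The example from the text: 200 requests/minute with a 2-second response time.
    w = littles_law_in_flight(arrival_rate_per_min=200, response_time_sec=2)
    print(f"~{w:.1f} requests in flight on average")  # ~6.7
```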
Chaos Engineering and Infrastructure Readiness in SRE

Chaos engineering and infrastructure readiness are important components of a successful SRE strategy. Both involve intentionally inducing failures and stress into systems to assess their strength and identify weaknesses. Infrastructure readiness testing is done to verify the system's ability to handle failure scenarios, while chaos engineering tests the system's recovery and reliability under adverse conditions. The benefits of chaos engineering include improved system reliability, reduced downtime, and increased confidence in the system's ability to handle real-world failures. By proactively identifying and fixing weaknesses, organizations can avoid costly downtime, improve customer experience, and reduce the risk of data loss or security breaches. Integrating chaos engineering into DevOps practices (CI/CD) ensures systems are thoroughly tested and validated before deployment.

Methods of chaos engineering typically involve running experiments or simulations on a system to stress and test its various components, identify any weaknesses or bottlenecks, and assess its overall reliability. This is done by introducing controlled failures, such as network partitions, simulated resource exhaustion, or random process crashes, and observing the system's behavior and response.

Example Scenarios for Chaos Testing

- Random Instance Termination: Selecting and terminating an instance from a cluster to test the system's response to the failure.
- Network Partition: Partitioning the network between instances to simulate a network failure and assess the system's ability to recover.
- Increased Load: Increasing the load on the system to test its response to stress and observing any performance degradation or resource exhaustion.
- Configuration Change: Altering a configuration parameter to observe the system's response, including any unexpected behavior or errors.
- Database Failure: Simulating a database failure by shutting it down and observing the system's reaction, including any errors or unexpected behavior.

By conducting both chaos experiments and infrastructure readiness testing, organizations can deepen their understanding of system behavior and improve their resilience and reliability.
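As a concrete illustration of the first scenario above (random instance termination), here is a deliberately small, hedged Python sketch. It assumes you have kubectl access to a non-production cluster and that the pods are managed by a controller (such as a Deployment) that will recreate them; it is a toy experiment, not a full chaos-engineering tool.

```python
import random
import subprocess

def terminate_random_pod(namespace: str = "default") -> None:
    """Toy chaos experiment: delete one randomly chosen pod and let the
    controller restore it. Run only against a cluster you are allowed to break."""
    pods = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "name"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    if not pods:
        print("No pods found; nothing to terminate.")
        return
    victim = random.choice(pods)
    print(f"Terminating {victim} ...")
    subprocess.run(["kubectl", "delete", "-n", namespace, victim], check=True)

if __name__ == "__main__":
    terminate_random_pod()
```

Watching how quickly the replacement pod becomes ready, and whether clients noticed, is the actual experiment; the deletion itself is the easy part.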
Conclusion

In conclusion, SRE is a critical discipline for organizations that want to deliver highly reliable, highly available systems. By adopting SRE principles and practices, organizations can improve system reliability, reduce downtime, and improve the overall user experience.

Connectivity is all around us. By now, we are all used to instant connectivity that puts the world at our fingertips. We can purchase, post, and pick anything, anywhere, with the aid of desktops and devices. But how does it happen? How do different applications on different devices connect with each other, allowing us to place an order, plan a vacation, or make a reservation with just a few clicks? The answer is the API (Application Programming Interface), the often underrated, unsung hero of the modern world.

What Is an API?

APIs are the building blocks of online connectivity. They are a medium through which multiple applications, data sources, and devices interact with each other. Simply put, an API is a messenger that takes a request, tells the system what you want to do, and then returns the response to the user. Documentation is drafted for every API, including specifications for how information gets transferred between the two systems.

Why Is an API Important?

APIs can interact with third-party applications publicly, ultimately upscaling the reach of an organization's business. So, when we book a ticket via Bookmyshow.com, we fill in details regarding the movie we plan to watch, like:

- Movie name
- Locality
- 3D/2D
- Language

These details are fetched by the API and taken to the servers associated with different movie theatres, which return a collected response from multiple third-party servers, giving users the convenience of choosing which theatre fits best. This is how different applications interact with each other.

Instead of building one large application and adding ever more functionality to its code, the present time demands a microservice architecture, wherein we create multiple individually focused modules with well-defined interfaces and then combine them to make a scalable, testable product. A product or piece of software that might have taken a year to deliver can now be delivered in weeks with the help of microservice architecture, and APIs are a necessity for microservice architecture. Consider an application that delivers music, shopping, and bill payment services to end users under a single hood. The user needs to log into the app and select the service for consumption. An API is needed to make the different services of such an application work together, contributing to an overall enhanced UX.

An API also adds an extra layer of security to the data: the user's data is not overexposed to the server, nor is the server's data overexposed to the user. Say, in the case of movies, the API tells the server what the user would like to watch and then tells the user what they have to give to redeem the service. Ultimately, you get to watch your movie, and the service provider is credited accordingly.

API Performance Monitoring and Application Performance Monitoring Differences

As similar as these two terms sound, they perform distinct checks on the overall application connectivity.

Application performance monitoring is necessary for high-level analytics regarding how well the app is executing internally. It facilitates a check on the internal connectivity of the software. The following are key data factors that must be monitored:

- Server loads
- User adoption
- Market share
- Downloads
- Latency
- Error logging

API performance monitoring is required to check whether there are any bottlenecks outside the server; they could be in the cloud or in a load-balancing service.
These bottlenecks are not dependent on your application performance monitoring but are still considered catastrophic, as they may disrupt the service for end users. API performance monitoring facilitates a check on the external connectivity of the software, aiding its core functionalities:

- Back-end business operations
- Alert operations
- Web services

Why Is API Performance Monitoring a Necessity?

1. Functionality: With the emergence of modern agile practices, organizations are adopting a virtuous cycle of developing, testing, delivering, and maintaining by monitoring the response. It is integral to involve API monitoring as part of this practice. A script must be maintained against the appropriate and latest versions of the functional tests to ensure a flawless experience of services for the end user. Simply put, if your API goes south, your app goes with it. For instance, in January 2016, the Twitter API suffered a worldwide outage. The outage lasted more than an hour, and within that period, it impacted thousands of websites and applications.

2. Performance: Organizations leave themselves open to performance problems if they neglect to thoroughly understand the process behind every API call. API monitoring also helps identify which APIs are performing better and how to improve the APIs with weaker performance.

3. Speed/Responsiveness: Users can specify the critical API calls in the performance monitoring tool and set their threshold (acceptable response time) to ensure they get alerted if the response time deteriorates.

4. Availability: With the help of monitoring, we can verify whether all the services hosted by our applications are accessible 24×7.

Why Monitor an API When We Can Test It?

An API test can be highly complex, considering the large number of steps involved. This makes it difficult to run such tests as frequently as needed. This is where monitoring steps in, allowing regular (for example, hourly) checks of the indispensable aspects and helping us focus on what's most vital to our organization.

How To Monitor API Performance

- Identify the APIs you depend on: Recognize which APIs you employ, whether they are third-party or partner APIs, and whether they connect internally or externally.
- Comprehend the functional and transactional use cases to provide transparency into the services being hosted; this improves performance and MTTR (Mean Time to Repair).
- Determine whether you have the test cases required for monitoring: do existing test cases need to be altered, or do new ones urgently need to be developed?
- Know the right tool: API performance monitoring is highly dependent on the tool being used. You need an intuitive, user-friendly, result-optimizing tool with everything packed in. Some commonly known platforms for API performance testing are CA Technologies (now Broadcom Inc.), AlertSite, Rigor, and Runscope.

One more factor to keep note of is API browser compatibility, i.e., how well your API works across different browsers. To know more about this topic, follow our blog about "API and Browser Compatibility."

Conclusion

API performance monitoring is a need of modern times that gives you a check on the internal as well as the external impact of the services hosted by a product. Not everyone bothers to care about APIs, but we are glad you did! We hope this article helps expand your understanding of the topic. Cheers!
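Returning to the Speed/Responsiveness point above, here is a minimal, hedged Python sketch of the kind of threshold check a monitoring tool performs behind the scenes. The endpoint and threshold are placeholders, and a real monitor would run this on a schedule and send alerts rather than print.

```python
import time
import urllib.request

# Placeholder values for illustration only.
ENDPOINT = "https://api.example.com/health"
THRESHOLD_SECONDS = 0.5

def check_response_time(url: str) -> float:
    """Return the elapsed wall-clock time for a single GET request."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()
    return time.monotonic() - start

if __name__ == "__main__":
    elapsed = check_response_time(ENDPOINT)
    if elapsed > THRESHOLD_SECONDS:
        print(f"ALERT: {ENDPOINT} took {elapsed:.2f}s (threshold {THRESHOLD_SECONDS}s)")
    else:
        print(f"OK: {ENDPOINT} responded in {elapsed:.2f}s")
```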
Most engineers don't want to spend more time than necessary keeping their clusters highly available, secure, and cost-efficient. How do you make sure your Google Kubernetes Engine cluster is ready for the storms ahead? Here are fourteen optimization tactics divided into three core areas of your cluster. Use them to build a resource-efficient, highly available GKE cluster with airtight security. Here are the three core sections in this article:

- Resource Management
- Security
- Networking

Resource Management Tips for a GKE Cluster

1. Autoscaling

Use the autoscaling capabilities of Kubernetes to make sure your workloads perform well during peak load and to control costs in times of normal or low loads. Kubernetes gives you several autoscaling mechanisms. Here's a quick overview to get you up to speed:

- Horizontal pod autoscaler: HPA adds or removes pod replicas automatically based on utilization metrics. It works great for scaling stateless and stateful applications. Use it with Cluster Autoscaler to shrink the number of active nodes when the pod count decreases. HPA also comes in handy for handling workloads with short, high utilization spikes.
- Vertical pod autoscaler: VPA increases and lowers the CPU and memory resource requests of pod containers to make sure the allocated and actual cluster usage match. If your HPA configuration doesn't use CPU or memory to identify scaling targets, it's best to use it with VPA.
- Cluster autoscaler: It dynamically scales the number of nodes to match the current GKE cluster utilization. It works great with workloads designed to meet dynamically changing demand.

Best Practices for Autoscaling in a GKE Cluster

- Use HPA, VPA, and Node Auto Provisioning (NAP): By using HPA, VPA, and NAP together, you let GKE efficiently scale your cluster horizontally (pods) and vertically (nodes). VPA sets values for CPU and memory requests and limits for containers, while NAP manages node pools and eliminates the default limitation of starting new nodes only from the set of user-created node pools.
- Check if your HPA and VPA policies clash: Make sure the VPA and HPA policies don't interfere with each other. For example, if HPA relies only on CPU and memory metrics, HPA and VPA cannot work together. Also, review your bin packing density settings when designing a new GKE cluster for a business- or purpose-class tier of service.
- Use instance weighted scores: This allows you to determine how much of your chosen resource pool will be dedicated to a specific workload and ensure that your machine is best suited for the job.
- Slash costs with a mixed-instance strategy: Using mixed instances helps achieve high availability and performance at a reasonable cost. It's basically about choosing from various instance types, some of which may be cheaper and good enough for lower-throughput or low-latency workloads. Or you could run a smaller number of machines with higher specs. This can bring costs down because each node requires Kubernetes to be installed on it, which always adds a little overhead.

2. Choose the Topology for Your GKE Cluster

You can choose from two types of clusters:

- Regional topology: In a regional Kubernetes cluster, Google replicates the control plane and nodes across multiple zones in a single region.
- Zonal topology: In a zonal cluster, they both run in a single compute zone specified upon cluster creation.

If your application depends on the availability of the cluster API, pick a regional cluster topology, which offers higher availability for the cluster's control plane API.
Since it's the control plane that does jobs like scaling, replacing, and scheduling pods, if it becomes unavailable, you're in for reliability trouble. On the other hand, regional clusters have nodes spread across multiple zones, which may increase your cross-zone network traffic and, thus, costs.

3. Bin Pack Nodes for Maximum Utilization

This is a smart approach to GKE cost optimization shared by the engineering team at Delivery Hero. To maximize node utilization, it's best to add pods to nodes in a compacted way. This opens the door to reducing costs without any impact on performance. This strategy is called bin packing, and it goes against the Kubernetes default behavior, which favors an even distribution of pods across nodes (source: Delivery Hero). A simple sketch of the packing idea appears at the end of this resource-management section.

The team at Delivery Hero used GKE Autopilot, but its limitations made the engineers build bin packing on their own. To achieve the highest node utilization, the team defines one or more node pools in a way that allows nodes to include pods in the most compacted way (but leaving some buffer for the shared CPU). By merging node pools and performing bin packing, pods fit into nodes more efficiently, helping that team at Delivery Hero decrease its total number of nodes by ~60%.

4. Implement Cost Monitoring

Cost monitoring is a big part of resource management because it lets you keep an eye on your expenses and act instantly on cost spike alerts. To understand your Google Kubernetes Engine costs better, implement a monitoring solution that gathers data about your cluster's workload, total cost, costs divided by labels or namespaces, and overall performance. GKE usage metering enables you to monitor resource usage, map workloads, and estimate resource consumption. Enable it to quickly identify the most resource-intensive workloads or spikes in resource consumption. This step is the bare minimum you can do for cost monitoring. Tracking these three metrics is what really makes a difference in how you manage your cloud resources: daily cloud spend, cost per provisioned and requested CPU, and historical cost allocation.

5. Use Spot VMs

Spot VMs are an incredible cost-saving opportunity—you can get a discount of up to 91% off the pay-as-you-go pricing. The catch is that Google may reclaim the machine at any time, so you need to have a strategy in place to handle the interruption. That's why many teams use spot VMs for workloads that are fault- and interruption-tolerant, like batch processing jobs, distributed databases, CI/CD operations, or microservices.

Best Practices for Running Your GKE Cluster on Spot VMs

- How to choose the right spot VM: Pick a slightly less popular spot VM type—it's less likely to get interrupted. You can also check its frequency of interruption (the rate at which this instance reclaimed capacity within the trailing month).
- Set up spot VM groups: This increases your chances of snatching the machines you want. Managed instance groups can request multiple machine types at the same time, adding new spot VMs when extra resources become available.
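To make the bin-packing idea from tip 3 a bit more concrete, here is a small, hedged Python sketch of first-fit-decreasing packing. It only illustrates the scheduling concept (on GKE the real work is done through node pool sizing, pod resource requests, and scheduler configuration), and the pod requests and node capacity below are made-up numbers.

```python
# Toy illustration of bin packing pods onto nodes (first-fit decreasing).
# Capacities and requests are in millicores and purely illustrative.

NODE_CPU_M = 4000  # hypothetical allocatable CPU per node, minus a small buffer
POD_REQUESTS_M = [1500, 1200, 900, 800, 700, 600, 500, 400, 300, 250]

def first_fit_decreasing(requests: list[int], capacity: int) -> list[list[int]]:
    """Pack each pod into the first node with room, placing the largest pods first."""
    nodes: list[list[int]] = []
    for req in sorted(requests, reverse=True):
        for node in nodes:
            if sum(node) + req <= capacity:
                node.append(req)
                break
        else:
            nodes.append([req])  # no existing node fits; "provision" a new one
    return nodes

if __name__ == "__main__":
    packed = first_fit_decreasing(POD_REQUESTS_M, NODE_CPU_M)
    for i, node in enumerate(packed):
        print(f"node {i}: pods {node} -> {sum(node)}m / {NODE_CPU_M}m")
    print(f"{len(packed)} node(s) needed for {len(POD_REQUESTS_M)} pods")
```

The point of the sketch is simply that packing larger workloads first and filling nodes before opening new ones tends to need fewer nodes than spreading pods evenly, which is where the cost savings come from.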
Security Best Practices for GKE Clusters

The Red Hat 2022 State of Kubernetes and Container Security report found that almost 70% of incidents happen due to misconfigurations. GKE secures your Kubernetes cluster in many layers, including the container image, its runtime, the cluster network, and access to the cluster API server. Google generally recommends implementing a layered approach to GKE cluster security. The most important security aspects to focus on are:

- Authentication and authorization
- Control plane
- Node
- Network security

1. Follow CIS Benchmarks

All of the key security areas are part of the Center for Internet Security (CIS) Benchmarks, a globally recognized collection of best practices that gives you a helping hand in structuring security efforts. When you use a managed service like GKE, you don't have control over all the CIS Benchmark items. But some things are definitely within your control, like auditing, upgrading, and securing the cluster nodes and workloads. You can either go through the CIS Benchmarks manually or use a tool that does the benchmarking job for you. We recently introduced a container security module that scans your GKE cluster to check for any benchmark discrepancies and prioritizes issues to help you take action.

2. Implement RBAC

Role-Based Access Control (RBAC) is an essential component for managing access to your GKE cluster. It lets you establish more granular access to Kubernetes resources at the cluster and namespace levels and develop detailed permission policies. CIS GKE Benchmark 6.8.4 emphasizes that teams should give preference to RBAC over the legacy Attribute-Based Access Control (ABAC). Another CIS GKE Benchmark (6.8.3) suggests using groups for managing users. This makes controlling identities and permissions simpler and means you don't need to update the RBAC configuration whenever you add or remove users from a group.

3. Follow the Principle of Least Privilege

Make sure to grant user accounts only the privileges that are essential for them to do their jobs. Nothing more than that. CIS GKE Benchmark 6.2.1 states: prefer not running GKE clusters using the Compute Engine default service account. By default, nodes get access to the Compute Engine service account. This comes in handy for multiple applications but opens the door to more permissions than necessary to run your GKE cluster. Create and use a minimally privileged service account instead of the default—and follow the same principle everywhere else.

4. Boost Your Control Plane's Security

Google implements the Shared Responsibility Model to manage the GKE control plane components. Still, you're the one responsible for securing nodes, containers, and pods. The Kubernetes API server uses a public IP address by default. You can secure it with the help of authorized networks and private Kubernetes clusters that let you assign a private IP address. Another way to improve your control plane's security is to perform regular credential rotation. The TLS certificates and cluster certificate authority get rotated automatically when you initiate the process.

5. Protect Node Metadata

CIS GKE Benchmarks 6.4.1 and 6.4.2 point out two critical factors that may compromise your node security—and they fall on your plate. The v0.1 and v1beta1 Compute Engine metadata server endpoints were deprecated in 2020 because they didn't enforce metadata query headers. Some attacks against Kubernetes clusters rely on access to the metadata server of virtual machines, with the goal of extracting credentials. You can fight such attacks with workload identity or metadata concealment.

6. Upgrade GKE Regularly

Kubernetes often releases new security features and patches, so keeping your deployment up to date is a simple but powerful approach to improving your security posture. The good news about GKE is that it patches and upgrades the control plane automatically.
The node auto-upgrade feature also upgrades cluster nodes, and CIS GKE Benchmark 6.5.3 recommends that you keep this setting on. If you want to disable auto-upgrade for any reason, Google suggests performing upgrades on a monthly basis and following the GKE security bulletins for critical patches.

Networking Optimization Tips for Your GKE Cluster

1. Avoid Overlaps With IP Addresses From Other Environments

When designing a larger Kubernetes cluster, keep in mind to avoid overlaps with IP addresses used in your other environments. Such overlaps might cause issues with routing if you need to connect the cluster's VPC network to on-premises environments or other cloud service provider networks via Cloud VPN or Cloud Interconnect.

2. Use GKE Dataplane V2 and Network Policies

If you want to control traffic flow at OSI layer 3 or 4 (the IP address or port level), you should consider using network policies. Network policies allow you to specify how a pod can communicate with other network entities (pods, services, certain subnets, etc.). To bring your cluster networking to the next level, GKE Dataplane V2 is the right choice. It's based on eBPF and provides an extended, integrated network security and visibility experience. Adding to that, if the cluster uses GKE Dataplane V2, you don't need to enable network policies explicitly, as Dataplane V2 manages service routing, network policy enforcement, and logging.

3. Use Cloud DNS for GKE

Pod and Service DNS resolution can be done without the additional overhead of managing a cluster-hosted DNS provider. Cloud DNS for GKE requires no additional monitoring, scaling, or other management activities, as it's a fully hosted Google service.

Conclusion

In this article, you have learned how to optimize your GKE cluster with fourteen tactics across resource management, security, and networking for high availability and optimal cost. Hopefully, you have taken away some helpful information that will help you in your career as a developer.
Kubernetes is an open-source container orchestration platform that helps manage and deploy applications in a cloud environment. It is used to automate the deployment, scaling, and management of containerized applications. Kubernetes probes are an efficient way to manage application health on this platform. This article will discuss Kubernetes probes, the different types available, and how to implement them in your Kubernetes environment.

What Are Kubernetes Probes?

Kubernetes probes are health checks that are used to monitor the health of applications and services in a Kubernetes cluster. They are used to detect any potential problems with applications or services and identify potential resource bottlenecks. Probes are configured to run at regular intervals and send a signal to the Kubernetes control plane if they detect any issues with the application or service. Kubernetes probes are typically implemented using the Kubernetes API, which allows them to query the application or service for information. This information can then be used to determine the application's or service's health. Kubernetes probes can also be used to detect changes in the application or service and send a notification to the Kubernetes control plane, which can then take corrective action.

Kubernetes probes are an important part of the Kubernetes platform, as they help ensure applications and services run smoothly. They can be used to detect potential problems before they become serious, allowing you to take corrective action quickly.

A successful readiness probe indicates the container is ready to receive traffic. If a readiness probe succeeds, the container is considered ready and can begin receiving requests from other containers, services, or external clients. A successful liveness probe indicates the container is still running and functioning properly. If a liveness probe succeeds, the container is considered alive and healthy; if it fails, the container is considered to be in a failed state, and Kubernetes will attempt to restart the container to restore its functionality.

Both readiness and liveness probes report success when an HTTP check returns a response code in the 200-399 range or when a TCP socket connection succeeds. If the probe fails (an HTTP response code outside that range, or a failed TCP connection), the container is considered not ready or not alive, depending on the probe type.
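To make the HTTP success criterion concrete, here is a hedged sketch of the kind of health endpoint an HTTP probe might call. The tutorial later in this article uses a Node.js app; this standalone Python version is only an illustration, and /healthz is a conventional (not mandatory) path.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Any status in the 200-399 range counts as a probe success.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            # Anything outside 200-399 (for example, 500) makes the probe fail.
            self.send_response(500)
            self.end_headers()

if __name__ == "__main__":
    # A liveness or readiness probe would then be pointed at http://<pod-ip>:3000/healthz
    HTTPServer(("", 3000), HealthHandler).serve_forever()
```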
Types of Kubernetes Probes

There are three types of probes:

- Startup probes
- Readiness probes
- Liveness probes

1. Startup Probes

A startup probe is used to determine if a container has started successfully. This type of probe is typically used for applications that take longer to start up, or for containers that perform initialization tasks before they become ready to receive traffic. The startup probe is run only once, after the container has been created, and it delays the start of the readiness and liveness probes until it succeeds. If the startup probe fails, the container is considered to have failed to start, and Kubernetes will attempt to restart the container.

2. Readiness Probes

A readiness probe is used to determine if a container is ready to receive traffic. This type of probe is used to ensure a container is fully up and running and can accept incoming connections before it is added to the service load balancer. A readiness probe can be used to check the availability of an application's dependencies or perform any other check that indicates the container is ready to serve traffic. If the readiness probe fails, the container is removed from the service load balancer until the probe succeeds again.

3. Liveness Probes

A liveness probe is used to determine if a container is still running and functioning properly. This type of probe is used to detect and recover from container crashes or hang-ups. A liveness probe can be used to check the responsiveness of an application or perform any other check that indicates the container is still alive and healthy. If the liveness probe fails, Kubernetes will attempt to restart the container to restore its functionality.

Each type of probe has its own configuration options, such as the endpoint to check, the probe interval, and the success and failure thresholds. By using these probes, Kubernetes can ensure containers are running and healthy and can take appropriate action if a container fails to respond.

How To Implement Kubernetes Probes

Kubernetes probes can be implemented in a few different ways:

- The first way is to use the Kubernetes API to query the application or service for information. This information can then be used to determine the application's or service's health.
- The second way is to use the HTTP protocol to send a request to the application or service. This request can be used to detect if an application or service is responsive, or if it is taking too long to respond.
- The third way is to use custom probes to detect specific conditions in an application or service. Custom probes can be used to detect things such as resource usage, slow responses, or changes in the application or service.

Once you have decided which type of probe you will be using, you can configure the probe using the Kubernetes API. You can specify the frequency of the probe, the type of probe, and the parameters of the probe. Once the probe is configured, you can deploy it to the Kubernetes cluster.

Today, I'll show how to configure health checks for your application deployed on Kubernetes, using the HTTP protocol to check whether the application is ready, live, and starting as per our requirements.

Prerequisites

- A Kubernetes cluster from any cloud provider. You can even use Minikube or Kind to create a single-node cluster.
- Docker Desktop to containerize the application.
- Docker Hub to push the container image to the Docker registry.
- Node.js installed, as we will use a sample Node.js application.

Tutorial

Fork the sample application here. Get into the main application folder with the command:

```shell
cd Kubernetes-Probes-Tutorial
```

Install the dependencies with the command:

```shell
npm install
```

Run the application locally using the command:

```shell
node app.js
```

You should see the application running on port 3000. In the application folder, you should see the Dockerfile with the following content:

```dockerfile
# Use an existing node image as base image
FROM node:14-alpine

# Set the working directory in the container
WORKDIR /app

# Copy package.json and package-lock.json to the container
COPY package*.json ./

# Install required packages
RUN npm install

# Copy all files to the container
COPY . .

# Expose port 3000
EXPOSE 3000

# Start the application
CMD [ "npm", "start" ]
```

This Dockerfile is used to create a container image of our application and push it to Docker Hub.
Next, build and push your image to Docker Hub using the following command:

```shell
docker buildx build --platform=linux/arm64 --platform=linux/amd64 -t docker.io/<Docker Hub username>/<image name>:<tag> --push -f ./Dockerfile .
```

You can see the pushed image on your Docker Hub account under repositories.

Next, deploy the manifest files. In the application folder, you will notice a deployment.yaml file with health checks/probes included, such as readiness and liveness probes. Note: we have used our pushed image name in the YAML file:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notes-app-deployment
  labels:
    app: note-sample-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: note-sample-app
  template:
    metadata:
      labels:
        app: note-sample-app
    spec:
      containers:
        - name: note-sample-app-container
          image: pavansa/note-sample-app
          resources:
            requests:
              cpu: "100m"
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: /
              port: 3000
          livenessProbe:
            httpGet:
              path: /
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
```

You can see the image used and the health checks configured in the above YAML file. We are all set with our YAML file. Assuming you have a running cluster ready, let's deploy the above-mentioned manifest file with the command:

```shell
kubectl apply -f deployment.yaml
```

You should see the successful deployment of the file: "deployment.apps/notes-app-deployment created." Let's check the pod status with the following command to make sure the pods are running:

```shell
kubectl get pods
```

Let's describe a pod using the following command:

```shell
kubectl describe pod notes-app-deployment-7fb6f5d74b-hw5fn
```

You can see the Liveness and Readiness status when you describe the pods. Next, let's check the events section. You can see the different events, such as "scheduled," "pulled," "created," and "started." All the pod events were successful.

Conclusion

Kubernetes probes are an important part of the Kubernetes platform, as they help ensure applications and services run smoothly. They can be used to detect potential problems before they become serious, allowing you to take corrective action quickly. Kubernetes probes come in three types: startup probes, readiness probes, and liveness probes, along with custom checks that can be used to detect specific conditions in an application or service. Implementing Kubernetes probes is a straightforward process that can be done using the Kubernetes API. If you are looking for a way to ensure the health of your applications and services, Kubernetes probes are the way to go. So, make sure to implement Kubernetes probes in your Kubernetes environment today!
If you use Windows, you will want to monitor Windows Events. A recently contributed distribution of the OpenTelemetry (OTel) Collector makes it much easier to monitor Windows Events with OTel. You can utilize this receiver in conjunction with any OTel collector, including the upstream OpenTelemetry Collector. In this article, we will be using observIQ's distribution of the collector. Below are steps to get up and running quickly with the distribution. We will be shipping Windows Event logs to a popular backend: Google Cloud Ops. You can find out more on the GitHub page here.

What Signals Matter?

Windows Event logs record many different operating system processes, application activity, and account activity. Some relevant log types you will want to monitor include:

- Application status: This contains information about applications installed or running on the system. If an application crashes, these logs may contain an explanation for the crash.
- Security logs: These logs contain information about the system's audit and authentication processes. For example, if a user attempts to log into the system or use administrator privileges.
- System logs: These logs contain information about Windows-specific processes, such as driver activity.

All of the above categories can be gathered with the Windows Events receiver, so let's get started.

Before You Begin

If you don't already have an OpenTelemetry collector built with the latest Windows Events receiver installed, you'll need to do that first. The distribution of the OpenTelemetry Collector we're using today includes the Windows Events receiver (and many others) and can be installed with the one-line installer here.

For Linux

To install using the installation script, run:

```shell
sudo sh -c "$(curl -fsSlL https://github.com/observiq/observiq-otel-collector/releases/latest/download/install_unix.sh)" install_unix.sh
```

To install directly with the appropriate package manager, head here.

For Windows

To install the collector on Windows, run the PowerShell command below to install the MSI with no UI:

```powershell
msiexec /i "https://github.com/observIQ/observiq-otel-collector/releases/latest/download/observiq-otel-collector.msi" /quiet
```

Alternatively, for an interactive installation, download the latest MSI. After downloading the MSI, double-click the downloaded file to open the installation wizard and follow the instructions to configure and install the collector. For more installation information, see installing on Windows.

For macOS

To install using the installation script, run:

```shell
sudo sh -c "$(curl -fsSlL https://github.com/observiq/observiq-otel-collector/releases/latest/download/install_macos.sh)" install_macos.sh
```

For more installation guidance, see installing on macOS.

For Kubernetes

To deploy the collector on Kubernetes, further documentation can be found at our observiq-otel-collector-k8s repository.

Configuring the Windows Events Receiver

Now that the distribution is installed, let's navigate to your OpenTelemetry configuration file.
If you're using the observIQ Collector on Windows, you'll find it at the following location: C:\Program Files\observIQ OpenTelemetry Collector\config.yaml

Edit the configuration file to include the Windows Events receiver as shown below:

```yaml
receivers:
  windowseventlog:
    channel: application
```

Each event collected by the receiver is parsed into a structured body with fields like the following:

```json
{
  "channel": "Application",
  "computer": "computer name",
  "event_id": {
    "id": 10,
    "qualifiers": 0
  },
  "keywords": "[Classic]",
  "level": "Information",
  "message": "Test log",
  "opcode": "Info",
  "provider": {
    "event_source": "",
    "guid": "",
    "name": "otel"
  },
  "record_id": 12345,
  "system_time": "2022-04-15T15:28:08.898974100Z",
  "task": ""
}
```

Configuring the Log Fields

You can adjust the following fields in the configuration to control what types of logs you want to ship and how they are read:

- channel (required): The Windows event log channel to monitor.
- max_reads (default: 100): The maximum number of records read into memory before beginning a new batch.
- start_at (default: end): On first startup, where to start reading logs from the API. Options are beginning or end.
- poll_interval (default: 1s): The interval at which the channel is checked for new log entries. This check begins again after all new bodies have been read.
- attributes (default: {}): A map of key: value pairs to add to the entry's attributes.
- resource (default: {}): A map of key: value pairs to add to the entry's resource.
- operators (default: []): An array of operators. See below for more details.
- converter (default: {max_flush_count: 100, flush_interval: 100ms, worker_count: max(1, runtime.NumCPU()/4)}): A map of key: value pairs to configure the converter that turns log entries (entry.Entry) into OpenTelemetry log records (pdata.LogRecord).

Operators

Each operator performs a simple responsibility, such as parsing a timestamp or JSON. Chain operators together to process logs into the desired format:

- Every operator has a type.
- Every operator can be given a unique id. If you use the same type of operator more than once in a pipeline, you must specify an id. Otherwise, the id defaults to the value of type.
- Operators output to the next operator in the pipeline. The last operator in the pipeline emits from the receiver. Optionally, the output parameter can be used to specify the id of another operator to which logs will be passed directly.
- Only parsers and general-purpose operators should be used.

Conclusion

As you can see, this distribution makes it much simpler to work with the OpenTelemetry Collector—with a single-line installer, integrated receivers, exporters, and a processor pool—and will help you implement OpenTelemetry standards wherever they are needed in your systems.
Apache Flink monitoring support is now available in the open-source OpenTelemetry collector. You can check out the OpenTelemetry repo here! You can utilize this receiver in conjunction with any OTel collector, including the OpenTelemetry Collector and other distributions of the collector. Today we'll use observIQ’s OpenTelemetry distribution and ship Apache Flink telemetry to a popular backend: Google Cloud Ops. You can find out more on the GitHub page: https://github.com/observIQ/observiq-otel-collector

What Signals Matter?

Apache Flink is an open-source, unified batch-processing and stream-processing framework. The Apache Flink receiver records 29 unique metrics, so there is a lot of data to pay attention to. Some specific metrics that users find valuable are:

- Uptime and restarts: Two different metrics that record the duration a job has run uninterrupted and the number of full restarts a job has committed, respectively.
- Checkpoints: A number of checkpoint metrics can tell you the number of active checkpoints, the number of completed and failed checkpoints, and the duration of ongoing and past checkpoints.
- Memory usage: Memory-related metrics are often relevant to monitor. The Apache Flink receiver ships metrics that can tell you about total memory usage, both present and over time, minimums and maximums, and how memory is divided between different processes.

All of the above categories can be gathered with the Apache Flink receiver, so let’s get started.

Before You Begin

If you don’t already have an OpenTelemetry collector built with the latest Apache Flink receiver installed, you’ll need to do that first. The collector distribution we're using includes the Apache Flink receiver (and many others) and is simple to install with a one-line installer.

Configuring the Apache Flink Receiver

Navigate to your OpenTelemetry configuration file. If you’re following along, you’ll find it in the following location:

/opt/observiq-otel-collector/config.yaml (Linux)

For the Collector, edit the configuration file to include the Apache Flink receiver as shown below:

```yaml
receivers:
  flinkmetrics:
    endpoint: http://localhost:8081
    collection_interval: 10s

processors:
  # resourcedetection is used to add a unique host.name
  # to the metric resource(s); nop is a placeholder here
  nop:

exporters:
  # Add the exporter for your preferred destination(s)
  nop:

service:
  pipelines:
    metrics:
      receivers: [flinkmetrics]
      processors: [nop]
      exporters: [nop]
```

If you’re using the Google Ops Agent instead, you can find the relevant config file here.

Viewing the Metrics Collected

If you followed the steps detailed above, the following Apache Flink metrics will now be delivered to your preferred destination.

| Metric | Description |
|---|---|
| flink.jvm.cpu.load | The CPU usage of the JVM for a jobmanager or taskmanager. |
| flink.jvm.cpu.time | The CPU time used by the JVM for a jobmanager or taskmanager. |
| flink.jvm.memory.heap.used | The amount of heap memory currently used. |
| flink.jvm.memory.heap.committed | The amount of heap memory guaranteed to be available to the JVM. |
| flink.jvm.memory.heap.max | The maximum amount of heap memory that can be used for memory management. |
| flink.jvm.memory.nonheap.used | The amount of non-heap memory currently used. |
| flink.jvm.memory.nonheap.committed | The amount of non-heap memory guaranteed to be available to the JVM. |
| flink.jvm.memory.nonheap.max | The maximum amount of non-heap memory that can be used for memory management. |
| flink.jvm.memory.metaspace.used | The amount of memory currently used in the Metaspace memory pool. |
| flink.jvm.memory.metaspace.committed | The amount of memory guaranteed to be available to the JVM in the Metaspace memory pool. |
| flink.jvm.memory.metaspace.max | The maximum amount of memory that can be used in the Metaspace memory pool. |
| flink.jvm.memory.direct.used | The amount of memory used by the JVM for the direct buffer pool. |
| flink.jvm.memory.direct.total_capacity | The total capacity of all buffers in the direct buffer pool. |
| flink.jvm.memory.mapped.used | The amount of memory used by the JVM for the mapped buffer pool. |
| flink.jvm.memory.mapped.total_capacity | The number of buffers in the mapped buffer pool. |
| flink.memory.managed.used | The amount of managed memory currently used. |
| flink.memory.managed.total | The total amount of managed memory. |
| flink.jvm.threads.count | The total number of live threads. |
| flink.jvm.gc.collections.count | The total number of collections that have occurred. |
| flink.jvm.gc.collections.time | The total time spent performing garbage collection. |
| flink.jvm.class_loader.classes_loaded | The total number of classes loaded since the start of the JVM. |
| flink.job.restart.count | The total number of restarts since this job was submitted, including full restarts and fine-grained restarts. |
| flink.job.last_checkpoint.time | The end-to-end duration of the last checkpoint. |
| flink.job.last_checkpoint.size | The total size of the last checkpoint. |
| flink.job.checkpoint.count | The number of checkpoints completed or failed. |
| flink.job.checkpoint.in_progress | The number of checkpoints in progress. |
| flink.task.record.count | The number of records a task has. |
| flink.operator.record.count | The number of records an operator has. |
| flink.operator.watermark.output | The last watermark this operator has emitted. |

This OpenTelemetry collector can help companies looking to implement OpenTelemetry standards.
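If Google Cloud Ops is the destination, as in this walkthrough, the nop placeholders shown earlier can be swapped for concrete components. The resourcedetection processor and googlecloud exporter below are illustrative assumptions; use whatever processors and exporter your environment and backend require.

```yaml
receivers:
  flinkmetrics:
    endpoint: http://localhost:8081
    collection_interval: 10s

processors:
  # Adds resource attributes such as host.name detected from the system;
  # shown as an example of the kind of processor you might use here.
  resourcedetection:
    detectors: [system]

exporters:
  # Assumes Google Cloud credentials are available on the host.
  googlecloud:

service:
  pipelines:
    metrics:
      receivers: [flinkmetrics]
      processors: [resourcedetection]
      exporters: [googlecloud]
```

Once the collector restarts with this configuration, the metrics listed in the table above should begin to appear in your backend.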
The concept of observability involves understanding a system’s internal states through the examination of logs, metrics, and traces. This approach provides a comprehensive system view, allowing for thorough investigation and analysis. While incorporating observability into a system may seem daunting, the benefits are significant. One well-known example is PhonePe, which experienced 2000% growth in its data infrastructure and a 65% reduction in data management costs after implementing a data observability solution. This helped mitigate performance issues and minimize downtime. The impact of Observability-Driven Development (ODD) is not limited to PhonePe. Numerous organizations have experienced the benefits of ODD, with a 2.1 times higher likelihood of issue detection and a 69% improvement in mean time to resolution.

What Is ODD?

Observability-Driven Development (ODD) is an approach that shifts observability left, to the earliest stage of the software development life cycle. It uses trace-based testing as a core part of the development process. In ODD, developers write code while declaring the desired outputs and the specifications needed to view the system’s internal state and processes. It applies both at the component level and to the system as a whole. ODD also serves to standardize instrumentation across programming languages, frameworks, SDKs, and APIs.

What Is TDD?

Test-Driven Development (TDD) is a widely adopted software development methodology that emphasizes writing automated tests prior to writing code. The TDD process involves defining the desired behavior of software through the creation of a test case, running the test to confirm its failure, writing the minimum necessary code to make the test pass, and refining the code through refactoring. This cycle is repeated for each new feature or requirement, and the resulting tests serve as a safeguard against future regressions. The philosophy behind TDD is that writing tests compels developers to consider the problem at hand and produce focused, well-structured code. Adherence to TDD improves software quality and requirement compliance and facilitates the early detection and correction of bugs. TDD is recognized as an effective method for enhancing the quality, reliability, and maintainability of software systems.

Comparison of Observability-Driven and Test-Driven Development

Similarities

Observability-Driven Development (ODD) and Test-Driven Development (TDD) both strive to enhance the quality and reliability of software systems. Both methodologies aim to ensure that software operates as intended, minimizing downtime and user-facing issues while promoting a commitment to continuous improvement and monitoring.

Differences

- Focus: The focus of ODD is to continuously monitor the behavior of software systems and their components in real time to identify potential issues and understand system behavior under different conditions. TDD, on the other hand, prioritizes detecting and correcting bugs before they cause harm to the system or users and verifies that software functionality meets requirements.
- Time and resource allocation: Implementing ODD requires a substantial investment of time and resources for setting up monitoring and logging tools and infrastructure. TDD, in contrast, demands a significant investment of time and resources during the development phase for writing and executing tests.
- Impact on software quality: ODD can significantly impact software quality by providing real-time visibility into system behavior, enabling teams to detect and resolve issues before they escalate. TDD also has the potential to significantly impact software quality by detecting and fixing bugs before they reach production. However, if tests are not comprehensive, bugs may still evade detection, potentially affecting software quality.

Moving From TDD to ODD in Production

Moving from a Test-Driven Development (TDD) methodology to an Observability-Driven Development (ODD) approach is a significant change. For several years, TDD has been the established method for testing software before its release to production. While TDD provides consistency and accuracy through repeated tests, it cannot provide insight into the performance of the entire application or the customer experience in a real-world scenario. The tests conducted through TDD are isolated and do not guarantee the absence of errors in the live application. Furthermore, TDD relies on a consistent environment for conducting automated tests, which is not representative of real-world scenarios.

Observability, on the other hand, is an evolution of TDD that offers full-stack visibility into the infrastructure, application, and production environment. It identifies the root cause of issues affecting the user experience and product releases through telemetry data such as logs, traces, and metrics. This continuous monitoring and tracking helps predict how end users will perceive the application. Additionally, with observability it is possible to write and ship better code before it reaches source control, because observability is part of the team's set of tools, processes, and culture.

Best Practices for Implementing ODD

Here are some best practices for implementing Observability-Driven Development (ODD):

- Prioritize observability from the outset: Start incorporating observability considerations in the development process right from the beginning. This will help you identify potential issues early and make necessary changes in real time.
- Embrace an end-to-end approach: Ensure observability covers all aspects of the system, including the infrastructure, application, and end-user experience.
- Monitor and log everything: Gather data from all sources, including logs, traces, and metrics, to get a complete picture of the system’s behavior.
- Use automated tools: Utilize automated observability tools to monitor the system in real time and alert you to any anomalies.
- Collaborate with other teams: Work with teams such as DevOps, QA, and production to ensure observability is integrated into the development process.
- Continuously monitor and improve: Regularly monitor the system, analyze data, and make improvements as needed to ensure optimal performance.
- Embrace a culture of continuous improvement: Encourage the development team to embrace a culture of continuous improvement and to continuously monitor and improve the system.

Conclusion

Both Observability-Driven Development (ODD) and Test-Driven Development (TDD) play an important role in ensuring the quality and reliability of software systems. TDD focuses on detecting and fixing bugs before they can harm the system or its users, while ODD focuses on monitoring the behavior of the software system in real time to identify potential problems and understand its behavior in different scenarios. Did I miss anything important?
Let me know in the comments section below.
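As a concrete illustration of the trace-based testing idea behind ODD, here is a sketch of what a declarative trace test might look like. The file format and every field name in it are hypothetical and shown only to make the concept tangible; they are not the syntax of any specific tool.

```yaml
# Hypothetical trace-test specification; illustrative only, not a real tool's format.
trace_test:
  name: checkout-emits-payment-span
  trigger:
    http:
      method: POST
      url: http://localhost:8080/checkout   # assumed local service endpoint
  assertions:
    # The request should produce a successful, fast span from the payment service.
    - span: "payment-service: charge"
      expect:
        status: OK
        max_duration_ms: 250
    # Every span in the trace should carry the attributes the team has declared
    # as part of its instrumentation standard.
    - all_spans:
        required_attributes: [service.name, deployment.environment]
```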
It's midnight in the dim and cluttered office of The New York Times, currently serving as the "situation room." A powerful surge of traffic is inevitable. During every major election, the wave would crest and crash against our overwhelmed systems before receding, allowing us to assess the damage. We had been in the cloud for years, which helped some. Our main systems would scale, and our articles were always served, but integration points across backend services would eventually buckle and burst under the sustained pressure of insane traffic levels. However, this night in 2020 differed from similar election nights in 2014, 2016, and 2018. That's because this traffic surge was simulated, and an election wasn't happening.

Pushing to the Point of Failure

Simulation or not, this was prod, so the stakes were high. There was suppressed horror as J-Kidd, our system that brought ad targeting parameters to the front end, went down hard. It was as if all the ligaments had been ripped from the knees of the pass-first point guard for which it had been named. Ouch. I'm sorry, Jason; it was for the greater good. J-Kidd wasn't the only system that found its way to the disabled list. That was the point of the whole exercise: to push our systems until they failed. We succeeded. Or failed, depending on your point of view. The next day the team made adjustments. We decoupled systems, implemented failsafes, and returned to the court for game 2. As a result, the 2020 election was the first I can remember where the on-call engineers weren't on the edge of their seats, white-knuckling their keyboards... at least not for system reliability reasons.

Pre-Mortems and Chaos Engineering

We referred to that exercise as a "pre-mortem." Its conceptual roots can be traced back to the idea of chaos engineering introduced by site reliability engineers. For those unfamiliar, chaos engineering is a disciplined methodology for intentionally introducing points of failure within systems to better understand their thresholds and improve resilience. It was largely popularized by the success of Netflix's Simian Army, a suite of programs that would automatically introduce chaos by removing servers and regions and introducing other points of failure into production, all in the name of reliability and resiliency. While this idea isn't completely foreign to data engineering, it can certainly be described as an extremely uncommon practice. No data engineer in their right mind has looked at their to-do list, the unfilled roles on their team, and the complexity of their pipelines, and then said: "This needs to be harder. Let's introduce some chaos." That may be part of the problem. Data teams need to think beyond providing snapshots of data quality to the business and start thinking about how to build and maintain reliable data systems at scale. We cannot afford to overlook data quality management, as it plays an increasingly large role in critical operations. For example, just this year we witnessed how deleting one file and an out-of-sync legacy database could ground more than 4,000 flights. Of course, you can't just copy and paste software engineering concepts straight into data engineering playbooks. Data is different. DataOps adapts DevOps methodology, just as data observability adapts observability. Consider this manifesto a proposal for taking the proven concepts of chaos engineering and applying them to the eccentric world of data reliability.
The 5 Laws of Data Chaos Engineering

The principles and lessons of chaos engineering are a good place to start defining the contours of a data chaos engineering discipline. Our first law combines two of the most important.

1. Have a Bias for Production, But Minimize the Blast Radius

There is a maxim among site reliability engineers that will ring true for every data engineer who has had the pleasure of the same SQL query returning two different results across staging and production environments: "Nothing acts like prod except for prod." To that, I would add, "production data too." Data is just too creative and fluid for humans to anticipate. Synthetic data has come a long way, and don't get me wrong, it can be a piece of the puzzle, but it's unlikely to simulate key edge cases. Like me, the mere thought of introducing points of failure into production systems probably makes your stomach churn. It's terrifying. Some data engineers justifiably wonder, "Is this even necessary within a modern data stack where so many tools abstract the underlying infrastructure?" I'm afraid so. Remember, as the opening anecdote and J-Kidd's snapped ligaments illustrated, the elasticity of the cloud is not a cure-all. In fact, it's that abstraction and opacity, along with the multiple integration points, that makes it so important to stress test a modern data stack. An on-premises database may be more limiting, but data teams tend to understand its thresholds because they hit them more regularly during day-to-day operations.

Let's move past the philosophical objections for the moment and dive into the practical. Data is different. Introducing fake data into a system won't be helpful because the input changes the output. It's going to get really messy, too. That's where the second part of the law comes into play: minimize the blast radius. There is a spectrum of chaos, and a range of tools that can be used:

- In words only: "Let's say this failed; what would we do?"
- Synthetic data in production.
- Techniques like data diff that allow you to test snippets of SQL code on production data.
- Solutions like LakeFS that allow you to do this on a bigger scale by creating "chaos branches," or complete snapshots of your production environment, where you can use production data with complete isolation.
- Do it in prod, and practice your backfilling skills. After all, nothing acts like prod but prod.

Starting with less chaotic scenarios is probably a good idea and will help you understand how to minimize the blast radius in production. Deep diving into real production incidents is also a great place to start. But does everyone really understand what exactly happened? Production incidents are chaos experiments that you've already paid for, so make sure you are getting the most out of them. Mitigating the blast radius may also include strategies like backing up applicable systems or having a data observability or data quality monitoring solution in place to assist with the detection and resolution of data incidents.

2. Understand It's Never a Perfect Time (Within Reason)

Another chaos engineering principle is to observe and understand "steady state behavior." There is wisdom in this principle, but it is also important to understand that the field of data engineering isn't quite ready to be measured by the standard of "five nines," or 99.999% uptime. Data systems are constantly in flux, and there is a wider range of "steady state behavior."
As a result, there will be a temptation to delay the introduction of chaos until you've reached the mythical point of "readiness." Unfortunately, you can't out-architect bad data; no one is ever ready for chaos. The Silicon Valley cliche of failing fast is applicable here. Or, to paraphrase Reid Hoffman, if you aren't embarrassed by the results of your first post-mortem, fire drill, or chaos-introducing event, you introduced it too late. Introducing fake data incidents while you are dealing with real ones may seem silly. Still, ultimately this can help you get ahead by better understanding where you have been putting bandaids on larger issues that may need to be refactored.

3. Formulate Hypotheses and Identify Variables at the System, Code, and Data Levels

Chaos engineering encourages forming hypotheses of how systems will react in order to understand what thresholds to monitor. It also encourages leveraging or mimicking past real-world incidents or likely incidents. We'll dive deeper into the details of this in the next article, but the important modification here is to ensure these span the system, code, and data levels. Variables at each level can create data incidents. Some quick examples:

- System: You didn't have the right permissions set in your data warehouse.
- Code: A bad LEFT JOIN.
- Data: A third party sent you garbage columns with a bunch of NULLs.

Simulating increased traffic levels and shutting down servers impact data systems, and those are important tests, but don't neglect some of the more unique and fun ways data systems can break badly.

4. Everyone in One Room (Or at Least on a Zoom Call)

This law is based on the experience of my colleague, site reliability engineer and chaos practitioner Tim Tischler. "Chaos engineering is just as much about people as it is about systems. They evolve together, and they can't be separated. Half of the value from these exercises comes from putting all the engineers in a room and asking, 'What happens if we do X or if we do Y?' You are guaranteed to get different answers. Once you simulate the event and see the result, everyone's mental maps are aligned. That is incredibly valuable," he said. Also, the interdependence of data systems and responsibilities creates blurry lines of ownership, even on the most well-run teams. As a result, breaks often happen, and are overlooked, in those overlaps and gaps in responsibility where the data engineer, analytics engineer, and data analyst point at each other. In many organizations, the product engineers creating the data and the data engineers managing it are separated and siloed by team structures. They also often have different tools and models of the same system and data. Feel free to pull these product engineers in as well, especially when the data has been generated from internally built systems. Good incident management and triage can often involve multiple teams, and having everyone in one room can make the exercise more productive. I'll also add from personal experience that these exercises can be fun (in the same weird way putting all your chips on red is fun). I'd encourage data teams to consider a data chaos engineering fire drill or pre-mortem event at the next offsite. It makes for a much more practical team bonding exercise than getting out of an escape room.

5. Hold Off on the Automation for Now

Truly mature chaos engineering programs like Netflix's Simian Army are automated and even unscheduled.
While this may create a more accurate simulation, the reality is that automated chaos tools don't currently exist for data engineering. Furthermore, if they did, I'm not sure I would be brave enough to use them. To this point, one of the original Netflix chaos engineers has described how they didn't always use automation, as the chaos could create more problems than they could fix (especially in collaboration with those running the system) in a reasonable period. Given where data engineering currently sits in its reliability evolution, and the greater potential for an unintentionally large blast radius, I would recommend data teams lean more toward scheduled, carefully managed events.

Practice as You Play

The important takeaway from the concept of chaos engineering is that practice and simulations are vital to performance and reliability. In my next article, I'll discuss specific things that can be broken at the system, code, and data levels and what teams may find out about those systems by pushing them to their limits.
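To show how the laws above might be written down in practice, here is a purely hypothetical sketch of a scheduled data chaos experiment definition. The format and every field name are invented for illustration; the point is simply that the hypothesis, the level (system, code, or data), and the blast radius are stated explicitly before anything is broken.

```yaml
# Hypothetical experiment definition; illustrative only, not a real tool's format.
experiment:
  name: warehouse-write-permission-revoked
  level: system                  # one of: system, code, data
  hypothesis: >
    If the loader role loses write access to the analytics schema, the
    pipeline alerts within 15 minutes and no downstream dashboard serves
    data more than one hour stale.
  blast_radius:
    environment: production
    isolation: chaos-branch      # e.g., an isolated snapshot of production data
  schedule: quarterly-fire-drill # scheduled and managed, not automated (law 5)
  rollback: restore the original grants and backfill any missed partitions
  participants: [data-engineering, analytics-engineering, sre]
```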
Joana Carvalho
Performance Engineer,
Postman
Greg Leffler
Observability Practitioner, Director,
Splunk
Ted Young
Director of Open Source Development,
LightStep
Eric D. Schabell
Director Technical Marketing & Evangelism,
Chronosphere