Performance

Performance refers to how well an application conducts itself compared to an expected level of service. Today's environments are increasingly complex and typically involve loosely coupled architectures, making it difficult to pinpoint bottlenecks in your system. Whatever your performance troubles, this Zone has you covered with everything from root cause analysis, application monitoring, and log management to anomaly detection, observability, and performance testing.

Latest Refcards and Trend Reports
Trend Report: Performance and Site Reliability
Refcard #385: Observability Maturity Model
Refcard #368: Getting Started With OpenTelemetry

DZone's Featured Performance Resources

Assessment of Scalability Constraints (and Solutions)

By Shai Almog CORE
This is an article from DZone's 2023 Software Integration Trend Report. For more: Read the Report

Our approach to scalability has gone through a tectonic shift over the past decade. Technologies that were staples in every enterprise back end (e.g., IIOP) have vanished completely with a shift to approaches such as eventual consistency. This shift introduced some complexities with the benefit of greater scalability. The rise of Kubernetes and serverless further cemented this approach: spinning up a new container is cheap, turning scalability into a relatively simple problem. Orchestration changed our approach to scalability and facilitated the growth of microservices and observability, two key tools in modern scaling.

Vertical to Horizontal Scaling

The rise of Kubernetes correlates with the microservices trend, as seen in Figure 1. Kubernetes heavily emphasizes horizontal scaling, in which replicated servers provide scaling, as opposed to vertical scaling, in which we derive performance and throughput from a single host (many machines vs. a few powerful machines).

Figure 1: Google Trends chart showing the correlation between Kubernetes and microservices (Data source: Google Trends)

In order to maximize horizontal scaling, companies focus on the idempotency and statelessness of their services. This is easier to accomplish with smaller isolated services, but the complexity shifts in two directions:

Ops – Managing the complex relations between multiple disconnected services
Dev – Quality, uniformity, and consistency become an issue

Complexity doesn't go away with a switch to horizontal scaling. It shifts to a distinct form handled by a different team, such as network complexity instead of object graph complexity. The consensus around starting with a monolith isn't just about the ease of programming. Horizontal scaling is deceptively simple thanks to Kubernetes and serverless. However, this masks a level of complexity that is often harder to gauge for smaller projects.

Scaling is a process, not a single operation; processes take time and require a team. A good analogy is physical traffic: we often reach a slow junction and wonder why the city didn't build an overpass. The reason could be that an overpass would ease the jam at the current junction but create a much bigger traffic jam down the road. The same is true for scaling a system — all of our planning might make matters worse, meaning that a faster server can overload a node in another system. Scalability is not performance!

Scalability vs. Performance

Scalability and performance can be closely related, in which case improving one can also improve the other. However, in other cases, there may be trade-offs between scalability and performance. For example, a system optimized for performance may be less scalable because it may require more resources to handle additional users or requests. Meanwhile, a system optimized for scalability may sacrifice some performance to ensure that it can handle a growing workload.

To strike a balance between scalability and performance, it's essential to understand the requirements of the system and the expected workload. For example, if we expect a system to have few users, performance may be more critical than scalability. However, if we expect a rapidly growing user base, scalability may be more important than performance. We see this expressed perfectly in the trend towards horizontal scaling.
Modern Kubernetes systems usually focus on many small VM images with a limited number of cores, as opposed to powerful machines/VMs. A system focused on performance would deliver better results using a few high-performance machines.

Challenges of Horizontal Scale

Horizontal scaling brought with it a unique class of problems that birthed new fields in our industry: platform engineers and SREs are prime examples. The complexity of maintaining a system with thousands of concurrent server processes is enormous. Such a scale makes it much harder to debug and isolate issues, and the asynchronous nature of these systems exacerbates the problem. Eventual consistency creates situations we can't realistically replicate locally, as we see in Figure 2. When a change needs to occur on multiple microservices, they pass through an inconsistent state, which can lead to invalid states.

Figure 2: Inconsistent state may exist between wide-sweeping changes

Typical solutions used for debugging dozens of instances don't apply when we have thousands of instances running concurrently. Failure is inevitable, and at these scales, it usually amounts to restarting an instance. On the surface, orchestration solved the problem, but the overhead and resulting edge cases make fixing such problems even harder.

Strategies for Success

We can answer such challenges with a combination of approaches and tools. There is no "one size fits all," and it is important to practice agility when dealing with scaling issues. We need to measure the impact of every decision and tool, then form decisions based on the results. Observability serves a crucial role in measuring success. In the world of microservices, there's no way to measure the success of scaling without such tooling. Observability tools also serve as a benchmark to pinpoint scalability bottlenecks, as we will cover soon enough.

Vertically Integrated Teams

Over the years, developers tended to silo themselves based on expertise, and as a result, we formed teams to suit these silos. This is problematic. An engineer making a decision that might affect resource consumption or such a tradeoff needs to be educated about the production environment. When building a small system, we can afford to ignore such issues; as scale grows, however, we need a heterogeneous team that can advise on such matters.

By assembling a full-stack team that is feature-driven and small, the team can handle all the different tasks required. However, this isn't a balanced team. Typically, a DevOps engineer will work with multiple teams simply because there are far more developers than DevOps engineers. This is logistically challenging, but the division of work makes more sense this way. When a particular microservice fails, responsibilities are clear, and the team can respond swiftly.

Fail-Fast

One of the biggest pitfalls to scalability is the fail-safe approach. Code might fail subtly and run in a non-optimal form. A good example is code that tries to read a response from a website. In case of failure, we might return cached data to facilitate a fail-safe strategy. However, the cache only kicks in after the request has timed out, so we still wait out the full delay. Everything seems to work correctly thanks to the cache, but performance sits right at the timeout boundary, which delays processing. With asynchronous code, this is hard to notice and doesn't put an immediate toll on the system, so such issues can go unnoticed.
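To make this concrete, here is a minimal sketch of the difference (the endpoint and cache contents are placeholders, not code from the article): the fail-safe version quietly burns the whole timeout before serving stale data, while the fail-fast version surfaces the failure within a tight budget.

```python
import time
import requests

CACHE = {"rates": {"EUR": 0.92}}        # hypothetical pre-warmed cache
URL = "https://api.example.com/rates"   # placeholder endpoint

def fetch_fail_safe():
    """Falls back to the cache, but only after waiting out the full timeout."""
    try:
        return requests.get(URL, timeout=30).json()  # hangs up to 30s on a dead upstream
    except requests.RequestException:
        return CACHE["rates"]  # "works", yet every call took ~30 seconds

def fetch_fail_fast():
    """Gives the upstream a tight budget and makes the failure visible."""
    try:
        return requests.get(URL, timeout=0.5).json()  # give up within 500 ms
    except requests.RequestException as err:
        print(f"upstream failed fast: {err!r}")  # surface the failure to logs/alerting
        return CACHE["rates"]

if __name__ == "__main__":
    start = time.monotonic()
    fetch_fail_fast()
    print(f"answered in {time.monotonic() - start:.2f}s")
```

Both variants degrade gracefully, but only the fail-fast one makes the latency and the error visible to monitoring instead of hiding them at the timeout boundary.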
A request might succeed in the testing and staging environments but always fall back to the fail-safe process in production. Failing fast has several advantages in these scenarios:

It makes bugs easier to spot in the testing phase. Failure is relatively easy to test, as opposed to durability.
A failure will trigger fallback behavior faster and prevent a cascading effect.
Problems are easier to fix, as they are usually in the same isolated area as the failure.

API Gateway and Caching

Internal APIs can leverage an API gateway to provide smart load balancing, caching, and rate limiting. Typically, caching is the most universal performance tip one can give. But when it comes to scale, failing fast might be even more important. In typical cases of heavy load, the division of users is stark. By limiting the heaviest users, we can dramatically shift the load on the system.

Distributed caching is one of the hardest problems in programming. Implementing a caching policy over microservices is impractical; we need to cache each individual service and use the API gateway to alleviate some of the overhead.

Level 2 caching is used to store database data in RAM and avoid DB access. This is often a major performance benefit that tips the scales, but sometimes it doesn't have an impact at all. Stack Overflow recently discovered that database caching had no impact on their architecture, because higher-level caches filled in the gaps and grabbed all the cache hits at the web layer. By the time a call reached the database layer, it was clear the data wasn't in the cache. Thus, they always missed the cache, and it had no impact — only overhead. This is where caching in the API gateway layer becomes immensely helpful. It is a system we can manage centrally and control, unlike the cache in an individual service, which might get polluted.

Observability

What we can't see, we can't fix or improve. Without a proper observability stack, we are blind to scaling problems and to the appropriate fixes. When discussing observability, we often make the mistake of focusing on tools. Observability isn't about tools — it's about questions and answers. When developing an observability stack, we need to understand the types of questions we will have for it and then provide two means to answer each question.

It is important to have two means: observability is often unreliable and misleading, so we need a way to verify its results. However, if we have more than two ways, it might mean we over-observe the system, which can have a serious impact on costs.

A typical exercise to verify an observability stack is to hypothesize common problems and then find two ways to solve them. For example, for a performance problem in microservice X:

Inspect the logs of the microservice for errors or latency — this might require adding a specific log for coverage.
Inspect the Prometheus metrics for the service.

Tracking a scalability issue within a microservices deployment is much easier when working with traces. They provide context and scale. When an edge service runs into an N+1 query bug, traces show that almost immediately when they're properly integrated throughout.

Segregation

One of the most important scalability approaches is the separation of high-volume data. Modern business tools save tremendous amounts of metadata for every operation. Most of this data isn't applicable to the day-to-day operations of the application. It is metadata meant for business intelligence, monitoring, and accountability.
We can stream this data to remove the immediate need to process it, and store it in a separate time-series database to alleviate the scaling challenges on the primary database.

Conclusion

Scaling in the age of serverless and microservices is a very different process than it was a mere decade ago. Controlling costs has become far harder, especially observability costs, which in the case of logs often exceed 30 percent of the total cloud bill. The good news is that we have many new tools at our disposal — including API gateways, observability, and much more. By leveraging these tools with a fail-fast strategy and tight observability, we can iteratively scale the deployment. This is key, as scaling is a process, not a single action. Tools can only go so far, and we can often overuse them. In order to grow, we need to review and even eliminate unnecessary optimizations if they are no longer applicable.

This is an article from DZone's 2023 Software Integration Trend Report. For more: Read the Report
The Power of Zero-Knowledge Proofs: Exploring the New ConsenSys zkEVM

By John Vester CORE
It's well-known that Ethereum needs support in order to scale. A variety of L2s (layer twos) have launched or are in development to improve Ethereum's scalability. Among the most popular L2s are zero-knowledge-based rollups (also known as zk-rollups). Zk-rollups offer a solution that has both high scalability and minimal costs. In this article, we'll define what zk-rollups are and review the latest in the market, the new ConsenSys zkEVM.

This new zk-rollup—a fully EVM-equivalent L2 by ConsenSys—makes building with zero-knowledge proofs easier than ever. ConsenSys achieves this by allowing developers to port smart contracts easily, stay with the same toolset they already use, and bring users along with them smoothly—all while staying highly performant and cost-effective.

If you don't know a lot about zk-rollups, you'll find how they work fascinating. They're at the cutting edge of computer science. And if you do already know about zk-rollups, and you're a Solidity developer, you'll be interested in how the new ConsenSys zkEVM makes your dApp development a whole lot easier. It's zk-rollup time! So let's jump in.

The Power of Zero-Knowledge Proofs

Zk-rollups depend on zero-knowledge proofs. But what is a zero-knowledge proof? A zero-knowledge proof allows you to prove a statement is true—without sharing what the actual statement is, or how the truth was discovered. At its most basic, a prover passes secret information to an algorithm to compute the zero-knowledge proof. Then a verifier uses this proof with another algorithm to check that the prover actually knows the secret information. All this happens without revealing the actual information.

There are a lot of details behind that statement. Check out this article if you want to understand the cryptographic magic behind how it all works. But for our purposes, what's important are the use cases of zero-knowledge proofs. A few examples:

Anonymous payments—Traditional digital payments are not private, and even most crypto payments are on public blockchains. Zero-knowledge proofs offer a way to make truly private transactions. You can prove you paid for something … without revealing any details of the transaction.
Identity protection—With zero-knowledge proofs, you can prove details of your personal identity while still keeping them private. For example, you can prove citizenship … without revealing your passport.

And the most important use case for our purposes: verifiable computation.

What Is Verifiable Computation?

Verifiable computation means you can have some other entity process computations for you and trust that the results are true … without knowing any of the details of the transaction. That means a layer 2 blockchain, such as the ConsenSys zkEVM, can become the outsourced computation layer for Ethereum. It can process a batch of transactions (much faster than Ethereum), create the proof for the validity of the transactions, and submit just the results and the proof to Ethereum. Ethereum, since it has the proof, doesn't need the details—nor does it need a way to prove that the results are true. So instead of processing every transaction, Ethereum offloads the work to a separate chain. All Ethereum has to do is apply the results to its state. This vastly improves the speed and scalability of Ethereum.

Exploring the New ConsenSys zkEVM and Why It's Important

Several zk-rollup L2s for Ethereum have already been released or are in progress. But the ConsenSys zkEVM could be the king.
Let's look at why.

Type 2 zk-EVM

For one thing, it's a Type 2 zk-EVM—an evolution of zk-rollups. It's faster and easier to use than Type 1 zk solutions, and it offers better scalability and performance while still being fully EVM-equivalent. Traditionally with zk-proofs, it's computationally expensive and slow for the prover to create proofs, which limits the capabilities and usefulness of the rollup. However, the ConsenSys zkEVM uses a recursion-friendly, lattice-based zkSNARK prover—which means faster finality and seamless withdrawals, all while retaining the security of Ethereum settlement. And it delivers ultra-low gas fees.

Solves the Problems of Traditional L2s

Second, the ConsenSys zkEVM solves many of the practical problems of other L2s:

Zero switching costs - It's super easy to port smart contracts to the zkEVM. The zkEVM is EVM-equivalent down to the bytecode, so there is no rewriting of code or smart contracts. You already know what you need to know to get started, and your current smart contracts already work.
Easy to move your dApp users to the L2 - The zkEVM is supported by MetaMask, the leading web3 wallet. So most of your users are probably already able to access the zkEVM.
Easy for devs - The zkEVM supports most popular tools out of the box. You can build, test, debug, and deploy your smart contracts with Hardhat, Infura, Truffle, etc. All the tools you use now, you can keep using. And there is already a bridge to move tokens onto and off the network.
It uses ETH for gas - There's no native token for the zkEVM, so you don't need to worry about new tokens, third-party transpilers, or custom middleware.
It's all open source!

How To Get Started Using the ConsenSys zkEVM

The zkEVM private testnet was released in December 2022 and is moving to public testnet on March 28th, 2023. It has already processed 774,000 transactions (and growing). There are lots of dApps already: Uniswap, The Graph, Hop, and others. You can read the documentation for the zkEVM and deploy your own smart contract.

Conclusion

It's definitely time for zk-rollups to shine. They are evolving quickly and leading the way in helping Ethereum scale. It's a great time to jump in and learn how they work—and building with the ConsenSys zkEVM is a great place to start! Have a really great day!
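To make the prover/verifier flow from "The Power of Zero-Knowledge Proofs" section concrete, here is a toy Schnorr-style proof of knowledge of a discrete logarithm in Python. It is vastly simpler than the zkSNARKs a real rollup uses, and the parameters are deliberately tiny and insecure; it only illustrates the prove-then-verify pattern.

```python
import secrets

# Toy group parameters: p = 2q + 1, with g generating the subgroup of prime order q.
# These numbers are far too small for real use; they only illustrate the flow.
p, q, g = 23, 11, 4

# Prover's secret x and public key y = g^x mod p.
x = secrets.randbelow(q - 1) + 1
y = pow(g, x, p)

# 1. Commit: the prover picks a random nonce r and sends t = g^r mod p.
r = secrets.randbelow(q - 1) + 1
t = pow(g, r, p)

# 2. Challenge: the verifier sends a random challenge c.
c = secrets.randbelow(q)

# 3. Respond: the prover sends s = r + c*x mod q (this reveals nothing about x on its own).
s = (r + c * x) % q

# 4. Verify: g^s must equal t * y^c mod p; the verifier is convinced without learning x.
assert pow(g, s, p) == (t * pow(y, c, p)) % p
print("verifier is convinced the prover knows x, without learning x")
```

Production zk-rollups replace this interactive, single-secret protocol with succinct, non-interactive proofs over entire batches of EVM transactions, but the trust model is the same: check the proof instead of redoing the computation.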
11 Observability Tools You Should Know
By Lahiru Hewawasam
Testing Your Monitoring Configurations
By Phil Wilkins
Green Software and Carbon Hack
By Beste Bayhan
Getting Started With Prometheus Workshop: Introduction to Prometheus

Are you looking to get away from proprietary instrumentation? Are you interested in open-source observability but lack the knowledge to just dive right in? If so, this workshop is for you, designed to expand your knowledge and understanding of the open-source observability tooling that is available to you today.

Dive right into a free, online, self-paced, hands-on workshop introducing you to Prometheus. Prometheus is an open-source systems monitoring and alerting toolkit that enables you to hit the ground running with discovering, collecting, and querying your observability today. Over the course of this workshop, you will learn what Prometheus is and is not, install it, start collecting metrics, and learn all the things you need to know to become effective at running Prometheus in your observability stack.

In this article, you'll be introduced to some basic concepts and learn what Prometheus is and is not before you start getting hands-on with it in the rest of the workshop.

Introduction to Prometheus

I'm going to get you started on your learning path with this first lab, which provides a quick introduction to all things needed for metrics monitoring with Prometheus. Note that this article is only a short summary, so please see the complete lab found online to work through it in its entirety yourself. The following is a short overview of what is in this specific lab of the workshop.

Each lab starts with a goal. In this case, it is fairly simple: this lab introduces you to the Prometheus project and provides you with an understanding of its role in the cloud-native observability community.

The start is background on the beginnings of the Prometheus project and how it came to be part of the Cloud Native Computing Foundation (CNCF) as a graduated project. This leads to some basic outlining of what a data point is, how data points are gathered, and what makes them a metric, all using a high-level metaphor. You are then walked through what Prometheus is, why we are looking at this project as an open-source solution for cloud-native observability, and, more importantly, what Prometheus cannot do for you.

A basic architecture is presented, walking you through the most common usage and components of a Prometheus metrics deployment, and the section closes with a final overview diagram of the Prometheus architecture.

You are then presented with an overview of all the powerful features and tools you'll find in your new Prometheus toolbox:

Dimensional data model - for multifaceted tracking of metrics
Query language - PromQL provides a powerful syntax to gather flexible answers across your gathered metrics data
Time series processing - integration of metrics time series data processing and alerting
Service discovery - integrated discovery of systems and services in dynamic environments
Simplicity and efficiency - operational ease combined with implementation in the Go language

Finally, you'll touch on the fact that Prometheus has a very simple design and functioning principle, and that this has an impact on running it as a highly available (HA) component in your architecture. This aspect is only briefly touched upon, but don't worry: we cover it in more depth later in the workshop.

At the end of each lab, including this one, you are presented with the end state (in this case, we have not yet done anything), a list of references for further reading, a list of ways to contact me with questions, and a link to the next lab.

Missed Previous Labs?

This is one lab in the more extensive free online workshop.
Feel free to start from the very beginning of this workshop if you missed anything previously. You can always proceed at your own pace and return any time you like as you work your way through it. Just stop and later restart Perses to pick up where you left off.

Coming up Next

I'll be taking you through the following lab in this workshop, where you'll learn how to install and set up Prometheus on your own local machine. Stay tuned for more hands-on material to help you with your cloud-native observability journey.
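The labs themselves focus on running the Prometheus server; purely as an appetizer (and not an excerpt from the workshop), here is a minimal sketch of what exposing application metrics with the official prometheus_client Python library looks like:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Two dimensional metrics as described above: a counter and a latency histogram,
# both labeled so PromQL can slice them by endpoint.
REQUESTS = Counter("app_requests_total", "Total requests handled", ["endpoint"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds", ["endpoint"])

def handle(endpoint: str) -> None:
    """Pretend to serve a request while recording metrics about it."""
    with LATENCY.labels(endpoint).time():
        time.sleep(random.uniform(0.01, 0.2))  # simulated work
    REQUESTS.labels(endpoint).inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        handle(random.choice(["/checkout", "/search"]))
```

Pointing a Prometheus scrape job at port 8000 would then let you query, for example, rate(app_requests_total[5m]) in PromQL.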

By Eric D. Schabell CORE
Isolating Noisy Neighbors in Distributed Systems: The Power of Shuffle-Sharding

Effective resource management is essential to ensure that no single client or task monopolizes resources and causes performance issues for others. Shuffle-sharding is a valuable technique to achieve this. By dividing resources into equal segments and periodically shuffling them, shuffle-sharding can distribute resources evenly and prevent any client or task from relying on a specific segment for too long. This technique is especially useful in scenarios with a risk of bad actors or misbehaving clients or tasks. In this article, we'll explore shuffle-sharding in depth, discussing how it balances resources and improves overall system performance.

Model

Before implementing shuffle-sharding, it's important to understand its key dimensions, parameters, trade-offs, and potential outcomes. Building a model and simulating different scenarios can help you develop a deeper understanding of how shuffle-sharding works and how it may impact your system's performance and availability. That's why we'll explore shuffle-sharding in more detail, using a Colab notebook as our playground. We'll discuss its benefits, limitations, and the factors to consider before implementing it. By the end of this post, you'll have a better idea of what shuffle-sharding can and can't do and whether it's a suitable technique for your specific use case.

In practical applications, shuffle-sharding is often used to distribute available resources evenly among different queries or tasks. This can involve mapping different clients or connections to subsets of nodes or containers, or assigning specific cores to different query types (or "queries" for short). In our simulation, we linked queries to CPU cores. The goal is to ensure that the available CPU resources are shared fairly among all queries, preventing any query from taking over the resources and negatively impacting the performance of others.

To achieve this, each query is limited to only 25% of the available cores, and no two queries have more than one core in common. This helps to minimize overlap between queries and prevent any one query from consuming more than its fair share of resources. A visualization of how the cores (columns) are allocated to each query type (rows) shows how overlap between them is minimized: each query type has exactly three cores assigned.

The maximum overlap between rows is just one bit (i.e., 33% of the assigned cores), and the average overlap is ~0.5 bits (less than 20% of the assigned cores). This means that even if one query type were to take over 100% of its allocated cores, the others would still have enough capacity to run, unlike uniform assignment, where a rogue query could monopolize the whole node's CPU.

To evaluate the impact of different factors on the performance of the system, we conducted four simulations across two dimensions:

Uniform query assignment, where any query type can be assigned to any core, vs. shuffle-sharding assignment, where queries are assigned based on shuffle-sharding principles.
Baseline, where all queries are well-behaved, vs. the presence of a bad query type that takes 100% of the CPU resources and never completes.

Let's take a look at the error rate (which doesn't include the bad query type, as it fails in 100% of cases). Looking at the error rate plot, we can observe that the Baseline Uniform scenario has a slightly higher saturation point than the Baseline Shuffle-Sharding scenario, reaching around a 5% higher query rate before the system starts to degrade.
This is expected, as shuffle-sharding partitions the CPU cores into smaller sections, which can reduce the efficiency of the resource allocation when the system is near its full capacity. However, when comparing the performance of Uniform vs. Shuffle-Sharding in the presence of a noisy neighbor that seizes all the available resources, we see that Shuffle-Sharding outperforms Uniform by approximately 25%. This demonstrates that the benefits of shuffle-sharding in preventing resource takeover and ensuring fair resource allocation outweigh the minor reduction in efficiency under normal operating conditions.

In engineering, trade-offs are a fact of life, and shuffle-sharding is no exception. While it may decrease the saturation point during normal operations, it significantly reduces the risk of outages when things don't go as planned — which is inevitable sooner or later.

System Throughput

In addition to error rates, another key metric for evaluating the performance of a system is throughput, which measures the number of queries the system can handle depending on the QPS rate. To analyze the system's throughput, we looked at the same data from a different angle. In the throughput plot, we can see a slight difference between the Baseline Uniform and Baseline Shuffle-Sharding scenarios, where Uniform slightly outperforms Sharding at low QPS rates. However, the difference becomes much more significant when we introduce a faulty client/task/query that monopolizes all the available resources. In this scenario, Shuffle-Sharding outperforms Uniform by a considerable margin.

Latency

Now let's look at the latency graphs, which show the average, median (p50), and p90 latency of the different scenarios. In the Uniform scenario, we can see that the latency of all requests approaches the timeout threshold pretty quickly at all levels. This demonstrates that resource monopolization can have a significant impact on the performance of the entire system, even for well-behaved queries.

In the Sharding scenario, we can observe that the system handles the situation much more effectively and keeps the latency of well-behaved queries as if nothing had happened, until it reaches a saturation point that is very close to the total system capacity. This is an impressive result, highlighting the benefits of shuffle-sharding in isolating the latency impact of a noisy or misbehaving neighbor.

CPU Utilization

At the heart of shuffle-sharding is the idea of distributing resources so that a fault floods only one compartment instead of sinking the whole ship. To illustrate this concept, let's look at the simulated CPU data. In the Uniform simulation, CPU saturation occurs almost instantly, even at low QPS rates. This highlights how resource monopolization can significantly impact system performance, even under minimal load. However, in the Sharding simulation, the system maintains consistent and reliable performance, even under challenging conditions. These simulation results align with the latency and error graphs we saw earlier — the bad actor was isolated and only impacted 25% of the system's capacity, leaving the remaining 75% available for well-behaved queries.

Closing Thoughts

In conclusion, shuffle-sharding is a valuable technique for balancing limited resources between multiple clients or tasks in distributed systems.
Its ability to prevent resource monopolization and ensure fair resource allocation can improve system stability and maintain consistent and reliable performance, even in the presence of faulty clients, tasks, or queries. Additionally, shuffle-sharding can help reduce the blast radius of faults and improve system isolation, highlighting its importance in designing more stable and reliable distributed systems.

Of course, in the event of outages, other measures should be applied, such as rate-limiting the offending client/task or moving it to dedicated capacity to minimize system impact. Effective operational practices are critical to maximize the benefits of shuffle-sharding. For other techniques that can be used in conjunction with shuffle-sharding, check out the links below. Also, feel free to play around with the simulation and change the parameters, such as the number of query types, cores, etc., to get a sense of the model and how different parameters may affect it.

This post continues the theme of improving service performance and availability touched on in previous posts: Ensuring Predictable Performance in Distributed Systems, Navigating the Benefits and Risks of Request Hedging for Network Services, and FIFO vs. LIFO: Which Queueing Strategy Is Better for Availability and Latency?
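To make the assignment scheme described above concrete, here is a small, self-contained sketch (not the simulation notebook itself) that greedily assigns each query type three of twelve cores while keeping the overlap between any two assignments to at most one core:

```python
from itertools import combinations

CORES = range(12)   # 12 CPU cores on the node
SHARD_SIZE = 3      # each query type gets 3 cores (25% of the node)
MAX_OVERLAP = 1     # no two query types may share more than one core

def build_shards(query_types):
    """Greedily assign each query type a shard that honors the overlap limit."""
    shards = {}
    candidates = list(combinations(CORES, SHARD_SIZE))
    for q in query_types:
        for cand in candidates:
            if all(len(set(cand) & set(s)) <= MAX_OVERLAP for s in shards.values()):
                shards[q] = cand
                break
        else:
            raise ValueError(f"no shard available for {q}; relax the constraints")
    return shards

if __name__ == "__main__":
    for query, shard in build_shards(["reads", "writes", "reports", "admin"]).items():
        print(f"{query:>8} -> cores {shard}")
```

With this layout, a runaway query type can saturate at most its own three cores (25% of the node), and every other type still has at least two unaffected cores to run on.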

By Eugene Retunsky
Key Elements of Site Reliability Engineering (SRE)

Site Reliability Engineering (SRE) is a systematic and data-driven approach to improving the reliability, scalability, and efficiency of systems. It combines principles of software engineering, operations, and quality assurance to ensure that systems meet performance goals and business objectives. This article discusses the key elements of SRE, including reliability goals and objectives, reliability testing, workload modeling, chaos engineering, and infrastructure readiness testing. The importance of SRE in improving user experience, system efficiency, scalability, and reliability, and in achieving better business outcomes, is also discussed.

SRE is an emerging field that seeks to address the challenge of delivering high-quality, highly available systems. It combines the principles of software engineering, operations, and quality assurance to ensure that systems meet performance goals and business objectives. SRE is a proactive and systematic approach to reliability optimization characterized by the use of data-driven models, continuous monitoring, and a focus on continuous improvement.

SRE is a combination of software engineering and IT operations, combining the principles of DevOps with a focus on reliability. The goal of SRE is to automate repetitive tasks and to prioritize availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. The benefits of adopting SRE include increased reliability, faster resolution of incidents, reduced mean time to recovery, improved efficiency through automation, and increased collaboration between development and operations teams. In addition, organizations that adopt SRE principles can improve their overall system performance, increase the speed of innovation, and better meet the needs of their customers.

SRE: 5 Whys

1. Why Is SRE Important for Organizations?
SRE is important for organizations because it ensures high availability, performance, and scalability of complex systems, leading to improved user experience and better business outcomes.

2. Why Is SRE Necessary in Today's Technology Landscape?
SRE is necessary in today's technology landscape because systems and infrastructure have become increasingly complex and prone to failures, and organizations need a reliable and efficient approach to manage these systems.

3. Why Does SRE Involve Combining Software Engineering and Systems Administration?
SRE involves combining software engineering and systems administration because both disciplines bring unique skills and expertise to the table. Software engineers have a deep understanding of how to design and build scalable and reliable systems, while systems administrators have a deep understanding of how to operate and manage these systems in production.

4. Why Is Infrastructure Readiness Testing a Critical Component of SRE?
Infrastructure readiness testing is a critical component of SRE because it ensures that the infrastructure is prepared to support the desired system reliability goals. By testing the capacity and resilience of infrastructure before it is put into production, organizations can avoid critical failures and improve overall system performance.

5. Why Is Chaos Engineering an Important Aspect of SRE?
Chaos engineering is an important aspect of SRE because it tests the system's ability to handle and recover from failures in real-world conditions.
By proactively identifying and fixing weaknesses, organizations can improve the resilience and reliability of their systems, reducing downtime and increasing confidence in their ability to respond to failures.

Key Elements of SRE

Reliability Metrics, Goals, and Objectives: Defining the desired reliability characteristics of the system and setting reliability targets.
Reliability Testing: Using reliability testing techniques to measure and evaluate system reliability, including disaster recovery testing, availability testing, and fault tolerance testing.
Workload Modeling: Creating mathematical models to represent system reliability, including Little's Law and capacity planning.
Chaos Engineering: Intentionally introducing controlled failures and disruptions into production systems to test their ability to recover and maintain reliability.
Infrastructure Readiness Testing: Evaluating the readiness of an infrastructure to support the desired reliability goals of a system.

Reliability Metrics in SRE

Reliability metrics are used in SRE to measure the quality and stability of systems, as well as to guide continuous improvement efforts.

Availability: The proportion of time a system is available and functioning correctly. It is often expressed as a percentage and calculated as the total uptime divided by the total time the system is expected to be running.
Response Time: The time it takes for the infrastructure to respond to a user request.
Throughput: The number of requests that can be processed in a given time period.
Resource Utilization: The utilization of the infrastructure's resources, such as CPU, memory, network, heap, caching, and storage.
Error Rate: The number of errors or failures that occur during the testing process.
Mean Time to Recovery (MTTR): The average time it takes to recover from a system failure or disruption, which provides insight into how quickly the system can be restored after a failure occurs.
Mean Time Between Failures (MTBF): The average time between failures for a system. MTBF helps organizations understand how reliable a system is over time and can inform decision-making about when to perform maintenance or upgrades.

Reliability Testing in SRE

Performance Testing: Evaluating the response time, processing time, and resource utilization of the infrastructure to identify any performance issues under a business-as-usual (BAU) 1x load scenario.
Load Testing: Simulating real-world user traffic and measuring the performance of the infrastructure under heavy load (2x load).
Stress Testing: Applying more load than the expected maximum to test the infrastructure's ability to handle unexpected traffic spikes (3x load).
Chaos or Resilience Testing: Simulating different types of failures (e.g., network outages, hardware failures) to evaluate the infrastructure's ability to recover and continue operating.
Security Testing: Evaluating the infrastructure's security posture and identifying any potential vulnerabilities or risks.
Capacity Planning: Evaluating the current and future hardware, network, and storage requirements of the infrastructure to ensure it has the capacity to meet growing demand.
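Most of the reliability metrics defined above are simple ratios over incident data. As a quick illustration (with made-up outage windows, not figures from this article), availability, MTTR, and MTBF can be computed like this:

```python
from datetime import datetime, timedelta

# Hypothetical outage windows (start, end) observed over a 30-day period.
OUTAGES = [
    (datetime(2023, 3, 3, 10, 0), datetime(2023, 3, 3, 10, 24)),
    (datetime(2023, 3, 17, 2, 10), datetime(2023, 3, 17, 3, 0)),
]
PERIOD = timedelta(days=30)

downtime = sum((end - start for start, end in OUTAGES), timedelta())
availability = 100 * (1 - downtime / PERIOD)   # percentage of the period spent up
mttr = downtime / len(OUTAGES)                 # mean time to recovery per incident
mtbf = (PERIOD - downtime) / len(OUTAGES)      # mean time between failures

print(f"availability: {availability:.3f}%")
print(f"MTTR: {mttr}, MTBF: {mtbf}")
```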
Workload Modeling in SRE

Workload modeling is a crucial part of SRE, which involves creating mathematical models to represent the expected behavior of systems. Little's Law is a key principle in this area. It states that the average number of items in a system, W, is equal to the average arrival rate (λ) multiplied by the average time each item spends in the system (T): W = λ × T. This formula can be used to determine the expected number of requests a system can handle under different conditions.

Example: Consider a system that receives an average of 200 requests per minute, with an average response time of 2 seconds. Applying Little's Law with consistent units gives W = λ × T = (200 requests / 60 seconds) × 2 seconds ≈ 6.7 requests. This result indicates that, on average, roughly seven requests are in flight at any moment. If the system cannot sustain that level of concurrency — or if the arrival rate or response time grows — queues build up and reliability degradation occurs.

By using the right workload modeling, organizations can determine the maximum workload that their systems can handle, take proactive steps to scale their infrastructure and improve reliability, and identify potential issues and design solutions to improve system performance before they become real problems.

Tools and techniques used for modeling and simulation:

Performance Profiling: Monitoring the performance of an existing system under normal and peak loads to identify bottlenecks and determine the system's capacity limits.
Load Testing: Simulating real-world user traffic to test the performance and stability of an IT system. Load testing helps organizations identify performance issues and ensure that the system can handle expected workloads.
Traffic Modeling: Creating a mathematical model of the expected traffic patterns on a system. The model can be used to predict resource utilization and system behavior under different workload scenarios.
Resource Utilization Modeling: Creating a mathematical model of the expected resource utilization of a system, used to predict utilization and behavior in the same way.
Capacity Planning Tools: Various tools that automate the process of capacity planning, including spreadsheet tools, predictive analytics tools, and cloud-based tools.
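Returning to the Little's Law example above before moving on to chaos engineering, here is a tiny sketch of that calculation in code, with the unit conversion made explicit (the eight-worker capacity check at the end is purely hypothetical):

```python
import math

def littles_law_concurrency(arrivals_per_minute: float, response_time_seconds: float) -> float:
    """W = lambda * T, with lambda converted to requests/second so the units cancel."""
    arrivals_per_second = arrivals_per_minute / 60.0
    return arrivals_per_second * response_time_seconds

if __name__ == "__main__":
    # The worked example above: 200 requests/minute with a 2-second response time.
    in_flight = littles_law_concurrency(200, 2)
    print(f"average requests in flight: {in_flight:.1f}")  # ~6.7

    # Hypothetical capacity check: with 8 worker threads per instance,
    # a single instance already covers the average concurrency.
    print("instances needed:", math.ceil(in_flight / 8))
```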
Chaos Engineering and Infrastructure Readiness in SRE

Chaos engineering and infrastructure readiness are important components of a successful SRE strategy. Both involve intentionally inducing failures and stress in systems to assess their strength and identify weaknesses. Infrastructure readiness testing is done to verify the system's ability to handle failure scenarios, while chaos engineering tests the system's recovery and reliability under adverse conditions.

The benefits of chaos engineering include improved system reliability, reduced downtime, and increased confidence in the system's ability to handle real-world failures. By proactively identifying and fixing weaknesses, organizations can avoid costly downtime, improve customer experience, and reduce the risk of data loss or security breaches. Integrating chaos engineering into DevOps practices (CI/CD) can ensure systems are thoroughly tested and validated before deployment.

Methods of chaos engineering typically involve running experiments or simulations on a system to stress and test its various components, identify any weaknesses or bottlenecks, and assess its overall reliability. This is done by introducing controlled failures, such as network partitions, simulated resource exhaustion, or random process crashes, and observing the system's behavior and response.

Example Scenarios for Chaos Testing

Random Instance Termination: Selecting and terminating an instance from a cluster to test the system's response to the failure.
Network Partition: Partitioning the network between instances to simulate a network failure and assess the system's ability to recover.
Increased Load: Increasing the load on the system to test its response to stress and observing any performance degradation or resource exhaustion.
Configuration Change: Altering a configuration parameter to observe the system's response, including any unexpected behavior or errors.
Database Failure: Simulating a database failure by shutting it down and observing the system's reaction, including any errors or unexpected behavior.

By conducting both chaos experiments and infrastructure readiness testing, organizations can deepen their understanding of system behavior and improve their resilience and reliability.

Conclusion

In conclusion, SRE is a critical discipline for organizations that want to deliver highly reliable, highly available systems. By adopting SRE principles and practices, organizations can improve system reliability, reduce downtime, and improve the overall user experience.

By Srikarthick Vijayakumar
5 Reasons You Need to Care About API Performance Monitoring

Connectivity can seem daunting, yet we are all used to instant connectivity that puts the world at our fingertips. We can purchase, post, and pick anything, anywhere, with the aid of desktops and devices. But how does it happen? How do different applications on different devices connect with each other, allowing us to place an order, plan a vacation, or make a reservation with just a few clicks? The answer is the API—the Application Programming Interface—the often underrated, unsung hero of the modern world.

What Is an API?

APIs are the building blocks of online connectivity. They are a medium for multiple applications, data, and devices to interact with each other. Simply put, an API is a messenger that takes a request, tells the system what you want to do, and then returns the response to the user. Documentation is drafted for every API, including specifications regarding how the information gets transferred between the two systems.

Why Are APIs Important?

APIs can interact with third-party applications publicly, ultimately upscaling the reach of an organization's business. So, when we book a ticket via Bookmyshow.com, we fill in details regarding the movie we plan to watch, like:

Movie name
Locality
3D/2D
Language

These details are fetched by the API and taken to servers associated with different movie theatres to bring back a collected response from multiple third-party servers, providing users the convenience of choosing which theatre fits best. This is how different applications interact with each other.

Instead of building one large application and adding more functionality to it in code, the present time demands a microservice architecture, wherein we create multiple individually focused modules with well-defined interfaces and then combine them to make a scalable, testable product. The product or software, which might have taken a year to deliver, can now be delivered in weeks with the help of microservice architecture. APIs are a necessity for microservice architecture. Consider an application that delivers music, shopping, and bill payment services to end users under a single hood. The user needs to log into the app and select the service for consumption. APIs are needed to collaborate the different services in such applications, contributing to an overall enhanced UX.

APIs also enable an extra layer of security for the data: neither the user's data is overexposed to the server, nor is the server's data overexposed to the user. Say, in the case of movies, the API tells the server what the user would like to watch and then tells the user what they have to give to redeem the service. Ultimately, you get to watch your movie, and the service provider is credited accordingly.

API Performance Monitoring vs. Application Performance Monitoring

As similar as these two terms sound, they perform distinct checks on the overall application connectivity.

Application performance monitoring is compulsory for high-level analytics regarding how well the app is executing internally. It facilitates a check on the internal connectivity of the software. The following key data factors must be monitored:

Server loads
User adoption
Market share
Downloads
Latency
Error logging

API performance monitoring is required to check whether there are any bottlenecks outside the server; they could be in the cloud or in a load-balancing service.
These bottlenecks are not dependent on your application performance monitoring but are still considered catastrophic, as they may disrupt the service for end users. API performance monitoring facilitates a check on the external connectivity of the software, aiding its core functionalities:

Back-end business operations
Alert operations
Web services

Why Is API Performance Monitoring a Necessity?

1. Functionality: With the emergence of modern agile practices, organizations are adopting a virtuous cycle of developing, testing, delivering, and maintaining by monitoring the response. It is integral to involve API monitoring as part of this practice. A script must be maintained against the appropriate and latest versions of the functional tests to ensure a flawless experience of services for the end user. Simply put, if your API goes south, your app goes with it. For instance, in January 2016, a worldwide outage hit the Twitter API. The outage lasted more than an hour and, within that period, impacted thousands of websites and applications.

2. Performance: Organizations are open to a performance reckoning if they neglect to thoroughly understand the process behind every API call. API monitoring also helps identify which APIs are performing better and how to improve the APIs with weaker performance.

3. Speed/Responsiveness: Users can specify the critical API calls in the performance monitoring tool and set their threshold (acceptable response time) to ensure they get alerted if the expected response time deteriorates.

4. Availability: With the help of monitoring, we can verify whether all the services hosted by our applications are accessible 24×7.

Why Monitor APIs When We Can Test Them?

An API test can be highly composite, considering the large number of steps involved. This creates a problem in terms of the frequency at which the test can take place. This is where monitoring steps in, allowing hourly checks of the indispensable aspects and helping us focus on what's most vital to our organization.

How To Monitor API Performance

Identify your dependent APIs—recognize the APIs you employ: are they third-party or partner APIs? Internally connecting or external?
Comprehend the functional and transactional use cases to facilitate transparency into the services being hosted — this improves performance and MTTR (mean time to repair).
Realize whether you have the test cases required to monitor. Do existing test cases need to be altered, or is there an urgency for new ones to be developed?
Know the right tool—API performance monitoring is highly dependent on the tool being used. You need an intuitive, user-friendly, result-optimizing tool with everything packed in. Some commonly known platforms for API performance testing are: CA Technologies (now Broadcom Inc.), AlertSite, Rigor, and Runscope.

One more factor to keep note of is API browser compatibility, to realize how well your API can aid different browsers. To know more about this topic, follow our blog about "API and Browser Compatibility."

Conclusion

API performance monitoring is a need of modern times that gives you a check on the internal as well as external impact of the services hosted by a product. Not everyone cares to bother about APIs, but we are glad you did! We hope this article helps expand your understanding of the topic. Cheers!
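The threshold-and-alert idea from the Speed/Responsiveness point above can be sketched in a few lines. This is a minimal, tool-agnostic illustration, not tied to any of the platforms listed; the endpoints and thresholds are placeholders:

```python
import time
import requests

# Hypothetical critical API calls and their acceptable response times (seconds).
CRITICAL_CALLS = {
    "https://api.example.com/v1/showtimes": 0.8,
    "https://api.example.com/v1/bookings": 1.2,
}

def alert(message: str) -> None:
    # Stand-in for paging, Slack, email, or whatever your tooling provides.
    print("ALERT:", message)

def check(url: str, threshold: float) -> None:
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=threshold * 3)
        elapsed = time.monotonic() - start
        if response.status_code >= 400:
            alert(f"{url} returned HTTP {response.status_code}")
        elif elapsed > threshold:
            alert(f"{url} took {elapsed:.2f}s (threshold {threshold:.2f}s)")
    except requests.RequestException as err:
        alert(f"{url} is unreachable: {err!r}")

if __name__ == "__main__":
    for url, threshold in CRITICAL_CALLS.items():
        check(url, threshold)
```

Run on a schedule, a check like this covers functionality, responsiveness, and availability in one pass; the dedicated platforms add history, dashboards, and multi-step transactions on top of the same idea.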

By Harshit Paul
GKE Cluster Optimization: 14 Tactics for a Smoother K8s Deployment

Most engineers don't want to spend more time than necessary keeping their clusters highly available, secure, and cost-efficient. How do you make sure your Google Kubernetes Engine cluster is ready for the storms ahead? Here are fourteen optimization tactics divided across three core areas of your cluster. Use them to build a resource-efficient, highly available GKE cluster with airtight security.

Here are the three core sections in this article:

Resource Management
Security
Networking

Resource Management Tips for a GKE Cluster

1. Autoscaling

Use the autoscaling capabilities of Kubernetes to make sure your workloads perform well during peak load and to control costs in times of normal or low load. Kubernetes gives you several autoscaling mechanisms. Here's a quick overview to get you up to speed:

Horizontal Pod Autoscaler: HPA adds or removes pod replicas automatically based on utilization metrics. It works great for scaling stateless and stateful applications. Use it with the Cluster Autoscaler to shrink the number of active nodes when the pod count decreases. HPA also comes in handy for handling workloads with short, high utilization spikes.
Vertical Pod Autoscaler: VPA increases and lowers the CPU and memory resource requests of pod containers to make sure the allocated and actual cluster usage match. If your HPA configuration doesn't use CPU or memory to identify scaling targets, it's best to use it with VPA.
Cluster Autoscaler: Dynamically scales the number of nodes to match the current GKE cluster utilization. It works great with workloads designed to meet dynamically changing demand.

Best Practices for Autoscaling in a GKE Cluster

Use HPA, VPA, and Node Auto-Provisioning (NAP): By using HPA, VPA, and NAP together, you let GKE efficiently scale your cluster horizontally (pods) and vertically (nodes). VPA sets values for CPU and memory requests and limits for containers, while NAP manages node pools and eliminates the default limitation of starting new nodes only from the set of user-created node pools.
Check whether your HPA and VPA policies clash: Make sure the VPA and HPA policies don't interfere with each other. For example, if HPA only relies on CPU and memory metrics, HPA and VPA cannot work together. Also, review your bin packing density settings when designing a new GKE cluster for a business- or purpose-class tier of service.
Use instance weighted scores: This allows you to determine how much of your chosen resource pool will be dedicated to a specific workload and ensures that your machine is best suited for the job.
Slash costs with a mixed-instance strategy: Using mixed instances helps achieve high availability and performance at a reasonable cost. It's basically about choosing from various instance types, some of which may be cheaper and good enough for lower-throughput or low-latency workloads. Or you could run a smaller number of machines with higher specs. This brings costs down because each node requires Kubernetes to be installed on it, which always adds a little overhead.
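If you prefer to manage these objects from code rather than YAML, a basic CPU-based HPA can be created with the official Kubernetes Python client. This is a hedged sketch that assumes a Deployment named "web" already exists in the default namespace and that local kubectl credentials are available:

```python
from kubernetes import client, config

def create_cpu_hpa() -> None:
    """Create an autoscaling/v1 HPA targeting 60% average CPU for the 'web' Deployment."""
    config.load_kube_config()  # or config.load_incluster_config() when running in a pod
    hpa = client.V1HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name="web-hpa", namespace="default"),
        spec=client.V1HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V1CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name="web"
            ),
            min_replicas=2,
            max_replicas=10,
            target_cpu_utilization_percentage=60,
        ),
    )
    client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
        namespace="default", body=hpa
    )

if __name__ == "__main__":
    create_cpu_hpa()
```

The same spec expressed as an autoscaling/v1 manifest and applied with kubectl achieves the identical result; the client is simply convenient when autoscaling settings are generated programmatically.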
2. Choose the Topology for Your GKE Cluster

You can choose from two types of clusters:

Regional topology: In a regional Kubernetes cluster, Google replicates the control plane and nodes across multiple zones in a single region.
Zonal topology: In a zonal cluster, they both run in a single compute zone specified upon cluster creation.

If your application depends on the availability of the cluster API, pick a regional cluster topology, which offers higher availability for the cluster's control plane API. Since it's the control plane that does jobs like scaling, replacing, and scheduling pods, if it becomes unavailable, you're in for reliability trouble. On the other hand, regional clusters have nodes spread across multiple zones, which may increase your cross-zone network traffic and, thus, costs.

3. Bin Pack Nodes for Maximum Utilization

This is a smart approach to GKE cost optimization shared by the engineering team at Delivery Hero. To maximize node utilization, it's best to add pods to nodes in a compacted way. This opens the door to reducing costs without any impact on performance. This strategy is called bin packing and goes against the Kubernetes default, which favors even distribution of pods across nodes. (Source: Delivery Hero)

The team at Delivery Hero used GKE Autopilot, but its limitations made the engineers build bin packing on their own. To achieve the highest node utilization, the team defines one or more node pools in a way that allows nodes to include pods in the most compacted way (while leaving some buffer for the shared CPU). By merging node pools and performing bin packing, pods fit into nodes more efficiently, helping Delivery Hero decrease the total number of nodes by ~60% in that team.

4. Implement Cost Monitoring

Cost monitoring is a big part of resource management because it lets you keep an eye on your expenses and instantly act on cost spike alerts. To understand your Google Kubernetes Engine costs better, implement a monitoring solution that gathers data about your cluster's workload, total cost, costs divided by labels or namespaces, and overall performance. GKE usage metering enables you to monitor resource usage, map workloads, and estimate resource consumption. Enable it to quickly identify the most resource-intensive workloads or spikes in resource consumption.

This step is the bare minimum you can do for cost monitoring. Tracking these three metrics is what really makes a difference in how you manage your cloud resources: daily cloud spend, cost per provisioned and requested CPU, and historical cost allocation.

5. Use Spot VMs

Spot VMs are an incredible cost-saving opportunity—you can get a discount of up to 91% off the pay-as-you-go pricing. The catch is that Google may reclaim the machine at any time, so you need to have a strategy in place to handle the interruption. That's why many teams use spot VMs for workloads that are fault- and interruption-tolerant, like batch processing jobs, distributed databases, CI/CD operations, or microservices.

Best Practices for Running Your GKE Cluster on Spot VMs

Choose the right spot VM: Pick a slightly less popular spot VM type—it's less likely to get interrupted. You can also check its frequency of interruption (the rate at which this instance reclaimed capacity within the trailing month).
Set up spot VM groups: This increases your chances of snatching the machines you want. Managed instance groups can request multiple machine types at the same time, adding new spot VMs when extra resources become available.

Security Best Practices for GKE Clusters

The Red Hat 2022 State of Kubernetes and Container Security report found that almost 70% of incidents happen due to misconfigurations. GKE secures your Kubernetes cluster in many layers, including the container image, its runtime, the cluster network, and access to the cluster API server. Google generally recommends implementing a layered approach to GKE cluster security.
Security Best Practices for GKE Clusters

The Red Hat 2022 State of Kubernetes and Container Security report found that almost 70% of incidents happen due to misconfigurations. GKE secures your Kubernetes cluster in many layers, including the container image, its runtime, the cluster network, and access to the cluster API server. Google generally recommends implementing a layered approach to GKE cluster security. The most important security aspects to focus on are:

Authentication and authorization
Control plane
Node
Network security

1. Follow CIS Benchmarks

All of the key security areas are covered by the Center for Internet Security (CIS) Benchmarks, a globally recognized collection of best practices that helps you structure your security efforts. When you use a managed service like GKE, you don’t have control over every CIS Benchmark item. But some things are definitely within your control, like auditing, upgrading, and securing the cluster nodes and workloads. You can either go through the CIS Benchmarks manually or use a tool that does the benchmarking job for you. We recently introduced a container security module that scans your GKE cluster for benchmark discrepancies and prioritizes issues to help you take action.

2. Implement RBAC

Role-Based Access Control (RBAC) is an essential component for managing access to your GKE cluster. It lets you establish more granular access to Kubernetes resources at the cluster and namespace levels and develop detailed permission policies. CIS GKE Benchmark 6.8.4 emphasizes that teams give preference to RBAC over the legacy Attribute-Based Access Control (ABAC). Another CIS GKE Benchmark (6.8.3) suggests using groups for managing users. This makes controlling identities and permissions simpler, and you don’t need to update the RBAC configuration whenever you add or remove users from a group. A minimal sketch of a group-scoped role binding follows below.
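To illustrate the group-based RBAC recommendation above, here is a minimal sketch that grants a Google Group read-only access to pods in a single namespace. The group address, namespace, and role name are assumptions, and binding Kubernetes roles to Google Groups requires the Google Groups for RBAC feature to be set up on the cluster.

YAML
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader                            # assumed role name
  namespace: payments                         # assumed namespace
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-readers
  namespace: payments
subjects:
  - kind: Group
    name: data-platform-viewers@example.com   # assumed Google Group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io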
3. Follow the Principle of Least Privilege

Make sure to grant user accounts only the privileges that are essential for them to do their jobs, and nothing more. CIS GKE Benchmark 6.2.1 states: prefer not running GKE clusters using the Compute Engine default service account. By default, nodes get access to the Compute Engine service account. That comes in handy for multiple applications but grants more permissions than necessary to run your GKE cluster. Create and use a minimally privileged service account instead of the default, and follow the same principle everywhere else.

4. Boost Your Control Plane’s Security

Google implements the Shared Responsibility Model to manage the GKE control plane components, but you’re still the one responsible for securing your nodes, containers, and pods. The Kubernetes API server uses a public IP address by default. You can secure it with the help of authorized networks and private Kubernetes clusters, which let you assign a private IP address. Another way to improve your control plane’s security is to perform regular credential rotation. The TLS certificates and cluster certificate authority get rotated automatically when you initiate the process.

5. Protect Node Metadata

CIS GKE Benchmarks 6.4.1 and 6.4.2 point out two critical factors that may compromise your node security, and they fall on your plate. Kubernetes deprecated the v0.1 and v1beta1 Compute Engine metadata server endpoints in 2020 because they didn’t enforce metadata query headers. Some attacks against Kubernetes clusters rely on access to the virtual machines’ metadata server to extract credentials. You can fight such attacks with workload identity or metadata concealment.

6. Upgrade GKE Regularly

Kubernetes often releases new security features and patches, so keeping your deployment up to date is a simple but powerful way to improve your security posture. The good news about GKE is that it patches and upgrades the control plane automatically. Node auto-upgrade also upgrades cluster nodes, and CIS GKE Benchmark 6.5.3 recommends keeping this setting on. If you want to disable auto-upgrade for any reason, Google suggests performing upgrades on a monthly basis and following the GKE security bulletins for critical patches.

Networking Optimization Tips for Your GKE Cluster

1. Avoid Overlaps With IP Addresses From Other Environments

When designing a larger Kubernetes cluster, keep in mind to avoid overlaps with IP addresses used in your other environments. Such overlaps might cause routing issues if you need to connect the cluster VPC network to on-premises environments or other cloud service provider networks via Cloud VPN or Cloud Interconnect.

2. Use GKE Dataplane V2 and Network Policies

If you want to control traffic flow at OSI layer 3 or 4 (the IP address or port level), you should consider using network policies. Network policies allow you to specify how a pod can communicate with other network entities (pods, services, certain subnets, etc.). To bring your cluster networking to the next level, GKE Dataplane V2 is the right choice. It’s based on eBPF and provides an extended, integrated network security and visibility experience. Adding to that, if the cluster uses GKE Dataplane V2, you don’t need to enable network policies explicitly, as Dataplane V2 manages service routing, network policy enforcement, and logging. A minimal network policy is sketched below.
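As a small illustration of the network policy discussion above, the sketch below only allows pods labeled app: frontend to reach the checkout pods on TCP port 8080 and blocks all other ingress to them; the labels, namespace, and port are assumptions for illustration.

YAML
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-checkout
  namespace: shop                      # assumed namespace
spec:
  podSelector:
    matchLabels:
      app: checkout                    # the pods this policy protects
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend            # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080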
3. Use Cloud DNS for GKE

Pod and Service DNS resolution can be handled without the additional overhead of managing a cluster-hosted DNS provider. Cloud DNS for GKE requires no extra monitoring, scaling, or other management activities, as it’s a fully hosted Google service.

Conclusion

In this article, you have learned how to optimize your GKE cluster with fourteen tactics across resource management, security, and networking for high availability and optimal cost. Hopefully, you have taken away some helpful information that will help you in your career as a developer.
By Narunas Kapocius
Configure Kubernetes Health Checks

Kubernetes is an open-source container orchestration platform that helps manage and deploy applications in a cloud environment. It is used to automate the deployment, scaling, and management of containerized applications. Kubernetes probes are an efficient way to manage application health on the platform. This article will discuss Kubernetes probes, the different types available, and how to implement them in your Kubernetes environment.

What Are Kubernetes Probes?

Kubernetes probes are health checks used to monitor the health of applications and services in a Kubernetes cluster. They detect potential problems with applications or services and identify potential resource bottlenecks. Probes are configured to run at regular intervals and send a signal to the Kubernetes control plane if they detect any issues with the application or service.

Kubernetes probes are typically implemented using the Kubernetes API, which allows them to query the application or service for information. This information can then be used to determine the application’s or service’s health. Kubernetes probes can also be used to detect changes in the application or service and send a notification to the Kubernetes control plane, which can then take corrective action.

Kubernetes probes are an important part of the Kubernetes platform, as they help ensure applications and services run smoothly. They can detect potential problems before they become serious, allowing you to take corrective action quickly.

A successful readiness probe indicates the container is ready to receive traffic. If a readiness probe succeeds, the container is considered ready and can begin receiving requests from other containers, services, or external clients. A successful liveness probe indicates the container is still running and functioning properly. If a liveness probe succeeds, the container is considered alive and healthy; if it fails, the container is considered to be in a failed state, and Kubernetes will attempt to restart the container to restore its functionality. Both readiness and liveness probes report success when an HTTP check returns a response code in the 200-399 range or when a TCP socket connection succeeds. If the probe fails, it returns an HTTP response code outside that range or a failed TCP connection, indicating that the container is not ready or not alive. In short, a successful probe indicates the container is ready to receive traffic or is still running and functioning properly, depending on the probe type.

Types of Kubernetes Probes

There are three types of probes:

Startup probes
Readiness probes
Liveness probes

1. Startup Probes

A startup probe is used to determine whether a container has started successfully. This type of probe is typically used for applications that take longer to start up, or for containers that perform initialization tasks before they become ready to receive traffic. The startup probe runs first, after the container has been created, and it delays the start of the readiness and liveness probes until it succeeds. If the startup probe fails, the container is considered to have failed to start, and Kubernetes will attempt to restart it.

2. Readiness Probes

A readiness probe is used to determine whether a container is ready to receive traffic. This type of probe ensures a container is fully up and running and can accept incoming connections before it is added to the service load balancer.
A readiness probe can be used to check the availability of an application’s dependencies or to perform any other check that indicates the container is ready to serve traffic. If the readiness probe fails, the container is removed from the service load balancer until the probe succeeds again.

3. Liveness Probes

A liveness probe is used to determine whether a container is still running and functioning properly. This type of probe is used to detect and recover from container crashes or hang-ups. A liveness probe can check the responsiveness of an application or perform any other check that indicates the container is still alive and healthy. If the liveness probe fails, Kubernetes will attempt to restart the container to restore its functionality.

Each type of probe has its own configuration options, such as the endpoint to check, the probe interval, and the success and failure thresholds. By using these probes, Kubernetes can ensure containers are running and healthy and can take appropriate action if a container fails to respond. The snippet below sketches how the three probe types look side by side on a single container.
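Here is a minimal sketch of the three probe types declared together on one container. The image, paths, ports, and timing values are assumptions for illustration only and are not taken from the tutorial that follows.

YAML
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                 # assumed name
spec:
  containers:
    - name: app
      image: nginx:1.25            # placeholder image that serves "/" on port 80
      ports:
        - containerPort: 80
      startupProbe:                # gives a slow-starting app up to 30 x 5s to come up
        httpGet:
          path: /
          port: 80
        failureThreshold: 30
        periodSeconds: 5
      readinessProbe:              # gates traffic from Services
        httpGet:
          path: /
          port: 80
        periodSeconds: 10
      livenessProbe:               # restarts the container if it stops responding
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 15
        periodSeconds: 20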
How To Implement Kubernetes Probes

Kubernetes probes can be implemented in a few different ways:

The first way is to use the Kubernetes API to query the application or service for information. This information can then be used to determine the application’s or service’s health.
The second way is to use the HTTP protocol to send a request to the application or service. This request can be used to detect whether an application or service is responsive or is taking too long to respond.
The third way is to use custom probes to detect specific conditions in an application or service. Custom probes can detect things such as resource usage, slow responses, or changes in the application or service.

Once you have decided which type of probe you will be using, you can configure the probe using the Kubernetes API. You can specify the frequency of the probe, the type of probe, and its parameters. Once the probe is configured, you can deploy it to the Kubernetes cluster.

Today, I’ll show how to configure health checks for an application deployed on Kubernetes, using the HTTP protocol to check whether the application is ready, live, and starting as per our requirements.

Prerequisites

A Kubernetes cluster from any cloud provider. You can even use Minikube or Kind to create a single-node cluster.
Docker Desktop to containerize the application.
Docker Hub to push the container image to the Docker registry.
Node.js installed, as we will use a sample Node.js application.

Tutorial

Fork the sample application here. Get into the main application folder with the command:

cd Kubernetes-Probes-Tutorial

Install the dependencies with the command:

npm install

Run the application locally using the command:

node app.js

You should see the application running on port 3000. In the application folder, you should see the Dockerfile with the following content:

# Use an existing node image as base image
FROM node:14-alpine

# Set the working directory in the container
WORKDIR /app

# Copy package.json and package-lock.json to the container
COPY package*.json ./

# Install required packages
RUN npm install

# Copy all files to the container
COPY . .

# Expose port 3000
EXPOSE 3000

# Start the application
CMD [ "npm", "start" ]

This Dockerfile is used to create a container image of our application and push it to Docker Hub. Next, build and push your image to Docker Hub using the following command:

docker buildx build --platform=linux/arm64 --platform=linux/amd64 -t docker.io/<Docker Hub username>/<image name>:<tag> --push -f ./Dockerfile .

You can see the pushed image on your Docker Hub account under repositories.

Next, deploy the manifest files. In the application folder, you will notice a deployment.yaml file with health checks/probes included, such as readiness and liveness probes. Note: we have used our pushed image name in the YAML file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: notes-app-deployment
  labels:
    app: note-sample-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: note-sample-app
  template:
    metadata:
      labels:
        app: note-sample-app
    spec:
      containers:
        - name: note-sample-app-container
          image: pavansa/note-sample-app
          resources:
            requests:
              cpu: "100m"
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: /
              port: 3000
          livenessProbe:
            httpGet:
              path: /
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5

You can see the image used and the health checks configured in the above YAML file. We are all set with our YAML file.

Assuming you have a running cluster ready, let’s deploy the above-mentioned manifest file with the command:

kubectl apply -f deployment.yaml

You should see the successful deployment of the file: “deployment.apps/notes-app-deployment created.”

Let’s check the pod status with the following command to make sure the pods are running:

kubectl get pods

Let’s describe a pod using the following command:

kubectl describe pod notes-app-deployment-7fb6f5d74b-hw5fn

You can see the “Liveness” and “Readiness” entries when you describe the pods. Next, let’s check the events section. You can see the different events, such as “scheduled,” “pulled,” “created,” and “started.” All the pod events were successful.

Conclusion

Kubernetes probes are an important part of the Kubernetes platform, as they help ensure applications and services run smoothly. They can be used to detect potential problems before they become serious, allowing you to take corrective action quickly. Kubernetes probes come in three types: startup probes, readiness probes, and liveness probes, along with custom probes that can be used to detect specific conditions in an application or service. Implementing Kubernetes probes is a straightforward process that can be done using the Kubernetes API. If you are looking for a way to ensure the health of your applications and services, Kubernetes probes are the way to go. So, make sure to implement Kubernetes probes in your Kubernetes environment today!

By Pavan Belagatti CORE
How To Collect and Ship Windows Events Logs With OpenTelemetry

If you use Windows, you will want to monitor Windows Events. A recently contributed receiver in a distribution of the OpenTelemetry (OTel) Collector makes it much easier to monitor Windows Events with OTel. You can utilize this receiver with any OTel collector distribution, including the upstream OpenTelemetry Collector. In this article, we will be using observIQ’s distribution of the collector. Below are steps to get up and running quickly with the distribution. We will be shipping Windows Event logs to a popular backend: Google Cloud Ops. You can find out more on the GitHub page here.

What Signals Matter?

Windows Event logs record many different operating system processes, application activity, and account activity. Some relevant log types you will want to monitor include:

Application status: contains information about applications installed or running on the system. If an application crashes, these logs may contain an explanation for the crash.
Security logs: contain information about the system’s audit and authentication processes, for example, when a user attempts to log into the system or use administrator privileges.
System logs: contain information about Windows-specific processes, such as driver activity.

All of the above categories can be gathered with the Windows Events receiver, so let’s get started.

Before You Begin

If you don’t already have an OpenTelemetry collector built with the latest Windows Events receiver installed, you’ll need to do that first. The distribution of the OpenTelemetry Collector we’re using today includes the Windows Events receiver (and many others) and can be installed with the one-line installer here.

For Linux

To install using the installation script, run:

sudo sh -c "$(curl -fsSlL https://github.com/observiq/observiq-otel-collector/releases/latest/download/install_unix.sh)" install_unix.sh

To install directly with the appropriate package manager, head here.

For Windows

To install the collector on Windows, run the PowerShell command below to install the MSI with no UI:

PowerShell
msiexec /i "https://github.com/observIQ/observiq-otel-collector/releases/latest/download/observiq-otel-collector.msi" /quiet

Alternatively, for an interactive installation, download the latest MSI. After downloading the MSI, double-click the download to open the installation wizard and follow the instructions to configure and install the collector. For more installation information, see installing on Windows.

For macOS

To install using the installation script, run:

sudo sh -c "$(curl -fsSlL https://github.com/observiq/observiq-otel-collector/releases/latest/download/install_macos.sh)" install_macos.sh

For more installation guidance, see installing on macOS.

For Kubernetes

To deploy the collector on Kubernetes, further documentation can be found in our observiq-otel-collector-k8s repository.

Configuring the Windows Events Receiver

Now that the distribution is installed, navigate to your OpenTelemetry configuration file.
If you’re using the observIQ Collector, you’ll find it at the following location:

C:\Program Files\observIQ OpenTelemetry Collector\config.yaml (Windows)

Edit the configuration file to include the Windows Events receiver as shown below:

YAML
receivers:
  windowseventlog:
    channel: application

You can edit the specific output by adding or editing the following fields directly below the receiver name and channel:

{
  "channel": "Application",
  "computer": "computer name",
  "event_id": {
    "id": 10,
    "qualifiers": 0
  },
  "keywords": "[Classic]",
  "level": "Information",
  "message": "Test log",
  "opcode": "Info",
  "provider": {
    "event_source": "",
    "guid": "",
    "name": "otel"
  },
  "record_id": 12345,
  "system_time": "2022-04-15T15:28:08.898974100Z",
  "task": ""
}

Configuring the Log Fields

You can adjust the following fields in the configuration to control which logs you ship and how they are read:

channel (required): The Windows event log channel to monitor.
max_reads (default: 100): The maximum number of records read into memory before beginning a new batch.
start_at (default: end): On first startup, where to start reading logs from. Options are beginning or end.
poll_interval (default: 1s): The interval at which the channel is checked for new log entries. This check begins again after all new bodies have been read.
attributes (default: {}): A map of key: value pairs to add to the entry’s attributes.
resource (default: {}): A map of key: value pairs to add to the entry’s resource.
operators (default: []): An array of operators. See below for more details.
converter (default: {max_flush_count: 100, flush_interval: 100ms, worker_count: max(1, runtime.NumCPU()/4)}): A map of key: value pairs to configure the entry.Entry to pdata.LogRecord converter.

Operators

Each operator performs a simple responsibility, such as parsing a timestamp or JSON. Chain operators together to process logs into the desired format:

Every operator has a type.
Every operator can be given a unique id. If you use the same type of operator more than once in a pipeline, you must specify an id. Otherwise, the id defaults to the value of type.
Operators output to the next operator in the pipeline. The last operator in the pipeline emits from the receiver. Optionally, the output parameter can be used to specify the id of another operator to which logs will be passed directly.
Only parsers and general-purpose operators should be used.

A short sketch of a small operator chain follows below.
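As a sketch of the operator chaining described above, the snippet below drops informational events and tags whatever remains. The expression and the body.level field name are assumptions about how the parsed Windows event is laid out, so treat this as a starting point rather than a drop-in configuration.

YAML
receivers:
  windowseventlog:
    channel: application
    operators:
      # Drop entries whose parsed level is "Information" to cut noise
      - type: filter
        expr: 'body.level == "Information"'
      # Tag the remaining entries so they are easy to find in the backend
      - type: add
        field: attributes.log_source
        value: windows_application_events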
Conclusion

As you can see, this distribution makes it much simpler to work with the OpenTelemetry collector, with a single-line installer, integrated receivers, exporters, and a processor pool, and it will help you implement OpenTelemetry standards wherever they are needed in your systems.
By Paul Stefanski
How to Monitor Apache Flink With OpenTelemetry

Apache Flink monitoring support is now available in the open-source OpenTelemetry collector. You can check out the OpenTelemetry repo here! You can utilize this receiver with any OTel collector distribution, including the upstream OpenTelemetry Collector. Today we'll use observIQ’s OpenTelemetry distribution and ship Apache Flink telemetry to a popular backend: Google Cloud Ops. You can find out more on the GitHub page: https://github.com/observIQ/observiq-otel-collector

What Signals Matter?

Apache Flink is an open-source, unified batch-processing and stream-processing framework. The Apache Flink receiver records 29 unique metrics, so there is a lot of data to pay attention to. Some specific metrics that users find valuable are:

Uptime and restarts: two metrics that record the duration a job has run uninterrupted and the number of full restarts a job has committed, respectively.
Checkpoints: a number of checkpoint metrics can tell you the number of active checkpoints, the number of completed and failed checkpoints, and the duration of ongoing and past checkpoints.
Memory usage: memory-related metrics are often relevant to monitor. The Apache Flink receiver ships metrics that can tell you about total memory usage, both present and over time, minimums and maximums, and how the memory is divided between different processes.

All of the above categories can be gathered with the Apache Flink receiver, so let’s get started.

Before You Begin

If you don’t already have an OpenTelemetry collector built with the latest Apache Flink receiver installed, you’ll need to do that first. The collector distribution we're using includes the Apache Flink receiver (and many others) and is simple to install with a one-line installer.

Configuring the Apache Flink Receiver

Navigate to your OpenTelemetry configuration file. If you’re following along, you’ll find it in the following location:

/opt/observiq-otel-collector/config.yaml (Linux)

Edit the configuration file to include the Apache Flink receiver as shown below:

YAML
receivers:
  flinkmetrics:
    endpoint: http://localhost:8081
    collection_interval: 10s
processors:
  resourcedetection:
    # adds a unique host.name to the metric resource(s)
    detectors: [system]
exporters:
  nop:
    # Add the exporter for your preferred destination(s)
service:
  pipelines:
    metrics:
      receivers: [flinkmetrics]
      processors: [resourcedetection]
      exporters: [nop]

If you’re using the Google Ops Agent instead, you can find the relevant config file here.
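Since the stated goal is to land these metrics in Google Cloud Ops, here is a hedged sketch of the same pipeline with the nop exporter replaced by the googlecloud exporter. It assumes the collector runs with Application Default Credentials that are allowed to write to Cloud Monitoring, and the detector list is an assumption as well.

YAML
receivers:
  flinkmetrics:
    endpoint: http://localhost:8081
    collection_interval: 10s
processors:
  resourcedetection:
    detectors: [gcp, system]     # populate host.name and GCP resource attributes
exporters:
  googlecloud: {}                # uses Application Default Credentials by default
service:
  pipelines:
    metrics:
      receivers: [flinkmetrics]
      processors: [resourcedetection]
      exporters: [googlecloud]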
Viewing the Metrics Collected

If you followed the steps detailed above, the following Apache Flink metrics will now be delivered to your preferred destination.

flink.jvm.cpu.load: The CPU usage of the JVM for a jobmanager or taskmanager.
flink.jvm.cpu.time: The CPU time used by the JVM for a jobmanager or taskmanager.
flink.jvm.memory.heap.used: The amount of heap memory currently used.
flink.jvm.memory.heap.committed: The amount of heap memory guaranteed to be available to the JVM.
flink.jvm.memory.heap.max: The maximum amount of heap memory that can be used for memory management.
flink.jvm.memory.nonheap.used: The amount of non-heap memory currently used.
flink.jvm.memory.nonheap.committed: The amount of non-heap memory guaranteed to be available to the JVM.
flink.jvm.memory.nonheap.max: The maximum amount of non-heap memory that can be used for memory management.
flink.jvm.memory.metaspace.used: The amount of memory currently used in the Metaspace memory pool.
flink.jvm.memory.metaspace.committed: The amount of memory guaranteed to be available to the JVM in the Metaspace memory pool.
flink.jvm.memory.metaspace.max: The maximum amount of memory that can be used in the Metaspace memory pool.
flink.jvm.memory.direct.used: The amount of memory used by the JVM for the direct buffer pool.
flink.jvm.memory.direct.total_capacity: The total capacity of all buffers in the direct buffer pool.
flink.jvm.memory.mapped.used: The amount of memory used by the JVM for the mapped buffer pool.
flink.jvm.memory.mapped.total_capacity: The number of buffers in the mapped buffer pool.
flink.memory.managed.used: The amount of managed memory currently used.
flink.memory.managed.total: The total amount of managed memory.
flink.jvm.threads.count: The total number of live threads.
flink.jvm.gc.collections.count: The total number of collections that have occurred.
flink.jvm.gc.collections.time: The total time spent performing garbage collection.
flink.jvm.class_loader.classes_loaded: The total number of classes loaded since the start of the JVM.
flink.job.restart.count: The total number of restarts since this job was submitted, including full restarts and fine-grained restarts.
flink.job.last_checkpoint.time: The end-to-end duration of the last checkpoint.
flink.job.last_checkpoint.size: The total size of the last checkpoint.
flink.job.checkpoint.count: The number of checkpoints completed or failed.
flink.job.checkpoint.in_progress: The number of checkpoints in progress.
flink.task.record.count: The number of records a task has.
flink.operator.record.count: The number of records an operator has.
flink.operator.watermark.output: The last watermark this operator has emitted.

This OpenTelemetry collector can help companies looking to implement OpenTelemetry standards.

By Jonathan Wamsley
Observability-Driven Development vs Test-Driven Development

The concept of observability involves understanding a system’s internal states through the examination of logs, metrics, and traces. This approach provides a comprehensive system view, allowing for thorough investigation and analysis. While incorporating observability into a system may seem daunting, the benefits are significant. One well-known example is PhonePe, which experienced 2000% growth in its data infrastructure and a 65% reduction in data management costs after implementing a data observability solution. This helped mitigate performance issues and minimize downtime.

The impact of Observability-Driven Development (ODD) is not limited to PhonePe. Numerous organizations have experienced the benefits of ODD, with a 2.1 times higher likelihood of issue detection and a 69% improvement in mean time to resolution.

What Is ODD?

Observability-Driven Development (ODD) is an approach that shifts observability left, to the earliest stage of the software development life cycle. It uses trace-based testing as a core part of the development process. In ODD, developers write code while declaring the desired outputs and the specifications needed to view the system’s internal state and processes. It applies both at the component level and to the system as a whole. ODD also serves to standardize instrumentation across programming languages, frameworks, SDKs, and APIs.

What Is TDD?

Test-Driven Development (TDD) is a widely adopted software development methodology that emphasizes writing automated tests prior to coding. The TDD process involves defining the desired behavior of software through the creation of a test case, running the test to confirm its failure, writing the minimum necessary code to make the test pass, and refining the code through refactoring. This cycle is repeated for each new feature or requirement, and the resulting tests serve as a safeguard against potential future regressions.

The philosophy behind TDD is that writing tests compels developers to consider the problem at hand and produce focused, well-structured code. Adherence to TDD improves software quality and requirement compliance and facilitates the early detection and correction of bugs. TDD is recognized as an effective method for enhancing the quality, reliability, and maintainability of software systems.

Comparison of Observability-Driven and Test-Driven Development

Similarities

Observability-Driven Development (ODD) and Test-Driven Development (TDD) both strive to enhance the quality and reliability of software systems. Both methodologies aim to ensure that software operates as intended, minimizing downtime and user-facing issues while promoting a commitment to continuous improvement and monitoring.

Differences

Focus: The focus of ODD is to continuously monitor the behavior of software systems and their components in real time to identify potential issues and understand system behavior under different conditions. TDD, on the other hand, prioritizes detecting and correcting bugs before they cause harm to the system or users and verifies that software functionality meets requirements.

Time and resource allocation: Implementing ODD requires a substantial investment of time and resources for setting up monitoring and logging tools and infrastructure. TDD, in contrast, demands a significant investment of time and resources during the development phase for writing and executing tests.
Impact on software quality: ODD can significantly impact software quality by providing real-time visibility into system behavior, enabling teams to detect and resolve issues before they escalate. TDD also has the potential to significantly impact software quality by detecting and fixing bugs before they reach production. However, if tests are not comprehensive, bugs may still evade detection, potentially affecting software quality.

Moving From TDD to ODD in Production

Moving from a Test-Driven Development (TDD) methodology to an Observability-Driven Development (ODD) approach is a significant change for a software organization. For several years, TDD has been the established method for testing software before its release to production. While TDD provides consistency and accuracy through repeated tests, it cannot provide insight into the performance of the entire application or the customer experience in a real-world scenario. The tests conducted through TDD are isolated and do not guarantee the absence of errors in the live application. Furthermore, TDD relies on a consistent environment for conducting automated tests, which is not representative of real-world conditions.

Observability, on the other hand, is an evolved version of TDD that offers full-stack visibility into the infrastructure, application, and production environment. It identifies the root cause of issues affecting the user experience and product releases through telemetry data such as logs, traces, and metrics. This continuous monitoring and tracking helps predict the end user’s perception of the application. Additionally, with observability it is possible to write and ship better code before it reaches source control, because observability is part of the set of tools, processes, and culture.

Best Practices for Implementing ODD

Here are some best practices for implementing Observability-Driven Development (ODD):

Prioritize observability from the outset: incorporate observability considerations into the development process right from the beginning. This will help you identify potential issues early and make necessary changes in real time.
Embrace an end-to-end approach: ensure observability covers all aspects of the system, including the infrastructure, application, and end-user experience.
Monitor and log everything: gather data from all sources, including logs, traces, and metrics, to get a complete picture of the system’s behavior.
Use automated tools: utilize automated observability tools to monitor the system in real time and alert you to any anomalies.
Collaborate with other teams: work with teams such as DevOps, QA, and production to ensure observability is integrated into the development process.
Continuously monitor and improve: regularly monitor the system, analyze data, and make improvements as needed to ensure optimal performance.
Embrace a culture of continuous improvement: encourage the development team to keep monitoring and improving the system over time.

Conclusion

Both Observability-Driven Development (ODD) and Test-Driven Development (TDD) play an important role in ensuring the quality and reliability of software systems. TDD focuses on detecting and fixing bugs before they can harm the system or its users, while ODD focuses on monitoring the behavior of the software system in real time to identify potential problems and understand its behavior in different scenarios. Did I miss anything important?
Let me know in the comments section below.

By Hiren Dhaduk
Chaos Data Engineering Manifesto: 5 Laws for Successful Failures

It's midnight in the dim and cluttered office of The New York Times, currently serving as the "situation room." A powerful surge of traffic is inevitable. During every major election, the wave would crest and crash against our overwhelmed systems before receding, allowing us to assess the damage. We had been in the cloud for years, which helped some. Our main systems would scale (our articles were always served), but integration points across backend services would eventually buckle and burst under the sustained pressure of insane traffic levels. However, this night in 2020 differed from similar election nights in 2014, 2016, and 2018. That's because this traffic surge was simulated, and an election wasn't happening.

Pushing to the Point of Failure

Simulation or not, this was prod, so the stakes were high. There was suppressed horror as J-Kidd, our system that brought ad targeting parameters to the front end, went down hard. It was as if all the ligaments had been ripped from the knees of the pass-first point guard for which it had been named. Ouch. I'm sorry, Jason; it was for the greater good.

J-Kidd wasn't the only system that found its way to the disabled list. That was the point of the whole exercise: to push our systems until they failed. We succeeded. Or failed, depending on your point of view. The next day the team made adjustments. We decoupled systems, implemented failsafes, and returned to the court for game 2. As a result, the 2020 election was the first I can remember where the on-call engineers weren't on the edge of their seats, white-knuckling their keyboards... at least not for system reliability reasons.

Pre-Mortems and Chaos Engineering

We referred to that exercise as a "premortem." Its conceptual roots can be traced back to the idea of chaos engineering introduced by site reliability engineers. For those unfamiliar, chaos engineering is a disciplined methodology for intentionally introducing points of failure within systems to better understand their thresholds and improve resilience. It was largely popularized by the success of Netflix's Simian Army, a suite of programs that would automatically introduce chaos by removing servers and regions and introducing other points of failure into production. All in the name of reliability and resiliency.

While this idea isn't completely foreign to data engineering, it can certainly be described as an extremely uncommon practice. No data engineer in their right mind has looked at their to-do list, the unfilled roles on their team, and the complexity of their pipelines, and then said: "This needs to be harder. Let's introduce some chaos."

That may be part of the problem. Data teams need to think beyond providing snapshots of data quality to the business and start thinking about how to build and maintain reliable data systems at scale. We cannot afford to overlook data quality management, as it plays an increasingly large role in critical operations. For example, just this year we witnessed how a deleted file and an out-of-sync legacy database could ground more than 4,000 flights.

Of course, you can't just copy and paste software engineering concepts straight into data engineering playbooks. Data is different. DataOps tweaks DevOps methodology, just as data observability adapts observability. Consider this manifesto a proposal for taking the proven concepts of chaos engineering and applying them to the eccentric world of data reliability.
The 5 Laws of Data Chaos Engineering

The principles and lessons of chaos engineering are a good place to start defining the contours of a data chaos engineering discipline. Our first law combines two of the most important.

1. Have a Bias for Production, But Minimize the Blast Radius

There is a maxim among site reliability engineers that will ring true for every data engineer who has had the pleasure of the same SQL query returning two different results across staging and production environments. That is, "Nothing acts like prod except for prod." To that, I would add "production data too." Data is just too creative and fluid for humans to anticipate. Synthetic data has come a long way, and don't get me wrong, it can be a piece of the puzzle, but it's unlikely to simulate key edge cases.

Like me, the mere thought of introducing points of failure into production systems probably makes your stomach churn. It's terrifying. Some data engineers justifiably wonder, "Is this even necessary within a modern data stack where so many tools abstract the underlying infrastructure?" I'm afraid so. Remember, as the opening anecdote and J-Kidd's snapped ligaments illustrated, the elasticity of the cloud is not a cure-all. In fact, it's that abstraction and opacity, along with the multiple integration points, that makes it so important to stress test a modern data stack. An on-premises database may be more limiting, but data teams tend to understand its thresholds because they hit them more regularly during day-to-day operations.

Let's move past the philosophical objections for the moment and dive into the practical. Data is different. Introducing fake data into a system won't be helpful because the input changes the output, and it's going to get really messy too. That's where the second part of the law comes into play: minimize the blast radius. There is a spectrum of chaos, and a spectrum of tools, that can be used:

In words only: "let's say this failed; what would we do?"
Synthetic data in production.
Techniques like data diff, which let you test snippets of SQL code on production data.
Solutions like LakeFS, which let you do this at a bigger scale by creating "chaos branches," or complete snapshots of your production environment, where you can use production data with complete isolation.
Doing it in prod and practicing your backfilling skills. After all, nothing acts like prod but prod.

Starting with less chaotic scenarios is probably a good idea and will help you understand how to minimize the blast radius in production. Deep diving into real production incidents is also a great place to start. But does everyone really understand what exactly happened? Production incidents are chaos experiments that you've already paid for, so make sure you are getting the most out of them. Mitigating the blast radius may also include strategies like backing up applicable systems or having a data observability or data quality monitoring solution in place to assist with the detection and resolution of data incidents.

2. Understand It's Never a Perfect Time (Within Reason)

Another chaos engineering principle holds to observe and understand "steady state behavior." There is wisdom in this principle, but it is also important to understand that the field of data engineering isn't quite ready to be measured by the standard of "five nines," or 99.999% uptime. Data systems are constantly in flux, and there is a wider range of "steady state" behavior.
As a result, there will be a temptation to delay the introduction of chaos until you've reached the mythical point of "readiness." Unfortunately, you can't out-architect bad data; no one is ever ready for chaos. The Silicon Valley cliche of failing fast is applicable here. Or, to paraphrase Reid Hoffman, if you aren't embarrassed by the results of your first post-mortem, fire drill, or chaos-introducing event, you introduced it too late. Introducing fake data incidents while you are dealing with real ones may seem silly. Still, ultimately this can help you get ahead by better understanding where you have been putting band-aids on larger issues that may need to be refactored.

3. Formulate Hypotheses and Identify Variables at the System, Code, and Data Levels

Chaos engineering encourages forming hypotheses about how systems will react in order to understand what thresholds to monitor. It also encourages leveraging or mimicking past real-world incidents or likely incidents. We'll dive deeper into the details in the next article, but the important modification here is to ensure these span the system, code, and data levels. Variables at each level can create data incidents. Some quick examples:

System: You didn't have the right permissions set in your data warehouse.
Code: A bad LEFT JOIN.
Data: A third party sent you garbage columns with a bunch of NULLs.

Simulating increased traffic levels and shutting down servers both impact data systems, and those are important tests, but don't neglect some of the more unique and fun ways data systems can break badly.

4. Everyone in One Room (Or at Least on One Zoom Call)

This law is based on the experience of my colleague, site reliability engineer and chaos practitioner Tim Tischler. "Chaos engineering is just as much about people as it is about systems. They evolve together, and they can't be separated. Half of the value from these exercises comes from putting all the engineers in a room and asking, 'What happens if we do X or if we do Y?' You are guaranteed to get different answers. Once you simulate the event and see the result, everyone's mental maps are aligned. That is incredibly valuable," he said.

Also, the interdependence of data systems and responsibilities creates blurry lines of ownership, even on the most well-run teams. As a result, breaks often happen, and are overlooked, in those overlaps and gaps in responsibility where the data engineer, analytical engineer, and data analyst point at each other. In many organizations, the product engineers creating the data and the data engineers managing it are separated and siloed by team structures. They also often have different tools and models of the same system and data. Feel free to pull these product engineers in as well, especially when the data has been generated by internally built systems. Good incident management and triage can often involve multiple teams, and having everyone in one room can make the exercise more productive.

I'll also add from personal experience that these exercises can be fun (in the same weird way putting all your chips on red is fun). I'd encourage data teams to consider a chaos data engineering fire drill or pre-mortem event at their next offsite. It makes for a much more practical team bonding exercise than getting out of an escape room.

5. Hold Off on the Automation for Now

Truly mature chaos engineering programs like Netflix's Simian Army are automated and even unscheduled.
While this may create a more accurate simulation, the reality is that such automated tools don't currently exist for data engineering. Furthermore, if they did, I'm not sure I would be brave enough to use them. To this point, one of the original Netflix chaos engineers has described how they didn't always use automation, as the chaos could create more problems than they could fix in a reasonable period (especially in collaboration with those running the system). Given data engineering's current stage of reliability evolution and the greater potential for an unintentionally large blast radius, I would recommend data teams lean toward scheduled, carefully managed events.

Practice as You Play

The important takeaway from chaos engineering is that practice and simulations are vital to performance and reliability. In my next article, I'll discuss specific things that can be broken at the system, code, and data levels, and what teams may find out about those systems by pushing them to their limits.

By Shane Murray

Top Performance Experts


Joana Carvalho

Performance Engineer,
Postman


Greg Leffler

Observability Practitioner, Director,
Splunk

Greg Leffler heads the Observability Practitioner team at Splunk and is on a mission to spread the good word of Observability to the world. Greg's career has taken him from the NOC to SRE, from SRE to management, with side stops in security and editorial functions. In addition to Observability, Greg's professional interests include hiring, training, SRE culture and operating effective remote teams. Greg holds a Master's Degree in Industrial/Organizational Psychology from Old Dominion University.

Ted Young

Director of Open Source Development,
LightStep


Eric D. Schabell

Director Technical Marketing & Evangelism,
Chronosphere

Eric is Chronosphere's Director of Evangelism. He's renowned in the development community as a speaker, lecturer, author, and baseball expert. His current role allows him to help the world understand the challenges they are facing with cloud native observability. He brings a unique perspective to the stage, with a professional life dedicated to sharing his deep expertise in open source technologies and organizations. Follow him at https://www.schabell.org.

The Latest Performance Topics

Deploying Prometheus and Grafana as Applications using ArgoCD — Including Dashboards
Goodbye to the headaches of manual infrastructure management, and hello to a more efficient and scalable approach with ArgoCD.
March 31, 2023
by lidor ettinger
· 823 Views · 1 Like
How to Monitor TextView Changes in Android
In this tutorial, we will see how to monitor text changes in an Android TextView or EditText.
March 30, 2023
by Nilanchala Panigrahy
· 5,660 Views · 0 Likes
Unlock Customer Behavior With Time Series Analysis
A statistical method called time series analysis is used to examine and evaluate data that has been gathered over time.
March 30, 2023
by Prasanna Chitanand
· 779 Views · 1 Like
gRPC vs REST: Differences, Similarities, and Why to Use Them
This article compares gRPC and REST client-server architectures for communication and examines their strengths and weaknesses.
March 29, 2023
by Shay Bratslavsky
· 4,986 Views · 1 Like
Getting Started With Prometheus Workshop: Introduction to the Query Language
Interested in open-source observability? Learn about Prometheus Query Language and how to set up a demo project to provide more realistic data for querying.
March 29, 2023
by Eric D. Schabell CORE
· 1,476 Views · 2 Likes
Overcoming Challenges and Best Practices for Data Migration From On-Premise to Cloud
This article discusses the challenges and best practices of data migration when transferring on-premise data to the cloud.
March 29, 2023
by srinivas Venkata
· 1,922 Views · 2 Likes
Redefining the Boundaries of People, Process, and Platforms
Kelsey Hightower shares his thoughts on the future of developers and engineers.
March 29, 2023
by Tom Smith CORE
· 2,409 Views · 1 Like
What are Hub, Switch, Router, and Modem?
In this article, we will discuss hub, switch, router, and modem, their integral functions, and how they differ from one another.
March 29, 2023
by Aditya Bhuyan
· 2,267 Views · 2 Likes
7 Ways for Better Collaboration Among Your Testers and Developers
Collaboration between developers and testers is crucial to delivering your web application on time. Read on and find out seven ways to achieve it. (Psst... look out for #4.)
March 28, 2023
by Praveen Mishra
· 1,403 Views · 2 Likes
Legacy Application Refactoring: How To Revive Your Aging Software
In this legacy application refactoring guide, readers will discover the benefits of refactoring and how to identify what parts of your software need it.
March 28, 2023
by Tejas Kaneriya
· 1,551 Views · 1 Like
OpenShift Container Platform 3.11 Cost Optimization on Public Cloud Platforms
A developer gives a tutorial on optimizing the performance of OpenShift containers using some shell scripts. Read on to learn more!
March 28, 2023
by Ganesh Bhat
· 9,341 Views · 3 Likes
How Can Digital Testing Help in the Product Roadmap
This article explains the importance of a product roadmap and how digital experience testing can help in creating the product roadmap.
March 28, 2023
by Anusha K
· 2,314 Views · 1 Like
Detecting Network Anomalies Using Apache Spark
Apache Spark provides a powerful platform for detecting network anomalies using big data processing and machine learning techniques.
March 28, 2023
by Rama Krishna Panguluri
· 2,699 Views · 1 Like
How Agile Architecture Spikes Are Used in Shift-Left BDD
In agile methodologies, an architecture spike usually refers to a software development technique that originates in the extreme programming offshoot of agile.
March 27, 2023
by Mirza Sisic
· 1,811 Views · 1 Like
Assessment of Scalability Constraints (and Solutions)
Scaling in the age of serverless and microservices is very different than it was a decade ago. Explore practical advice for overcoming scalability challenges.
March 27, 2023
by Shai Almog CORE
· 3,258 Views · 3 Likes
The Evolution of Incident Management from On-Call to SRE
Incident management has evolved considerably over the last couple of decades.
March 26, 2023
by Vardhan NS
· 2,221 Views · 1 Like
MongoDB Time Series Benchmark and Review
This article compares QuestDB with MongoDB. We look at the two databases in terms of benchmark performance and user experience.
March 24, 2023
by Amy Wang
· 2,616 Views · 1 Like
Loop Device in Linux
Learn how to access the contents inside a new Linux distribution ISO image prior to repartitioning your disk and installing the operating system onto your local disk.
March 24, 2023
by Priyanka Nawalramka
· 2,133 Views · 1 Like
Best Practices for Setting up Monitoring Operations for Your AI Team
In this post, we'll explore key tips to help you set up a robust monitoring operation that proactively addresses issues before they negatively impact your business KPIs.
March 24, 2023
by Itai Bar-Sinai
· 3,238 Views · 2 Likes
Testing Level Dynamics: Achieving Confidence From Testing
In this article, explore shared experiences to gain insight into how teams have tried to achieve confidence from testing.
March 24, 2023
by Stelios Manioudakis
· 15,853 Views · 2 Likes
