You have probably heard of SonarQube as a code scanning and code quality check tool. SonarQube doesn't support Ansible by default; a plugin needs to be set up to scan Ansible playbooks or roles. In this article, you will learn how to set up and use SonarQube on your Ansible (YAML) code for linting and code analysis. This article uses the Community Edition of SonarQube.

What Is Ansible?
As explained in previous articles around Ansible, Ansible Beyond Automation and Automation Ansible AI, Ansible is a simple IT automation tool that helps you provision infrastructure, install software, and support application automation through advanced workflows. Ansible playbooks are written in YAML format and define a series of tasks to be executed on remote hosts. Playbooks offer a clear, human-readable way to describe complex automation workflows. Using playbooks, you define the required dependencies and desired state for your application.

What Is SonarQube?
SonarQube is a widely used open-source platform for continuous code quality inspection and analysis. It is designed to help developers and teams identify and address potential issues in their codebase, such as bugs, code smells, security vulnerabilities, and technical debt. SonarQube supports a wide range of programming languages, including Java, C#, C/C++, Python, JavaScript, and many others. The Community Edition of SonarQube can perform static code analysis for 19 languages, including Terraform, CloudFormation, Docker, Ruby, Kotlin, and Go.

Comparison of SonarQube Editions

Code Scanning and Analysis
SonarQube performs static code analysis, which means it examines the source code without executing it. This analysis is performed by parsing the code and applying a set of predefined rules and patterns to identify potential issues. SonarQube covers various aspects of code quality, including:

Code smells: SonarQube can detect code smells, which are indicators of potential maintainability issues or design flaws in the codebase. Examples include duplicated code, complex methods, and excessive coupling.
Bugs: SonarQube can identify potential bugs in the code, such as null pointer dereferences, resource leaks, and other common programming errors.
Security vulnerabilities: SonarQube can detect security vulnerabilities in the code, such as SQL injection, cross-site scripting (XSS), and other security flaws.
Technical debt: SonarQube can estimate the technical debt of a codebase, which represents the effort required to fix identified issues and bring the code up to a desired level of quality.

Importance of Code Scanning and Analysis
Code scanning and analysis with SonarQube offer several benefits to development teams:

Improved code quality: By identifying and addressing issues early in the development process, teams can improve the overall quality of their codebase, reducing the likelihood of bugs and making the code more maintainable.
Increased productivity: By automating the code analysis process, SonarQube saves developers time and effort that would otherwise be spent manually reviewing code.
Consistent code standards: SonarQube can enforce coding standards and best practices across the entire codebase, ensuring consistency and adherence to established guidelines.
Security awareness: By detecting security vulnerabilities early, teams can address them before they become exploitable in production environments, reducing the risk of security breaches.
Technical debt management: SonarQube's technical debt estimation helps teams prioritize and manage the effort required to address identified issues, ensuring that the codebase remains maintainable and extensible.

Perform Static Application Security Testing
Static Application Security Testing (SAST) is a method of security testing that analyzes source code to identify vulnerabilities and security flaws. Unlike Dynamic Application Security Testing (DAST), which tests running applications, SAST examines the code itself, making it a form of white-box testing. SonarQube is a leading tool for performing SAST, offering comprehensive capabilities to enhance code security and quality.

SonarQube integrates seamlessly with popular development tools and continuous integration/continuous deployment (CI/CD) pipelines, making it easy to incorporate code analysis into the development workflow. With its comprehensive analysis capabilities and support for various programming languages, SonarQube has become an essential tool for development teams seeking to improve code quality, maintain a secure and maintainable codebase, and deliver high-quality software products.

Install SonarQube on Your Local Machine
You can set it up using a zip file, or you can spin up a Docker container using one of SonarQube's Docker images.

1. Download and install Java 17 from Eclipse Temurin Latest Releases. If you are using macOS, you can install it with Homebrew using the command below.

Shell
brew install --cask temurin@17

2. Download the SonarQube Community Edition zip file.

3. As mentioned in the SonarQube documentation, as a non-root user, unzip the downloaded SonarQube Community Edition zip file to C:\sonarqube on Windows or to /opt/sonarqube on Linux/macOS. On Linux/macOS, you may have to create the folder as root:

Shell
sudo mkdir -p /opt/sonarqube

4. The folder structure in /opt/sonarqube should look similar to the image below. The key folders that you will be using for this article are bin and extensions/plugins.

SonarQube Community Edition folder structure

5. To start the SonarQube server, change to the directory where you unzipped the Community Edition and run the command below for your operating system. For example, if you are running on macOS, change the directory to /opt/sonarqube/bin/macosx-universal-64.

Shell
# On Windows, execute:
C:\sonarqube\bin\windows-x86-64\StartSonar.bat

# On other operating systems, as a non-root user execute:
/opt/sonarqube/bin/<OS>/sonar.sh console

Here's the folder structure under the bin folder.

bin folder structure

6. On macOS, this is how it looks when you run the server with Java 17 set up:

Shell
# Change to the directory and start the server
cd /opt/sonarqube/bin/macosx-universal-64
./sonar.sh console

SonarQube server up and running

If you are using a Docker image of the Community Edition from Docker Hub, run the command below:

Shell
docker run -d --name sonarqube -e SONAR_ES_BOOTSTRAP_CHECKS_DISABLE=true -p 9000:9000 sonarqube:latest

7. You can access the SonarQube server at http://localhost:9000. The initial system administrator username is admin and the password is admin. You will be asked to reset the password once logged in.

SonarQube console

SonarQube Projects
A SonarQube project represents a codebase that you want to analyze. Each project is identified by a unique key and can be configured with various settings, such as the programming languages used, the source code directories, and the quality gates (thresholds for code quality metrics).
You can create a new project in SonarQube through the web interface or automatically during the first analysis of your codebase. When creating a project manually, you need to provide a project key and other details like the project name and visibility settings.

Scanner CLI for SonarQube
A scanner needs to be set up to run code analysis for SonarQube. Project configuration is read from the sonar-project.properties file or passed on the command line. The SonarScanner CLI (command line interface) is a tool that allows you to analyze your codebase from the command line. It is the recommended scanner when there is no specific scanner available for your build system or when you want to run the analysis outside of your build process.

Download and Configure SonarScanner CLI
1. Based on the operating system on which you are running your SonarQube server, download the sonar-scanner from this link.
2. Unzip or expand the downloaded file into the directory of your choice. Let's refer to it as <INSTALL_DIRECTORY> in the next steps.
3. Update the global settings to point to your SonarQube server by editing <INSTALL_DIRECTORY>/conf/sonar-scanner.properties.

Plain Text
# Configure here general information about the environment, such as the server connection details for example
# No information about specific project should appear here

#----- SonarQube server URL (default to SonarCloud)
sonar.host.url=http://localhost:9000/
#sonar.scanner.proxyHost=myproxy.mycompany.com
#sonar.scanner.proxyPort=8002

4. Add the <INSTALL_DIRECTORY>/bin directory to your path. If you are using macOS or Linux, add it to your ~/.bashrc or ~/.zshrc and source the file: source ~/.bashrc

Setup Ansible Plugin
Before you set up the SonarQube plugin for Ansible, install ansible-lint:

Shell
pip install ansible-lint

On macOS, if you have Homebrew installed, use this command: brew install ansible-lint

To install and set up the SonarQube plugin for Ansible, follow the instructions here:
1. Download the YAML and Ansible SonarQube plugins.
2. Copy them into the extensions/plugins directory of SonarQube and restart SonarQube.

Plain Text
├── README.txt
├── sonar-ansible-plugin-2.5.1.jar
└── sonar-yaml-plugin-1.9.1.jar

3. Log into the SonarQube server console.
4. Click Quality Profiles to create a new quality profile for YAML.

Quality Profiles

5. Click Create.
6. Select Copy from an existing quality profile, fill in the details below, and click Create.
Language: YAML
Parent: YAML Analyzer (Built-in)
Name: ansible-scan

New quality profile

7. Activate the Ansible rules on the ansible-scan quality profile by clicking the menu icon and selecting Activate More Rules.

Activate more rules for Ansible

8. Search with the tag "ansible" and, from Bulk Change, click Activate in ansible-scan.

Search and apply

9. Set ansible-scan as the default. The Ansible rules will then also apply to other YAML files. You can now see that you have 20 rules for YAML and 38 rules for Ansible.

Set ansible-scan

Create a New Project and Run Your First Scan
1. Navigate to localhost in your browser to launch the SonarQube server console.
2. Click Create Project and select Local project. For demo purposes, you can download Ansible code from this GitHub repository.

Create local project

3. Enter a project display name, project key, and branch name, and click Next.

Local project creation

4. Under Choose the baseline for new code for this project, select Use the global setting and click Create project. Read the information below the selection to understand why you should pick this choice.
Select settings

5. Select Locally under Analysis Method, as you will be running the analysis locally on your machine.

Analysis method

6. Under Provide a token, select Generate a token. Give your token a name, click Generate, and click Continue. Under Run analysis on your project, select Other, then select your operating system (OS).

7. Click the copy icon to save the commands to the clipboard.

Generate token

8. On a terminal or command prompt, navigate to your Ansible code folder, then paste and execute the commands in your project's folder. You can see the ansible-lint rules called in the log.

Plain Text
INFO: ansible version:
INFO: ansible [core 2.17.0]
INFO: config file = None
INFO: configured module search path = ['/Users/vmac/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
INFO: ansible python module location = /usr/local/Cellar/ansible/10.0.1/libexec/lib/python3.12/site-packages/ansible
INFO: ansible collection location = /Users/vmac/.ansible/collections:/usr/share/ansible/collections
INFO: executable location = /usr/local/bin/ansible
INFO: python version = 3.12.3 (main, Apr 9 2024, 08:09:14) [Clang 15.0.0 (clang-1500.3.9.4)] (/usr/local/Cellar/ansible/10.0.1/libexec/bin/python)
INFO: jinja version = 3.1.4
INFO: libyaml = True
INFO: ansible-lint version:
INFO: ansible-lint 24.6.0 using ansible

9. On the SonarQube server console, you can see the analysis information.

Overview
Ansible code analyzed

Conclusion
In this article, you learned how to install, configure, and run the SonarQube plugin for Ansible, which allows developers and operations teams to analyze Ansible playbooks and roles for code quality, security vulnerabilities, and best practices. It leverages the YAML SonarQube plugin and adds additional rules specifically tailored for Ansible.

Suggested Reading
If you are new to Ansible and want to learn the tools and capabilities it provides, check out my previous articles: Ansible Beyond Automation and Automation Ansible AI.
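For reference, the commands copied in step 7 typically boil down to a sonar-scanner invocation with a project key, a server URL, and the generated token; the same values can also live in a sonar-project.properties file at the root of the Ansible repository. The project key, token placeholder, and source path below are hypothetical; use the values your own SonarQube console generated.

Plain Text
# sonar-project.properties (hypothetical values; adjust to your repository)
sonar.projectKey=ansible-demo
sonar.projectName=ansible-demo
# Scan the whole repository; playbooks and roles are YAML files
sonar.sources=.

Shell
# Run from the root of the Ansible code folder.
# <your-generated-token> is the token created in step 6
# (older scanner versions use -Dsonar.login instead of -Dsonar.token).
sonar-scanner \
  -Dsonar.host.url=http://localhost:9000 \
  -Dsonar.token=<your-generated-token>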
Debugging application issues in a Kubernetes cluster can often feel like navigating a labyrinth. Containers are ephemeral by design and intended to be immutable once deployed. This presents a unique challenge when something goes wrong and we need to dig into the issue. Before diving into the debugging tools and techniques, it's essential to grasp the core problem: why modifying container instances directly is a bad idea. This blog post will walk you through the intricacies of Kubernetes debugging, offering insights and practical tips to effectively troubleshoot your Kubernetes environment.

The Problem With Kubernetes

The Immutable Nature of Containers
One of the fundamental principles of Kubernetes is the immutability of container instances. This means that once a container is running, it shouldn't be altered. Modifying containers on the fly can lead to inconsistencies and unpredictable behavior, especially as Kubernetes orchestrates the lifecycle of these containers, replacing them as needed. Imagine trying to diagnose an issue only to realize that the container you're investigating has been modified, making it difficult to reproduce the problem consistently.

The idea behind this immutability is to ensure that every instance of a container is identical to any other instance. This consistency is crucial for achieving reliable, scalable applications. If you start modifying containers, you undermine this consistency, leading to a situation where one container behaves differently from another, even though they are supposed to be identical.

The Limitations of kubectl exec
We often start our journey in Kubernetes with commands such as:

Shell
$ kubectl exec -ti <pod-name> -- /bin/sh

This logs into a container and feels like accessing a traditional server with SSH. However, this approach has significant limitations. Containers often lack basic diagnostic tools—no vim, no traceroute, sometimes not even a shell. This can be a rude awakening for those accustomed to a full-featured Linux environment. Additionally, if a container crashes, kubectl exec becomes useless as there's no running instance to connect to. This tool is insufficient for thorough debugging, especially in production environments.

Consider the frustration of logging into a container only to find out that you can't even open a simple text editor to check configuration files. This lack of basic tools means that you are often left with very few options for diagnosing problems. Moreover, the minimalistic nature of many container images, designed to reduce their attack surface and footprint, exacerbates this issue.

Avoiding Direct Modifications
While it might be tempting to install missing tools on the fly using commands like apt-get install vim, this practice violates the principle of container immutability. In production, installing packages dynamically can introduce new dependencies, potentially causing application failures. The risks are high, and it's crucial to maintain the integrity of your deployment manifests, ensuring that all configurations are predefined and reproducible.

Imagine a scenario where a quick fix in production involves installing a missing package. This might solve the immediate problem but could lead to unforeseen consequences. Dependencies introduced by the new package might conflict with existing ones, leading to application instability. Moreover, this approach makes it challenging to reproduce the exact environment, which is vital for debugging and scaling your application.
Enter Ephemeral Containers
The solution to the aforementioned problems lies in ephemeral containers. Kubernetes allows the creation of these temporary containers within the same pod as the application container you need to debug. These ephemeral containers are isolated from the main application, ensuring that any modifications or tools installed do not impact the running application.

Ephemeral containers provide a way to bypass the limitations of kubectl exec without violating the principles of immutability and consistency. By launching a separate container within the same pod, you can inspect and diagnose the application container without altering its state. This approach preserves the integrity of the production environment while giving you the tools you need to debug effectively.

Using kubectl debug
The kubectl debug command is a powerful tool that simplifies the creation of ephemeral containers. Unlike kubectl exec, which logs into the existing container, kubectl debug creates a new container within the same namespace. This container can run a different OS, mount the application container's filesystem, and provide all necessary debugging tools without altering the application's state. This method ensures you can inspect and diagnose issues even if the original container is not operational.

For example, let's consider a scenario where we're debugging a container using an ephemeral Ubuntu container:

Shell
kubectl debug <pod-name> -it --image=ubuntu --share-processes --copy-to=<pod-name>-debug

This command launches a new Ubuntu-based container within the same pod, providing a full-fledged environment to diagnose the application container. Even if the original container lacks a shell or crashes, the ephemeral container remains operational, allowing you to perform necessary checks and install tools as needed. It relies on the fact that we can have multiple containers in the same pod; that way, we can inspect the filesystem of the debugged container without physically entering that container.

Practical Application of Ephemeral Containers
To illustrate, let's delve deeper into how ephemeral containers can be used in real-world scenarios. Suppose you have a container that consistently crashes due to a mysterious issue. By deploying an ephemeral container with a comprehensive set of debugging tools, you can monitor the logs, inspect the filesystem, and trace processes without worrying about the constraints of the original container environment.

For instance, you might encounter a situation where an application container crashes due to an unhandled exception. By using kubectl debug, you can create an ephemeral container that shares the same network namespace as the original container. This allows you to capture network traffic and analyze it to understand if there are any issues related to connectivity or data corruption.

Security Considerations
While ephemeral containers reduce the risk of impacting the production environment, they still pose security risks. It's critical to restrict access to debugging tools and ensure that only authorized personnel can deploy ephemeral containers. Treat access to these systems with the same caution as handing over the keys to your infrastructure. Ephemeral containers, by their nature, can access sensitive information within the pod. Therefore, it is essential to enforce strict access controls and audit logs to track who is deploying these containers and what actions are being taken.
This ensures that the debugging process does not introduce new vulnerabilities or expose sensitive data.

Interlude: The Role of Observability
While tools like kubectl exec and kubectl debug are invaluable for troubleshooting, they are not replacements for comprehensive observability solutions. Observability allows you to monitor, trace, and log the behavior of your applications in real time, providing deeper insights into issues without the need for intrusive debugging sessions. These tools aren't meant for everyday debugging: that role should be occupied by various observability tools. I will discuss observability in more detail in an upcoming post.

Command Line Debugging
While tools like kubectl exec and kubectl debug are invaluable, there are times when you need to dive deep into the application code itself. This is where command line debuggers come in. Command line debuggers allow you to inspect the state of your application at a very granular level, stepping through code, setting breakpoints, and examining variable states. Personally, I don't use them much. For instance, Java developers can use jdb, the Java Debugger, which is analogous to gdb for C/C++ programs. Here's a basic rundown of how you might use jdb in a Kubernetes environment:

1. Set Up Debugging
First, you need to start your Java application with debugging enabled. This typically involves adding a debug flag to your Java command. However, as discussed in my post here, there's an even more powerful way that doesn't require a restart:

Shell
java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005 -jar myapp.jar

2. Port Forwarding
Since the debugger needs to connect to the application, you'll set up port forwarding to expose the debug port of your pod to your local machine. This is important, as JDWP is dangerous:

Shell
kubectl port-forward <pod-name> 5005:5005

3. Connecting the Debugger
With port forwarding in place, you can now connect jdb to the remote application:

Shell
jdb -attach localhost:5005

From here, you can use jdb commands to set breakpoints, step through code, and inspect variables. This process allows you to debug issues within the code itself, which can be invaluable for diagnosing complex problems that aren't immediately apparent through logs or superficial inspection.

Connecting a Standard IDE for Remote Debugging
I prefer IDE debugging by far; I have never used jdb for anything other than a demo. Modern IDEs support remote debugging, and by leveraging Kubernetes port forwarding, you can connect your IDE directly to a running application inside a pod. To set up remote debugging, we start with the same steps as command line debugging: configuring the application and setting up the port forwarding. Then:

1. Configure the IDE
In your IDE (e.g., IntelliJ IDEA, Eclipse), set up a remote debugging configuration. Specify the host as localhost and the port as 5005.

2. Start Debugging
Launch the remote debugging session in your IDE. You can now set breakpoints, step through code, and inspect variables directly within the IDE, just as if you were debugging a local application.

Conclusion
Debugging Kubernetes environments requires a blend of traditional techniques and modern tools designed for container orchestration. Understanding the limitations of kubectl exec and the benefits of ephemeral containers can significantly enhance your troubleshooting process.
However, the ultimate goal should be to build robust observability into your applications, reducing the need for ad-hoc debugging and enabling proactive issue detection and resolution. By following these guidelines and leveraging the right tools, you can navigate the complexities of Kubernetes debugging with confidence and precision. In the next installment of this series, we’ll delve into common configuration issues in Kubernetes and how to address them effectively.
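Before moving on, here is a compact sketch that ties the ephemeral-container approach together: attaching a lightweight debug container to a running pod and inspecting the application container's filesystem through the shared process namespace. The pod and container names (myapp, app) are hypothetical placeholders, and busybox is just one convenient debug image.

Shell
# Attach an ephemeral busybox container that shares the process namespace
# of the target application container (hypothetical pod "myapp", container "app").
kubectl debug -it myapp --image=busybox:1.36 --target=app

# Inside the debug shell, the application's processes are visible:
ps aux

# With a shared process namespace, the application container's root filesystem
# can be inspected through /proc/<pid>/root, e.g. for its main process (PID 1 here):
ls /proc/1/root/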
Tech teams do their best to develop amazing software products. They spend countless hours coding, testing, and refining every little detail. However, even the most carefully crafted systems may encounter issues along the way. That's where reliability models and metrics come into play. They help us identify potential weak spots, anticipate failures, and build better products.

The reliability of a system is a multidimensional concept that encompasses various aspects, including, but not limited to:

Availability: The system is available and accessible to users whenever needed, without excessive downtime or interruptions. It includes considerations for system uptime, fault tolerance, and recovery mechanisms.
Performance: The system should function within acceptable speed and resource usage parameters. It scales efficiently to meet growing demands (increasing loads, users, or data volumes). This ensures a smooth user experience and responsiveness to user actions.
Stability: The software system operates consistently over time and maintains its performance levels without degradation or instability. It avoids unexpected crashes, freezes, or unpredictable behavior.
Robustness: The system can gracefully handle unexpected inputs, invalid user interactions, and adverse conditions without crashing or compromising its functionality. It exhibits resilience to errors and exceptions.
Recoverability: The system can recover from failures, errors, or disruptions and restore normal operation with minimal data loss or impact on users. It includes mechanisms for data backup, recovery, and rollback.
Maintainability: The system should be easy to understand, modify, and fix when necessary. This allows for efficient bug fixes, updates, and future enhancements.

This article starts by analyzing mean time metrics. Basic probability distribution models for reliability are then highlighted with their pros and cons. A distinction between software and hardware failure models follows. Finally, reliability growth models are explored, including a list of factors for choosing the right model.

Mean Time Metrics
Some of the most commonly tracked metrics in the industry are MTTA (mean time to acknowledge), MTBF (mean time before failure), MTTR (mean time to recovery, repair, respond, or resolve), and MTTF (mean time to failure). They help tech teams understand how often incidents occur and how quickly the team bounces back from those incidents.

The acronym MTTR can be misleading. When discussing MTTR, it might seem like a singular metric with a clear definition. However, it actually encompasses four distinct measurements. The 'R' in MTTR can signify repair, recovery, response, or resolution. While these four metrics share similarities, each carries its own significance and subtleties.

Mean Time To Repair: This focuses on the time it takes to fix a failed component.
Mean Time To Recovery: This considers the time to restore full functionality after a failure.
Mean Time To Respond: This emphasizes the initial response time to acknowledge and investigate an incident.
Mean Time To Resolve: This encompasses the entire incident resolution process, including diagnosis, repair, and recovery.

While these metrics overlap, they provide a distinct perspective on how quickly a team resolves incidents. MTTA, or Mean Time To Acknowledge, measures how quickly your team reacts to alerts by tracking the average time from alert trigger to initial investigation. It helps assess both team responsiveness and alert system effectiveness.
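To make these definitions concrete, here is a small Python sketch of how MTTA and MTTR (the recovery flavor) are typically computed from incident timestamps. The incident records and field names are hypothetical; in practice, these timestamps would come from your alerting and incident management tooling.

Python
from datetime import datetime

# Hypothetical incident records: when the alert fired, when a human
# acknowledged it, and when service was fully restored.
incidents = [
    {"alerted": "2024-06-01 10:00", "acknowledged": "2024-06-01 10:04", "resolved": "2024-06-01 11:30"},
    {"alerted": "2024-06-03 22:15", "acknowledged": "2024-06-03 22:25", "resolved": "2024-06-04 00:05"},
]

def _parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")

def mean_minutes(pairs):
    """Average elapsed minutes between (start, end) timestamp pairs."""
    deltas = [(_parse(end) - _parse(start)).total_seconds() / 60 for start, end in pairs]
    return sum(deltas) / len(deltas)

# MTTA: average time from alert trigger to acknowledgement
mtta = mean_minutes([(i["alerted"], i["acknowledged"]) for i in incidents])

# MTTR (recovery): average time from alert trigger to restored service
mttr = mean_minutes([(i["alerted"], i["resolved"]) for i in incidents])

print(f"MTTA: {mtta:.1f} minutes, MTTR: {mttr:.1f} minutes")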
MTBF, or Mean Time Between Failures, represents the average time a repairable system operates between unscheduled failures. It considers both the operating time and the repair time. MTBF helps estimate how often a system is likely to experience a failure and require repair. It's valuable for planning maintenance schedules, resource allocation, and predicting system uptime.

For a system that cannot or should not be repaired, MTTF, or Mean Time To Failure, represents the average time that the system operates before experiencing its first failure. Unlike MTBF, it doesn't consider repair times. MTTF is used to estimate the lifespan of products that are not designed to be repaired after failing. This makes MTTF particularly relevant for components or systems where repair is either impossible or not economically viable. It's useful for comparing the reliability of different systems or components and informing design decisions for improved longevity.

An analogy to illustrate the difference between MTBF and MTTF could be a fleet of delivery vans.
MTBF: This would represent the average time between breakdowns for each van, considering both the driving time and the repair time it takes to get the van back on the road.
MTTF: This would represent the average lifespan of each van before it experiences its first breakdown, regardless of whether it's repairable or not.

Key Differentiators

Feature | MTBF | MTTF
Repairable system | Yes | No
Repair time | Considered in the calculation | Not considered in the calculation
Failure focus | Time between subsequent failures | Time to the first failure
Application | Planning maintenance, resource allocation | Assessing inherent system reliability

The Bigger Picture
MTTR, MTTA, MTTF, and MTBF can also be used all together to provide a comprehensive picture of your team's effectiveness and areas for improvement. Mean time to recovery indicates how quickly you get systems operational again. Incorporating mean time to respond allows you to differentiate between team response time and alert system efficiency. Adding mean time to repair further breaks down how much time is spent on repairs versus troubleshooting. Mean time to resolve incorporates the entire incident lifecycle, encompassing the impact beyond downtime. But the story doesn't end there. Mean time between failures reveals your team's success in preventing or reducing future issues. Finally, incorporating mean time to failure provides insights into the overall lifespan and inherent reliability of your product or system.

Probability Distributions for Reliability
The following probability distributions are commonly used in reliability engineering to model the time until the failure of systems or components. They are often employed in reliability analysis to characterize the failure behavior of systems over time.

Exponential Distribution Model
This model assumes a constant failure rate over time. This means that the probability of a component failing is independent of its age or how long it has been operating.
Applications: This model is suitable for analyzing components with random failures, such as memory chips, transistors, or hard drives. It's particularly useful in the early stages of a product's life cycle when failure data might be limited.
Limitations: The constant failure rate assumption might not always hold true. As hardware components age, they might become more susceptible to failures (wear-out failures), which the Exponential Distribution Model wouldn't capture.
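As a worked illustration of the constant-failure-rate assumption, the short Python sketch below uses an arbitrary example failure rate. Under the exponential model, MTTF is the reciprocal of the failure rate, and the probability of surviving to time t is R(t) = exp(-λt).

Python
import math

# Hypothetical constant failure rate: 0.0002 failures per hour
failure_rate = 2e-4

# Under the exponential model, MTTF is the reciprocal of the failure rate
mttf_hours = 1 / failure_rate  # 5,000 hours

def reliability(t_hours: float, lam: float = failure_rate) -> float:
    """Probability the component is still working after t_hours: R(t) = exp(-lambda * t)."""
    return math.exp(-lam * t_hours)

print(f"MTTF: {mttf_hours:.0f} hours")
print(f"R(1000 h): {reliability(1000):.3f}")  # ~0.819
print(f"R(5000 h): {reliability(5000):.3f}")  # ~0.368, i.e. 1/e at t = MTTF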
Weibull Distribution Model This model offers more flexibility by allowing dynamic failure rates. It can model situations where the probability of failure increases over time at an early stage (infant mortality failures) or at a later stage (wear-out failures). Infant mortality failures: This could represent new components with manufacturing defects that are more likely to fail early on. Wear-out failures: This could represent components like mechanical parts that degrade with use and become more likely to fail as they age. Applications: The Weibull Distribution Model is more versatile than the Exponential Distribution Model. It's a good choice for analyzing a wider range of hardware components with varying failure patterns. Limitations: The Weibull Distribution Model requires more data to determine the shape parameter that defines the failure rate behavior (increasing, decreasing, or constant). Additionally, it might be too complex for situations where a simpler model like the Exponential Distribution would suffice. The Software vs Hardware Distinction The nature of software failures is different from that of hardware failures. Although both software and hardware may experience deterministic as well as random failures, their failures have different root causes, different failure patterns, and different prediction, prevention, and repair mechanisms. Depending on the level of interdependence between software and hardware and how it affects our systems, it may be beneficial to consider the following factors: 1. Root Cause of Failures Hardware: Hardware failures are physical in nature, caused by degradation of components, manufacturing defects, or environmental factors. These failures are often random and unpredictable. Consequently, hardware reliability models focus on physical failure mechanisms like fatigue, corrosion, and material defects. Software: Software failures usually stem from logical errors, code defects, or unforeseen interactions with the environment. These failures may be systematic and can be traced back to specific lines of code or design flaws. Consequently, software reliability models do not account for physical degradation over time. 2. Failure Patterns Hardware: Hardware failures often exhibit time-dependent behavior. Components might be more susceptible to failures early in their lifespan (infant mortality) or later as they wear out. Software: The behavior of software failures in time can be very tricky and usually depends on the evolution of our code, among others. A bug in the code will remain a bug until it's fixed, regardless of how long the software has been running. 3. Failure Prediction, Prevention, Repairs Hardware: Hardware reliability models that use MTBF often focus on predicting average times between failures and planning preventive maintenance schedules. Such models analyze historical failure data from identical components. Repairs often involve the physical replacement of components. Software: Software reliability models like Musa-Okumoto and Jelinski-Moranda focus on predicting the number of remaining defects based on testing data. These models consider code complexity and defect discovery rates to guide testing efforts and identify areas with potential bugs. Repair usually involves debugging and patching, not physical replacement. 4. Interdependence and Interaction Failures The level of interdependence between software and hardware varies for different systems, domains, and applications. 
Tight coupling between software and hardware may cause interaction failures. There can be software failures due to hardware and vice-versa.

Here's a table summarizing the key differences:

Feature | Hardware Reliability Models | Software Reliability Models
Root Cause of Failures | Physical Degradation, Defects, Environmental Factors | Code Defects, Design Flaws, External Dependencies
Failure Patterns | Time-Dependent (Infant Mortality, Wear-Out) | Non-Time Dependent (Bugs Remain Until Fixed)
Prediction Focus | Average Times Between Failures (MTBF, MTTF) | Number of Remaining Defects
Prevention Strategies | Preventive Maintenance Schedules | Code Review, Testing, Bug Fixes

By understanding the distinct characteristics of hardware and software failures, we may be able to leverage tailored reliability models, whenever necessary, to gain in-depth knowledge of our system's behavior. This way we can implement targeted strategies for prevention and mitigation in order to build more reliable systems.

Code Complexity
Code complexity assesses how difficult a codebase is to understand and maintain. Higher complexity often correlates with an increased likelihood of hidden bugs. By measuring code complexity, developers can prioritize testing efforts and focus on areas with potentially higher defect density. The following tools can automate the analysis of code structure and identify potential issues like code duplication, long functions, and high cyclomatic complexity:

SonarQube: A comprehensive platform offering code quality analysis, including code complexity metrics
Fortify: Provides static code analysis for security vulnerabilities and code complexity
CppDepend (for C++): Analyzes code dependencies and metrics for C++ codebases
PMD: An open-source tool for identifying common coding flaws and complexity metrics

Defect Density
Defect density illuminates the prevalence of bugs within our code. It's calculated as the number of defects discovered per unit of code, typically lines of code (LOC). A lower defect density signifies a more robust and reliable software product.

Reliability Growth Models
Reliability growth models help development teams estimate the testing effort required to achieve desired reliability levels and ensure a smooth launch of their software. These models predict software reliability improvements as testing progresses, offering insights into the effectiveness of testing strategies and guiding resource allocation. They are mathematical models used to predict and improve the reliability of systems over time by analyzing historical data on defects or failures and their removal. Some models exhibit characteristics of exponential growth. Other models exhibit characteristics of power law growth while there exist models that exhibit both exponential and power law growth. The distinction is primarily based on the underlying assumptions about how the fault detection rate changes over time in relation to the number of remaining faults.

While a detailed analysis of reliability growth models is beyond the scope of this article, I will provide a categorization that may help for further study. Traditional growth models encompass the commonly used and foundational models, while the Bayesian approach represents a distinct methodology. The advanced growth models encompass more complex models that incorporate additional factors or assumptions. Please note that the list is indicative and not exhaustive.
Traditional Growth Models Musa-Okumoto Model It assumes a logarithmic Poisson process for fault detection and removal, where the number of failures observed over time follows a logarithmic function of the number of initial faults. Jelinski-Moranda Model It assumes a constant failure intensity over time and is based on the concept of error seeding. It postulates that software failures occur at a rate proportional to the number of remaining faults in the system. Goel-Okumoto Model It incorporates the assumption that the fault detection rate decreases exponentially as faults are detected and fixed. It also assumes a non-homogeneous Poisson process for fault detection. Non-Homogeneous Poisson Process (NHPP) Models They assume the fault detection rate is time-dependent and follows a non-homogeneous Poisson process. These models allow for more flexibility in capturing variations in the fault detection rate over time. Bayesian Approach Wall and Ferguson Model It combines historical data with expert judgment to update reliability estimates over time. This model considers the impact of both defect discovery and defect correction efforts on reliability growth. Advanced Growth Models Duane Model This model assumes that the cumulative MTBF of a system increases as a power-law function of the cumulative test time. This is known as the Duane postulate and it reflects how quickly the reliability of the system is improving as testing and debugging occur. Coutinho Model Based on the Duane model, it extends to the idea of an instantaneous failure rate. This rate involves the number of defects found and the number of corrective actions made during testing time. This model provides a more dynamic representation of reliability growth. Gooitzen Model It incorporates the concept of imperfect debugging, where not all faults are detected and fixed during testing. This model provides a more realistic representation of the fault detection and removal process by accounting for imperfect debugging. Littlewood Model It acknowledges that as system failures are discovered during testing, the underlying faults causing these failures are repaired. Consequently, the reliability of the system should improve over time. This model also considers the possibility of negative reliability growth when a software repair introduces further errors. Rayleigh Model The Rayleigh probability distribution is a special case of the Weibull distribution. This model considers changes in defect rates over time, especially during the development phase. It provides an estimation of the number of defects that will occur in the future based on the observed data. Choosing the Right Model There's no single "best" reliability growth model. The ideal choice depends on the specific project characteristics and available data. Here are some factors to consider. Specific objectives: Determine the specific objectives and goals of reliability growth analysis. Whether the goal is to optimize testing strategies, allocate resources effectively, or improve overall system reliability, choose a model that aligns with the desired outcomes. Nature of the system: Understand the characteristics of the system being analyzed, including its complexity, components, and failure mechanisms. Certain models may be better suited for specific types of systems, such as software, hardware, or complex systems with multiple subsystems. Development stage: Consider the stage of development the system is in. 
Early-stage development may benefit from simpler models that provide basic insights, while later stages may require more sophisticated models to capture complex reliability growth behaviors. Available data: Assess the availability and quality of data on past failures, fault detection, and removal. Models that require extensive historical data may not be suitable if data is limited or unreliable. Complexity tolerance: Evaluate the complexity tolerance of the stakeholders involved. Some models may require advanced statistical knowledge or computational resources, which may not be feasible or practical for all stakeholders. Assumptions and limitations: Understand the underlying assumptions and limitations of each reliability growth model. Choose a model whose assumptions align with the characteristics of the system and the available data. Predictive capability: Assess the predictive capability of the model in accurately forecasting future reliability levels based on past data. Flexibility and adaptability: Consider the flexibility and adaptability of the model to different growth patterns and scenarios. Models that can accommodate variations in fault detection rates, growth behaviors, and system complexities are more versatile and applicable in diverse contexts. Resource requirements: Evaluate the resource requirements associated with implementing and using the model, including computational resources, time, and expertise. Choose a model that aligns with the available resources and capabilities of the organization. Validation and verification: Verify the validity and reliability of the model through validation against empirical data or comparison with other established models. Models that have been validated and verified against real-world data are more trustworthy and reliable. Regulatory requirements: Consider any regulatory requirements or industry standards that may influence the choice of reliability growth model. Certain industries may have specific guidelines or recommendations for reliability analysis that need to be adhered to. Stakeholder input: Seek input and feedback from relevant stakeholders, including engineers, managers, and domain experts, to ensure that the chosen model meets the needs and expectations of all parties involved. Wrapping Up Throughout this article, we explored a plethora of reliability models and metrics. From the simple elegance of MTTR to the nuanced insights of NHPP models, each instrument offers a unique perspective on system health. The key takeaway? There's no single "rockstar" metric or model that guarantees system reliability. Instead, we should carefully select and combine the right tools for the specific system at hand. By understanding the strengths and limitations of various models and metrics, and aligning them with your system's characteristics, you can create a comprehensive reliability assessment plan. This tailored approach may allow us to identify potential weaknesses and prioritize improvement efforts.
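To make one of the models above concrete, here is a small Python sketch of the Duane idea from the Advanced Growth Models list: cumulative MTBF growing as a power law of cumulative test time, estimated with a simple log-log least-squares fit. The test observations are invented purely for illustration.

Python
import math

# Hypothetical (cumulative test hours, cumulative failures) observations
observations = [(100, 9), (300, 20), (1000, 45), (3000, 90), (10000, 200)]

# Cumulative MTBF at each point: total test time divided by total failures so far
log_t = [math.log(t) for t, _ in observations]
log_mtbf = [math.log(t / n) for t, n in observations]

# Least-squares fit of log(MTBF_cum) = log(b) + alpha * log(t)  (the Duane postulate)
n = len(observations)
mean_x, mean_y = sum(log_t) / n, sum(log_mtbf) / n
alpha = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_t, log_mtbf)) / sum(
    (x - mean_x) ** 2 for x in log_t
)
b = math.exp(mean_y - alpha * mean_x)

print(f"Estimated growth slope alpha: {alpha:.2f}")
print(f"Projected cumulative MTBF at 20,000 test hours: {b * 20000 ** alpha:.1f} hours")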
The amount of data generated by modern systems has become a double-edged sword for security teams. While it offers valuable insights, sifting through mountains of logs and alerts manually to identify malicious activity is no longer feasible. Here's where rule-based incident detection steps in, offering a way to automate the process by leveraging predefined rules to flag suspicious activity. However, the choice of tool for processing high-volume data for real-time insights is crucial. This article delves into the strengths and weaknesses of two popular options: Splunk, a leading batch search tool, and Flink, a powerful stream processing framework, specifically in the context of rule-based security incident detection.

Splunk: Powerhouse Search and Reporting
Splunk has become a go-to platform for making application and infrastructure logs readily available for ad-hoc search. Its core strength lies in its ability to ingest log data from various sources, centralize it, and enable users to explore it through powerful search queries. This empowers security teams to build comprehensive dashboards and reports, providing a holistic view of their security posture. Additionally, Splunk supports scheduled searches, allowing users to automate repetitive queries and receive regular updates on specific security metrics. This can be particularly valuable for configuring rule-based detections, monitoring key security indicators, and identifying trends over time.

Flink: The Stream Processing Champion
Apache Flink, on the other hand, takes a fundamentally different approach. It is a distributed processing engine designed to handle stateful computations over unbounded and bounded data streams. Unlike Splunk's batch processing, Flink excels at real-time processing, enabling it to analyze data as it arrives, offering near-instantaneous insights. This makes it ideal for scenarios where immediate detection and response are paramount, such as identifying ongoing security threats or preventing fraudulent transactions in real time. Flink's ability to scale horizontally across clusters makes it suitable for handling massive data volumes, a critical factor for organizations wrestling with ever-growing security data.

Case Study: Detecting User Login Attacks
Let's consider a practical example: a rule designed to detect potential brute-force login attempts. This rule aims to identify users who experience a high number of failed login attempts within a specific timeframe (e.g., an hour). Here's how the rule implementation would differ in Splunk and Flink:

Splunk Implementation

Plain Text
sourcetype=login_logs earliest=-1h (result="failure" OR result="failed")
| stats count by user
| where count > 5

This Splunk search query filters login logs for failed attempts over the past hour and calculates the count of failed attempts per user. Saved as a scheduled alert, it triggers a notification such as "Potential Brute Force Login Attempt for user: $user$" whenever the count exceeds a predefined threshold (5). While efficient for basic detection, it relies on batch processing, potentially introducing latency in identifying ongoing attacks.

Flink Implementation

SQL
SELECT `user`, COUNT(*) AS failed_attempts
FROM login_logs
WHERE result = 'failure' OR result = 'failed'
GROUP BY `user`, TUMBLE(event_time, INTERVAL '1' HOUR)
HAVING COUNT(*) > 5;

Flink takes a more real-time approach. As each login event arrives, Flink checks the user and result. If it's a failed attempt, a counter for that user's window (1 hour) is incremented.
If the count surpasses the threshold (5) within the window, Flink triggers an alert. This provides near-instantaneous detection of suspicious login activity.

A Deep Dive: Splunk vs. Flink for Detecting User Login Attacks
The underlying processing models of Splunk and Flink lead to fundamental differences in how they handle security incident detection. Here's a closer look at the key areas:

Batch vs. Stream Processing
Splunk: Splunk operates on historical data. Security analysts write search queries that retrieve and analyze relevant logs. These queries can be configured to run automatically on a schedule. This is a batch processing approach, meaning Splunk needs to search through potentially a large volume of data to identify anomalies or trends. For the login attempt example, Splunk would need to query all login logs within the past hour every time the search is run to calculate the failed login count per user. This can introduce significant latency in detection and increase compute costs, especially when dealing with large datasets.

Flink: Flink analyzes data streams in real time. As each login event arrives, Flink processes it immediately. This stream-processing approach allows Flink to maintain a continuous state and update it with each incoming event. In the login attempt scenario, Flink keeps track of failed login attempts per user within a rolling one-hour window. With each new login event, Flink checks the user and result. If it's a failed attempt, the counter for that user's window is incremented. This eliminates the need to query a large amount of historical data every time a check is needed.

Windowing
Splunk: Splunk performs windowing calculations after retrieving all relevant logs. In our example, the search first retrieves all login attempts within the past hour (earliest=-1h) and then stats count by user calculates the count for each user. This approach can be inefficient for real-time analysis, especially as data volume increases.

Flink: Flink maintains a rolling window and continuously updates the state based on incoming events. Flink uses a concept called "time windows" to partition the data stream into specific time intervals (e.g., one hour). For each window, Flink keeps track of relevant information, such as the number of failed login attempts per user. As new data arrives, Flink updates the state for the current window. This eliminates the need for a separate post-processing step to calculate windowed aggregations.

Alerting Infrastructure
Splunk: Splunk relies on pre-configured alerting actions within the platform. Splunk allows users to define search queries that trigger alerts when specific conditions are met. These alerts can be delivered through various channels such as email, SMS, or integrations with other security tools.

Flink: Flink might require integration with external tools for alerts. While Flink can identify anomalies in real time, it may not have built-in alerting functionalities like Splunk. Security teams often integrate Flink with external Security Information and Event Management (SIEM) solutions for alert generation and management.

In essence, Splunk operates like a detective sifting through historical evidence, while Flink functions as a security guard constantly monitoring activity. Splunk is a valuable tool for forensic analysis and identifying historical trends. However, for real-time threat detection and faster response times, Flink's stream processing capabilities offer a significant advantage.
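To make the contrast between batch recomputation and continuously maintained window state concrete outside of any particular engine, here is a small framework-agnostic Python sketch of the streaming approach: each event updates per-user state, and old events fall out of a rolling one-hour window. The event fields and threshold mirror the login example above; this is an illustration of the concept, not Flink code.

Python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)
THRESHOLD = 5

# Per-user deque of failed-login timestamps inside the rolling window
failed_logins = defaultdict(deque)

def on_login_event(user: str, result: str, event_time: datetime) -> None:
    """Process one login event as it arrives (stream-style, no batch re-query)."""
    if result not in ("failure", "failed"):
        return
    window = failed_logins[user]
    window.append(event_time)
    # Evict events older than one hour relative to the newest event
    while window and event_time - window[0] > WINDOW:
        window.popleft()
    if len(window) > THRESHOLD:
        print(f"ALERT: possible brute-force login attempts for user {user}")

# Example usage with synthetic events: the sixth failure within the hour raises an alert
start = datetime(2024, 6, 1, 12, 0)
for i in range(7):
    on_login_event("alice", "failed", start + timedelta(minutes=i))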
Choosing the Right Tool: A Balancing Act While Splunk provides a user-friendly interface and simplifies rule creation, its batch processing introduces latency, which can be detrimental to real-time security needs. Flink excels in real-time processing and scalability, but it requires more technical expertise to set up and manage. Beyond Latency and Ease of Use: Additional Considerations The decision between Splunk and Flink goes beyond just real-time processing and ease of use. Here are some additional factors to consider: Data Volume and Variety Security teams are often overwhelmed by the sheer volume and variety of data they need to analyze. Splunk excels at handling structured data like logs but struggles with real-time ingestion and analysis of unstructured data like network traffic or social media feeds. Flink, with its distributed architecture, can handle diverse data types at scale. Alerting and Response Both Splunk and Flink can trigger alerts based on rule violations. However, Splunk integrates seamlessly with existing Security Information and Event Management (SIEM) systems, streamlining the incident response workflow. Flink might require additional development effort to integrate with external alerting and response tools. Cost Splunk's licensing costs are based on data ingestion volume, which can become expensive for organizations with massive security data sets. Flink, being open-source, eliminates licensing fees. However, the cost of technical expertise for setup, maintenance, and rule development for Flink needs to be factored in. The Evolving Security Landscape: A Hybrid Approach The security landscape is constantly evolving, demanding a multifaceted approach. Many organizations find value in a hybrid approach, leveraging the strengths of both Splunk and Flink. Splunk as the security hub: Splunk can serve as a central repository for security data, integrating logs from various sources, including real-time data feeds from Flink. Security analysts can utilize Splunk's powerful search capabilities for historical analysis, threat hunting, and investigation. Flink for real-time detection and response: Flink can be deployed for real-time processing of critical security data streams, focusing on identifying and responding to ongoing threats. This combination allows security teams to enjoy the benefits of both worlds: Comprehensive security visibility: Splunk provides a holistic view of historical and current security data. Real-time threat detection and response: Flink enables near-instantaneous identification and mitigation of ongoing security incidents. Conclusion: Choosing the Right Tool for the Job Neither Splunk nor Flink is a one-size-fits-all solution for rule-based incident detection. The optimal choice depends on your specific security needs, data volume, technical expertise, and budget. Security teams should carefully assess these factors and potentially consider a hybrid approach to leverage the strengths of both Splunk and Flink for a robust and comprehensive security posture. By understanding the strengths and weaknesses of each tool, security teams can make informed decisions about how to best utilize them to detect and respond to security threats in a timely and effective manner.
In many large organizations, software quality is primarily viewed as the responsibility of the testing team. When bugs slip through to production, or products fail to meet customer expectations, testers are the ones blamed. However, taking a closer look, quality — and likewise, failure — extends well beyond any one discipline. Quality is a responsibility shared across an organization. When quality issues arise, the root cause is rarely something testing alone could have prevented. Typically, there were breakdowns in communication, unrealistic deadlines, inadequate design specifications, insufficient training, or corporate governance policies that incentivized rushing. In other words, quality failures tend to stem from broader organizational and leadership failures. Scapegoating testers for systemic issues is counterproductive. It obscures the real problems and stands in the way of meaningful solutions to quality failings. Testing in Isolation In practice, all too often, testing teams still work in isolation from the rest of the product development lifecycle. They are brought in at the end, given limited information, and asked to validate someone else’s work. Under these conditions, their ability to prevent defects is severely constrained. For example, without access to product requirement documents, test cases may overlook critical functions that need validation. With short testing timelines, extensive test coverage further becomes impossible. Without insight into design decisions or access to developers, some defects found in testing prove impossible to diagnose effectively. Testers are often parachuted in when the time and cost of repairing a defect has grown to be unfeasible. In this isolated model, testing serves as little more than a final safety check before release. The burden of quality is passed almost entirely to the testers. When the inevitable bugs still slip through, testers then make for easy scapegoats. Who Owns Software Quality? In truth, responsibility for product quality is distributed across an organization. So, what can you do? Quality is everyone’s responsibility. Image sources: Kharnagy (Wikipedia), under CC BY-SA 4.0 license, combined with an image from Pixabay. Executives and leadership teams — Set the tone and policies around quality, balancing it appropriately against other priorities like cost and schedule. Meanwhile, provide the staffing, resources, and timescale needed for a mature testing effort. Product Managers — Gather user requirements, define expected functionality, and support test planning. Developers — Follow secure coding practices, perform unit testing, enable automated testing, and respond to defects uncovered in testing. User experience designers — Consider quality and testability during UX design. Conduct user acceptance testing on prototypes. Information security — Perform security reviews of code, architectures, and configurations. Guide testing-relevant security use cases. Testers — Develop test cases based on user stories, execute testing, log defects, perform regression test fixes, and report on quality to stakeholders. Operations — Monitor systems once deployed, gather production issues, and report data to inform future testing. Customers — Voice your true quality expectations, participate in UAT, and report real-world issues once launched. As this illustrates, no one functional area owns quality alone. Testers contribute essential verification, but quality is truly everyone’s responsibility. 
Governance Breakdowns Lead to Quality Failures In a 2023 episode of the "Why Didn't You Test That?" podcast, Marcus Merrell, Huw Price, and I discussed how testing remains treated as a "janitorial" effort and cost center, and how you can align testing and quality. When organizations fail to acknowledge the shared ownership of software quality, governance issues arise that enable quality failures: Unrealistic deadlines — Attempting to achieve overly aggressive schedules often comes at the expense of quality and sufficient testing timelines. Leadership teams must balance market demands against release readiness. Insufficient investment — Success requires appropriate staffing and support for all areas that influence quality. These range from design and development to testing. Underinvestment leads to unhealthy tradeoffs. Lack of collaboration — Cross-functional coordination produces better quality than work done in silos. Governance policies should foster collaboration across product teams, not hinder it. Misaligned priorities — Leadership should incentivize balanced delivery, not just speed or cost savings. Quality cannot be someone else's problem. Lack of transparency — Progress reporting should incorporate real metrics on quality. Burying or obscuring defects undermines governance. Absence of risk management — Identifying and mitigating quality risks through appropriate action requires focus from project leadership. Lacking transparency about risk prevents proper governance. When these governance breakdowns occur, quality suffers, and failures follow. However, the root causes trace back to organizational leadership and culture, not solely the testing function. The Costs of Obscuring Systemic Issues Blaming testers for failures caused by systemic organizational issues leads to significant costs: Loss of trust — When testers become scapegoats, it erodes credibility and trust in the testing function, inhibiting their ability to advocate for product quality. Staff turnover — Testing teams experience higher turnover when the broader organization fails to recognize their contributions and value. Less collaboration — Other groups avoid collaborating with testers perceived as bottlenecks or impediments rather than partners. Reinventing the wheel — Lessons from past governance breakdowns go unlearned, leading those issues to resurface in new forms down the line. Poorer customer experiences — Ultimately, obscuring governance issues around quality leads to more negative customer experiences that damage an organization's reputation and bottom line. Taking Ownership of Software Quality Elevating quality as an organization-wide responsibility is essential for governance, transparency, and risk management. Quality cannot be the burden of one isolated function, and leadership should foster a culture that values quality intrinsically, rather than viewing it as an afterthought or checkbox. To build ownership, organizations need to shift testing upstream, integrating it earlier into requirements planning, design reviews, and development processes. It also requires modernizing the testing practice itself, utilizing the full range of innovation available: from test automation, shift-left testing, and service virtualization, to risk-based test case generation, modeling, and generative AI. With a shared understanding of who owns quality, governance policies can better balance competing demands around cost, schedule, capabilities, and release readiness.
Testing insights will inform smarter tradeoffs, avoiding quality failures and the finger-pointing that follows them today. This future state reduces the likelihood of failures — but it also acknowledges that some failures will still occur despite best efforts. In these cases, organizations must have a governance model to transparently identify root causes across teams, learn from them, and prevent recurrence. In a culture that values quality intrinsically, software testers earn their place as trusted advisors, rather than being relegated to fault-finders. They can provide oversight and validation of other teams' work without fear of backlash. And their expertise will strengthen rather than threaten collaborative delivery. With shared ownership, quality ceases to be a "tester problem" at all. It becomes an organizational value that earns buy-in across functional areas. Leadership sets the tone for an understanding that if quality is everyone's responsibility, then so too is failure.
Failures in software systems are inevitable. How these failures are handled can significantly impact system performance, reliability, and the business's bottom line. In this post, I want to discuss the upside of failure: why you should seek failure, why failure is good, and why avoiding failure can reduce the reliability of your application. We will start with a discussion of fail-fast vs. fail-safe, which will take us to a second discussion about failures in general. As a side note, if you like the content of this and the other posts in this series, check out my Debugging book, which covers this subject. If you have friends who are learning to code, I'd appreciate a reference to my Java Basics book. If you want to get back to Java after a while, check out my Java 8 to 21 book. Fail-Fast Fail-fast systems are designed to immediately stop functioning upon encountering an unexpected condition. This immediate failure helps to catch errors early, making debugging more straightforward. The fail-fast approach ensures that errors are caught immediately. For example, in the world of programming languages, Java embodies this approach by producing a NullPointerException instantly when a null value is dereferenced, stopping the system, and making the error clear. This immediate response helps developers identify and address issues quickly, preventing them from becoming more serious. By catching and stopping errors early, fail-fast systems reduce the risk of cascading failures, where one error leads to others. This makes it easier to contain and resolve issues before they spread through the system, preserving overall stability. It is easy to write unit and integration tests for fail-fast systems. This advantage is even more pronounced when we need to understand the test failure. Fail-fast systems usually point directly at the problem in the error stack trace. However, fail-fast systems carry their own risks, particularly in production environments: Production disruptions: If a bug reaches production, it can cause immediate and significant disruptions, potentially impacting both system performance and the business's operations. Risk appetite: Fail-fast systems require a level of risk tolerance from both engineers and executives. They need to be prepared to handle and address failures quickly, often balancing this with potential business impacts. Fail-Safe Fail-safe systems take a different approach, aiming to recover and continue even in the face of unexpected conditions. This makes them particularly suited for uncertain or volatile environments. Microservices are a prime example of fail-safe systems, embracing resiliency through their architecture. Circuit breakers, both physical and software-based, disconnect failing functionality to prevent cascading failures, helping the system continue operating. Fail-safe designs ensure that systems can survive even harsh production environments, reducing the risk of catastrophic failure. This makes them particularly suited for mission-critical applications, such as in hardware devices or aerospace systems, where smooth recovery from errors is crucial. However, fail-safe systems have downsides: Hidden errors: By attempting to recover from errors, fail-safe systems can delay the detection of issues, making them harder to trace and potentially leading to more severe cascading failures. Debugging challenges: The delayed nature of errors can complicate debugging, requiring more time and effort to find and resolve issues.
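As a minimal illustration of the difference (a sketch, not code from the article), here is the same lookup implemented both ways in Python: the fail-fast version raises immediately on bad input so the stack trace points at the real problem, while the fail-safe wrapper recovers with a fallback value and only leaves a log line behind.

Python

class UserServiceError(Exception):
    pass

def get_display_name_fail_fast(user):
    # Fail-fast: surface the problem immediately instead of letting bad data propagate.
    if user is None or "name" not in user:
        raise UserServiceError(f"invalid user record: {user!r}")
    return user["name"]

def get_display_name_fail_safe(user, fallback="guest"):
    # Fail-safe: recover and keep running, at the cost of hiding the error.
    try:
        return get_display_name_fail_fast(user)
    except UserServiceError as err:
        print(f"WARN: falling back to '{fallback}': {err}")
        return fallback

if __name__ == "__main__":
    print(get_display_name_fail_safe({"name": "Ada"}))  # Ada
    print(get_display_name_fail_safe(None))             # guest, error hidden behind a warning
    get_display_name_fail_fast(None)                     # raises immediately with a clear trace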
Choosing Between Fail-Fast and Fail-Safe It's challenging to determine which approach is better, as both have their merits. Fail-fast systems offer immediate debugging, lower risk of cascading failures, and quicker detection and resolution of bugs. This helps catch and fix issues early, preventing them from spreading. Fail-safe systems handle errors gracefully, making them better suited for mission-critical systems and volatile environments, where catastrophic failures can be devastating. Balancing Both To leverage the strengths of each approach, a balanced strategy can be effective: Fail-fast for local services: When invoking local services like databases, fail-fast can catch errors early, preventing cascading failures. Fail-safe for remote resources: When relying on remote resources, such as external web services, fail-safe can prevent disruptions from external failures. A balanced approach also requires clear and consistent implementation throughout coding, reviews, tooling, and testing processes, ensuring it is integrated seamlessly. Fail-fast can integrate well with orchestration and observability. Effectively, this moves the fail-safe aspect to the OPS layer instead of the developer layer. Consistent Layer Behavior This is where things get interesting. It isn't about choosing between fail-safe and fail-fast. It's about choosing the right layer for them. For example, if an error is handled in a deep layer using a fail-safe approach, it won't be noticed. This might be OK, but if that error has an adverse impact (performance, garbage data, corruption, security, etc.), then we will have a problem later on and won't have a clue. The right solution is to handle all errors in a single layer; in modern systems, the top layer is the OPS layer, and it makes the most sense. It can report the error to the engineers who are most qualified to deal with it, and they can also provide immediate mitigation such as restarting a service, allocating additional resources, or reverting a version. Retries Are Not Fail-Safe Recently, I was at a lecture where the speakers presented their updated cloud architecture. They chose to take a shortcut to microservices by using a framework that allows them to retry in the case of failure. Unfortunately, failure doesn't behave the way we would like. You can't eliminate it completely through testing alone. Retry isn't fail-safe. In fact, it can mean catastrophe. They tested their system and "it works", even in production. But if a catastrophic situation does occur, their retry mechanism can operate as a denial-of-service attack against their own servers. The number of ways in which ad-hoc architectures such as this can fail is mind-boggling. This is especially important once we redefine failures. Redefining Failure Failures in software systems aren't just about crashes. A crash can be seen as a simple and immediate failure, but there are more complex issues to consider. In fact, crashes in the age of containers are probably the best failures. A system restarts seamlessly with barely an interruption. Data Corruption Data corruption is far more severe and insidious than a crash. It carries with it long-term consequences. Corrupted data can lead to security and reliability problems that are challenging to fix, requiring extensive reworking and potentially unrecoverable data.
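The self-inflicted denial-of-service scenario described under "Retries Are Not Fail-Safe" above is easy to demonstrate with a small sketch: an unbounded, immediate retry loop multiplies the load on an already failing dependency, whereas a bounded retry with exponential backoff and jitter keeps the extra pressure limited and eventually hands the failure to the layer that should own it. The function names and numbers below are illustrative.

Python

import random
import time

def call_dependency():
    # Simulate an outage of a remote dependency.
    raise TimeoutError("dependency unavailable")

def naive_retry():
    # Unbounded, immediate retries: every caller hammers the failing service.
    while True:
        try:
            return call_dependency()
        except TimeoutError:
            continue

def bounded_retry(max_attempts=3, base_delay=0.1):
    for attempt in range(max_attempts):
        try:
            return call_dependency()
        except TimeoutError:
            if attempt == max_attempts - 1:
                # Give up and let the OPS layer decide on mitigation.
                raise
            # Exponential backoff with jitter bounds the pressure on the dependency.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))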
Cloud computing has led to defensive programming techniques, like circuit breakers and retries, emphasizing comprehensive testing and logging to catch and handle failures gracefully. In a way, this environment sent us back in terms of quality. A fail-fast approach at the data level could stop such corruption from happening. Addressing a bug goes beyond a simple fix. It requires understanding its root cause and preventing recurrence, extending into comprehensive logging, testing, and process improvements. This ensures that the bug is fully addressed, reducing the chances of it reoccurring. Don't Fix the Bug If it's a bug in production, you should probably revert, provided you can instantly revert production. Instant rollback should always be possible, and if it isn't, that is something you should work on. Failures must be fully understood before a fix is undertaken. In my own companies, I often skipped that step due to pressure; in a small startup, that is forgivable. In larger companies, we need to understand the root cause. A culture of debriefing for bugs and production issues is essential. The fix should also include process mitigation that prevents similar issues from reaching production. Debugging Failure Fail-fast systems are much easier to debug. They have an inherently simpler architecture, and it is easier to pinpoint an issue to a specific area. It is crucial to throw exceptions even for minor violations (e.g., validations). This prevents the cascading types of bugs that prevail in loose systems. This should be further enforced by unit tests that verify the limits we define and verify that proper exceptions are thrown. Retries should be avoided in the code, as they make debugging exceptionally difficult, and their proper place is in the OPS layer. To facilitate that further, timeouts should be short by default. Avoiding Cascading Failure Failure isn't something we can avoid, predict, or fully test against. The only thing we can do is soften the blow when a failure occurs. Often this "softening" is achieved by using long-running tests meant to replicate extreme conditions as much as possible with the goal of finding our application's weak spots. This is rarely enough; robust systems need to revise these tests often based on real production failures. A great example of a fail-safe would be a cache of REST responses that lets us keep working even when a service is down. Unfortunately, this can lead to complex niche issues such as cache poisoning or a situation in which a banned user still has access due to the cache. Hybrid in Production Fail-safe is best applied only in production/staging and in the OPS layer. This reduces the number of changes between production and dev (we want them to be as similar as possible), yet it's still a change that can negatively impact production. However, the benefits are tremendous, as observability can get a clear picture of system failures. The discussion here is a bit colored by my more recent experience of building observable cloud architectures. However, the same principle applies to any type of software, whether embedded or in the cloud. In such cases, we often choose to implement fail-safe in the code; in that case, I would suggest implementing it consistently and consciously in a specific layer. There's also a special case of libraries/frameworks that often provide inconsistent and badly documented behaviors in these situations. I myself am guilty of such inconsistency in some of my work. It's an easy mistake to make.
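As a small, hypothetical example of the practice described under "Debugging Failure" above, the pytest-style snippet below pins down a fail-fast validation (a short, bounded timeout) so that loosening the limit later has to be a deliberate change rather than an accident.

Python

import pytest

def set_timeout(seconds):
    # Short timeouts by default; reject nonsense values instead of silently "fixing" them.
    if seconds <= 0 or seconds > 30:
        raise ValueError(f"timeout out of range: {seconds}")
    return seconds

def test_valid_timeout_is_accepted():
    assert set_timeout(5) == 5

def test_invalid_timeouts_fail_fast():
    with pytest.raises(ValueError):
        set_timeout(0)
    with pytest.raises(ValueError):
        set_timeout(120)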
Final Word This is my last post in the theory of debugging series that's part of my book/course on debugging. We often think of debugging as the action we take when something fails; it isn't. Debugging starts the moment we write the first line of code. We make decisions that will impact the debugging process as we code, and often we're just unaware of these decisions until we get a failure. I hope this post and series will help you write code that is prepared for the unknown. Debugging, by its nature, deals with the unexpected. Tests can't help with that. But as I illustrated in my previous posts, there are many simple practices we can undertake that make it easier to prepare. This isn't a one-time process; it's an iterative process that requires re-evaluation of decisions made as we encounter failure.
In the dynamic landscape of modern technology, the realm of Incident Management stands as a crucible where professionals are tested and refined. Incidents, ranging from minor hiccups to critical system failures, are not mere disruptions but opportunities for growth and learning. Within this crucible, we have traversed the challenging terrain of Incident Management. The collective experiences and insights offer a treasure trove of wisdom, illuminating the path for personal and professional development. In this article, we delve deep into the core principles and lessons distilled from the crucible of Incident Management. Beyond the technical intricacies lies a tapestry of skills and virtues—adaptability, resilience, effective communication, collaborative teamwork, astute problem-solving, and a relentless pursuit of improvement. These are the pillars upon which successful incident response is built, shaping not just careers but entire mindsets and approaches to life's challenges. Through real-world anecdotes and practical wisdom, we unravel the transformative power of Incident Management. Join us on this journey of discovery, where each incident is not just a problem to solve but a stepping stone towards personal and professional excellence. Incident Management Essentials: Navigating Through Challenges Incident Management is a multifaceted discipline that requires a strategic approach and a robust set of skills to navigate through various challenges effectively. At its core, Incident Management revolves around the swift and efficient resolution of unexpected issues that can disrupt services, applications, or systems. One of the fundamental aspects of Incident Management is the ability to prioritize incidents based on their impact and severity. This involves categorizing incidents into different levels of urgency and criticality, akin to triaging patients in a hospital emergency room. By prioritizing incidents appropriately, teams can allocate resources efficiently, focus efforts where they are most needed, and minimize the overall impact on operations and user experience. Clear communication channels are another critical component of Incident Management. Effective communication ensures that all stakeholders, including technical teams, management, customers, and other relevant parties, are kept informed throughout the incident lifecycle. Transparent and timely communication not only fosters collaboration but also instills confidence in stakeholders that the situation is being addressed proactively. Collaboration and coordination are key pillars of successful incident response. Incident Management often involves cross-functional teams working together to diagnose, troubleshoot, and resolve issues. Collaboration fosters collective problem-solving, encourages knowledge sharing, and enables faster resolution times. Additionally, establishing well-defined roles, responsibilities, and escalation paths ensures a streamlined and efficient response process. Proactive monitoring and alerting systems play a crucial role in Incident Management. Early detection of anomalies, performance issues, or potential failures allows teams to intervene swiftly before they escalate into full-blown incidents. Implementing robust monitoring tools, setting up proactive alerts, and conducting regular health checks are essential proactive measures to prevent incidents or mitigate their impact. Furthermore, incident documentation and post-mortem analysis are integral parts of Incident Management. 
Documenting incident details, actions taken, resolutions, and lessons learned not only provides a historical record but also facilitates continuous improvement. Post-incident analysis involves conducting a thorough root cause analysis, identifying contributing factors, and implementing corrective measures to prevent similar incidents in the future. In essence, navigating through challenges in Incident Management requires a blend of technical expertise, strategic thinking, effective communication, collaboration, proactive monitoring, and a culture of continuous improvement. By mastering these essentials, organizations can enhance their incident response capabilities, minimize downtime, and deliver superior customer experiences. Learning from Challenges: The Post-Incident Analysis The post-incident analysis phase is a critical component of Incident Management that goes beyond resolving the immediate issue. It serves as a valuable opportunity for organizations to extract meaningful insights, drive continuous improvement, and enhance resilience against future incidents. Here are several key points to consider during the post-incident analysis: Root Cause Analysis (RCA) Conducting a thorough RCA is essential to identify the underlying factors contributing to the incident. This involves tracing back the chain of events, analyzing system logs, reviewing configurations, and examining code changes to pinpoint the root cause accurately. RCA helps in addressing the core issues rather than just addressing symptoms, thereby preventing recurrence. Lessons Learned Documentation Documenting lessons learned from each incident is crucial for knowledge management and organizational learning. Capture insights, observations, and best practices discovered during the incident response process. This documentation serves as a valuable resource for training new team members, refining incident response procedures, and avoiding similar pitfalls in the future. Process Improvement Recommendations Use the findings from post-incident analysis to recommend process improvements and optimizations. This could include streamlining communication channels, revising incident response playbooks, enhancing monitoring and alerting thresholds, automating repetitive tasks, or implementing additional failover mechanisms. Continuous process refinement ensures a more effective and efficient incident response framework. Cross-Functional Collaboration Involve stakeholders from various departments, including technical teams, management, quality assurance, and customer support, in the post-incident analysis discussions. Encourage open dialogue, share insights, and solicit feedback from diverse perspectives. Collaborative analysis fosters a holistic understanding of incidents and promotes collective ownership of incident resolution and prevention efforts. Implementing Corrective and Preventive Actions (CAPA) Based on the findings of the post-incident analysis, prioritize and implement corrective actions to address immediate vulnerabilities or gaps identified. Additionally, develop preventive measures to mitigate similar risks in the future. CAPA initiatives may include infrastructure upgrades, software patches, security enhancements, or policy revisions aimed at strengthening resilience and reducing incident frequency. Continuous Monitoring and Feedback Loop Establish a continuous monitoring mechanism to track the effectiveness of implemented CAPA initiatives. 
Monitor key metrics such as incident recurrence rates, mean time to resolution (MTTR), customer satisfaction scores, and overall system stability. Solicit feedback from stakeholders and iterate on improvements to refine incident response capabilities over time. By embracing a comprehensive approach to post-incident analysis, organizations can transform setbacks into opportunities for growth, innovation, and enhanced operational excellence. The insights gleaned from each incident serve as stepping stones towards building a more resilient and proactive incident management framework. Enhancing Post-Incident Analysis With AI The integration of Artificial Intelligence is revolutionizing post-incident analysis (PIA), offering advanced capabilities that significantly augment traditional approaches. Here's how AI can elevate the PIA process: Pattern Recognition and Incident Detection AI algorithms excel in analyzing extensive historical data to identify patterns indicative of potential incidents. By detecting anomalies in system behavior or recognizing error patterns in logs, AI efficiently flags potential incidents for further investigation. This automated incident detection streamlines identification efforts, reducing manual workload and response times. Advanced Root Cause Analysis (RCA) AI algorithms are adept at processing complex data sets and correlating multiple variables. In RCA, AI plays a pivotal role in pinpointing the root cause of incidents by analyzing historical incident data, system logs, configuration changes, and performance metrics. This in-depth analysis facilitated by AI accelerates the identification of underlying issues, leading to more effective resolutions and preventive measures. Predictive Analysis and Proactive Measures Leveraging historical incident data and trends, AI-driven predictive analysis forecasts potential issues or vulnerabilities. By identifying emerging patterns or risk factors, AI enables proactive measures to mitigate risks before they escalate into incidents. This proactive stance not only reduces incident frequency and severity but also enhances overall system reliability and stability. Continuous Improvement via AI Insights AI algorithms derive actionable insights from post-incident analysis data. By evaluating the effectiveness of implemented corrective and preventive actions (CAPA), AI offers valuable feedback on intervention impact. These insights drive ongoing process enhancements, empowering organizations to refine incident response strategies, optimize resource allocation, and continuously enhance incident management capabilities. Integrating AI into post-incident analysis empowers organizations with data-driven insights, automation of repetitive tasks, and proactive risk mitigation, fostering a culture of continuous improvement and resilience in Incident Management. Applying Lessons Beyond Work: Personal Growth and Resilience The skills and lessons gained from Incident Management are highly transferable to various aspects of life. For instance, adaptability is crucial not only in responding to technical issues but also in adapting to changes in personal circumstances or professional environments. Teamwork teaches collaboration, conflict resolution, and empathy, which are essential in building strong relationships both at work and in personal life. Problem-solving skills honed during incident response can be applied to tackle challenges in any domain, from planning a project to resolving conflicts.
Resilience, the ability to bounce back from setbacks, is a valuable trait that helps individuals navigate through adversity with determination and a positive mindset. Continuous improvement is a mindset that encourages individuals to seek feedback, reflect on experiences, identify areas for growth, and strive for excellence. This attitude of continuous learning and development not only benefits individuals in their careers but also contributes to personal fulfillment and satisfaction. Dispelling Misconceptions: What Lessons Learned Isn't We highlight common misconceptions about lessons learned, clarifying that it's not about: Emergency mindset: Lessons learned don't advocate for a perpetual emergency mindset but emphasize preparedness and maintaining a healthy, sustainable pace in incident response and everyday operations. Assuming all situations are crises: It's essential to discern between true emergencies and everyday challenges, avoiding unnecessary stress and overreaction to non-critical issues. Overemphasis on structure and protocol: While structure and protocols are important, rigid adherence can stifle flexibility and outside-the-box thinking. Lessons learned encourage a balance between following established procedures and embracing innovation. Decisiveness at the expense of deliberation: Rapid decision-making is crucial during incidents, but rushing decisions can lead to regrettable outcomes. It's about finding the right balance between acting swiftly and ensuring thorough deliberation to avoid hasty or ill-informed decisions. Short-term focus: Lessons learned extend beyond immediate goals and short-term fixes. It promotes a long-term perspective, strategic planning, and continuous improvement to address underlying issues and prevent recurring incidents. Minimizing risk to the point of stagnation: While risk mitigation is important, excessive risk aversion can lead to missed opportunities for growth and innovation. Lessons learned encourage a proactive approach to risk management that balances security with strategic decision-making. One-size-fits-all approach: Responses to incidents and lessons learned should be tailored to the specific circumstances and individuals involved. Avoiding a one-size-fits-all approach ensures that solutions are effective, relevant, and scalable across diverse scenarios. Embracing Growth: Conclusion In conclusion, Incident Management is more than just a set of technical processes or procedures. It's a mindset, a culture, and a journey of continuous growth and improvement. By embracing the core principles of adaptability, communication, teamwork, problem-solving, resilience, and continuous improvement, individuals can not only excel in their professional roles but also lead more fulfilling and meaningful lives.
It is crucial to guarantee the availability and resilience of vital infrastructure in the current digital environment. Kubernetes, the preferred platform for container orchestration, offers scalability, flexibility, and resilience. But much like any technology, Kubernetes clusters can fail—from natural calamities to hardware malfunctions. Implementing a disaster backup strategy is necessary to limit the risk of data loss and downtime. In this article, we'll look at how to set up disaster backup for a Kubernetes cluster. Understanding the Importance of Disaster Backup Before delving into the implementation details, let's underscore why disaster backup is crucial for Kubernetes clusters: 1. Data Protection Data loss prevention: A disaster backup strategy ensures that critical data stored within Kubernetes clusters is protected against loss due to unforeseen events. Compliance requirements: Many industries have strict data retention and recovery regulations. Implementing disaster backup helps organizations meet compliance standards. 2. Business Continuity Minimize downtime: With a robust backup strategy in place, organizations can quickly recover from disasters, minimizing downtime and maintaining business continuity. Reputation management: Rapid recovery from disasters helps uphold the organization's reputation and customer trust. 3. Risk Mitigation Identifying vulnerabilities: Disaster backup planning involves identifying vulnerabilities within the Kubernetes infrastructure and addressing them proactively. Cost savings: While implementing disaster backup incurs initial costs, it can save significant expenses associated with downtime and data loss in the long run. Implementing Disaster Backup for a Kubernetes Cluster Now, let's outline a step-by-step approach to implementing disaster backup for a Kubernetes cluster: 1. Backup Strategy Design Define Recovery Point Objective (RPO) and Recovery Time Objective (RTO): Determine the acceptable data loss and downtime thresholds for your organization. Select backup tools: Choose appropriate backup tools compatible with Kubernetes, such as Velero, Kasten K10, or OpenEBS. Backup frequency: Decide on the frequency of backups based on the RPO and application requirements. 2. Backup Configuration Identify critical workloads: Prioritize backup configurations for critical workloads and persistent data. Backup storage: Set up reliable backup storage solutions, such as cloud object storage (e.g., Amazon S3, Google Cloud Storage) or on-premises storage with redundancy. Retention policies: Define retention policies for backups to ensure optimal storage utilization and compliance. 3. Testing and Validation Regular testing: Conduct regular backup and restore tests to validate the effectiveness of the disaster recovery process. Automated testing: Implement automated testing procedures to simulate disaster scenarios and assess the system's response. 4. Monitoring and Alerting Monitoring tools: Utilize monitoring tools like Prometheus and Grafana to track backup status, storage utilization, and performance metrics. Alerting mechanisms: Configure alerting mechanisms to notify administrators of backup failures or anomalies promptly (a simple backup-freshness check is sketched at the end of this article). 5. Documentation and Training Comprehensive documentation: Document the disaster backup procedures, including backup schedules, recovery processes, and contact information for support.
Training sessions: Conduct training sessions for relevant personnel to ensure they understand their roles and responsibilities during disaster recovery efforts. Implementing a disaster backup strategy is critical for safeguarding Kubernetes clusters against unforeseen events. By following the steps outlined in this guide, organizations can enhance data protection, ensure business continuity, and mitigate risks effectively. Remember, proactive planning and regular testing are key to maintaining the resilience of Kubernetes infrastructure in the face of disasters. Ensure the safety and resilience of your Kubernetes cluster today by implementing a robust disaster backup strategy! Additional Considerations 1. Geographic Redundancy Multi-region deployment: Consider deploying Kubernetes clusters across multiple geographic regions to enhance redundancy and disaster recovery capabilities. Geo-replication: Utilize geo-replication features offered by cloud providers to replicate data across different regions for improved resilience. 2. Disaster Recovery Drills Regular drills: Conduct periodic disaster recovery drills to evaluate the effectiveness of backup and recovery procedures under real-world conditions. Scenario-based testing: Simulate various disaster scenarios, such as network outages or data corruption, to identify potential weaknesses in the disaster recovery plan. 3. Continuous Improvement Feedback mechanisms: Establish feedback mechanisms to gather insights from disaster recovery drills and real-world incidents, enabling continuous improvement of the backup strategy. Technology evaluation: Stay updated with the latest advancements in backup and recovery technologies for Kubernetes to enhance resilience and efficiency. Future Trends and Innovations As Kubernetes continues to evolve, so do the methodologies and technologies associated with disaster backup and recovery. Some emerging trends and innovations in this space include: Immutable infrastructure: Leveraging immutable infrastructure principles to ensure that backups are immutable and tamper-proof, enhancing data integrity and security. Integration with AI and ML: Incorporating artificial intelligence (AI) and machine learning (ML) algorithms to automate backup scheduling, optimize storage utilization, and predict potential failure points. Serverless backup solutions: Exploring serverless backup solutions that eliminate the need for managing backup infrastructure, reducing operational overhead and complexity. By staying abreast of these trends and adopting innovative approaches, organizations can future-proof their disaster backup strategies and effectively mitigate risks in an ever-changing landscape. Final Thoughts In an era characterized by digital transformation and an unparalleled dependence on cloud-native technologies such as Kubernetes, the significance of disaster backup cannot be overstated. Investing in strong backup and recovery procedures is crucial for organizations navigating the complexity of contemporary IT infrastructures in order to protect sensitive data and guarantee continuous business operations. Remember that disaster recovery is a continuous process rather than a one-time event. By adopting best practices, utilizing cutting-edge technologies, and cultivating a resilient culture, organizations can handle even the most difficult situations with confidence and agility.
By taking preventative action now, you can safeguard your Kubernetes cluster against future catastrophes and provide the foundation for a robust and successful future!
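Following up on the monitoring and alerting step above, here is a hedged sketch of a backup-freshness check. It assumes Velero keeps its Backup resources in the default "velero" namespace and that kubectl is configured for the cluster; the RPO value and the alerting mechanism (a non-zero exit code that an external scheduler or monitor can act on) are illustrative.

Python

import json
import subprocess
import sys
from datetime import datetime, timedelta, timezone

RPO = timedelta(hours=24)  # illustrative recovery point objective

def latest_completed_backup_time():
    # List Velero Backup custom resources and pick the newest completed one.
    output = subprocess.check_output(
        ["kubectl", "get", "backups.velero.io", "-n", "velero", "-o", "json"]
    )
    items = json.loads(output).get("items", [])
    completed = [
        datetime.fromisoformat(item["status"]["completionTimestamp"].replace("Z", "+00:00"))
        for item in items
        if item.get("status", {}).get("phase") == "Completed"
        and item.get("status", {}).get("completionTimestamp")
    ]
    return max(completed, default=None)

if __name__ == "__main__":
    latest = latest_completed_backup_time()
    if latest is None or datetime.now(timezone.utc) - latest > RPO:
        print(f"ALERT: no completed backup within the RPO (latest: {latest})")
        sys.exit(1)
    print(f"OK: latest completed backup at {latest}")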
Let's imagine we have an app installed on a Linux server in the cloud. This app uses a list of user proxies to establish an internet connection through them and perform operations with online resources. The Problem Sometimes, the app has connection errors. These errors are common, but it's unclear whether they stem from a bug in the app, issues with the proxies, network/OS conditions on the server (where the app is running), or just specific cases that don't generate a particular error message. These errors only occur sometimes and not with every proxy but with many different ones (SSH, SOCKS, HTTP(S), with and without UDP), providing no direct clues that the proxies are the cause. Additionally, it happens at a specific time of day (but this might be a coincidence). The only information available is a brief report from a user, lacking details. Short tests across different environments with various proxies and network conditions haven't reproduced the problem, but the user claims it still occurs. The Solution Rent the same server with the same configuration. Install the same version of the app. Run tests for 24+ hours to emulate the user's actions. Gather as much information as possible (all logs – app logs, user (test) action logs, used proxies, etc.) in a way that makes it possible to match IDs and obtain technical details in case of errors. The Task Write some tests with logs. Find a way to save all the log data. To make it more challenging, I'll introduce a couple of additional obstacles and assume limited resources and a deadline. By the way, this scenario is based on a real-world experience of mine, with slight twists and some details omitted (which are not important for the point). Testing Scripts and Your Logs I'll start with the simplest, most intuitive method for beginner programmers: when you perform actions in your scripts, you need to log specific information:

Python

output_file_path = "output_test_script.txt"

def start():
    # your function logic; 'response' and 'uuid' come from it
    print(f'start: {response.content}')
    with open(output_file_path, "a") as file:
        file.write(f'uuid is {uuid} -- {response.content} \n')

def stop():
    # your function logic
    print(local_api_data_stop, local_api_stop_response.content)
    with open(output_file_path, "a") as file:
        file.write(f'{uuid} -- {response.content} \n')

# your other functions and logic

if __name__ == "__main__":
    # truncate the output file at the start of a run
    with open(output_file_path, "w") as file:
        pass

Continuing, you can use print statements and save information on actions, responses, IDs, counts, etc. This approach is straightforward, simple, and direct, and it will work in many cases. However, logging everything in this manner is not considered best practice. Instead, you can utilize the built-in logging module for a more structured and efficient logging approach.

Python

import logging

# logger object
logger = logging.getLogger('example')
logger.setLevel(logging.DEBUG)

# file handler
fh = logging.FileHandler('example.log')
fh.setLevel(logging.DEBUG)

# console handler
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)

# formatter, set it for the handlers
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fh.setFormatter(formatter)
ch.setFormatter(formatter)

# add the handlers to the logger
logger.addHandler(fh)
logger.addHandler(ch)

# logging messages
logger.debug('a debug message')
logger.info('an info message')
logger.warning('a warning message')
logger.error('an error message')

See the logging module documentation for more details. The first task is DONE! Let's consider a case where your application has a debug log feature — it rotates with 3 files, each capped at 1MB.
Typically, this config is sufficient. But during extended testing sessions lasting 24+ hours, with heavy activity, you may find yourself losing valuable logs due to this configuration. To deal with this issue, you might modify the application to use larger debug log files. However, this would necessitate a new build and various other adjustments. This solution may indeed be optimal, yet there are instances where such a straightforward option isn't available. For example, you might use a server setup with restricted access, etc. In such cases, you may need to find alternative approaches or workarounds. Using Python, you can write a script to transfer information from the debug log to a log file without size restrictions or rotation limitations. A basic implementation could be as follows:

Python

import time

def read_new_lines(log_file, last_position):
    # read everything that was appended since the last position we saw
    with open(log_file, 'r') as file:
        file.seek(last_position)
        new_data = file.read()
        new_position = file.tell()
    return new_data, new_position

def copy_new_logs(log_file, output_log):
    last_position = 0
    while True:
        new_data, last_position = read_new_lines(log_file, last_position)
        if new_data:
            with open(output_log, 'a') as output_file:
                output_file.write(new_data)
        time.sleep(1)

source_log_file = 'debug.log'
output_log = 'combined_log.txt'
copy_new_logs(source_log_file, output_log)

Now let's assume that Python isn't an option on the server for some reason — perhaps installation isn't possible due to time constraints, permission limitations, or conflicts with the operating system and you don't know how to fix it. In such cases, using bash is the right choice:

Bash

#!/bin/bash

source_log_file="debug.log"
output_log="combined_log.txt"

copy_new_logs() {
    while true; do
        # -F keeps following the file across rotations; -n +1 starts from the beginning
        tail -F -n +1 "$source_log_file" >> "$output_log"
        sleep 1
    done
}

trap "echo 'Interrupted! Exiting...' && exit" SIGINT

copy_new_logs

The second task is DONE! With your detailed logs combined with the app's logs, you now have comprehensive debug information to understand the sequence of events. This includes IDs, test data, and the proxies used, along with the actions taken. You can run your scripts for long hours without constant supervision. The only task remaining is to analyze the debug logs to get statistics and potential info on the root cause of any issues, if they can even be replicated according to user reports. Some issues require thorough testing and detailed logging. By replicating the users' setup and running extensive tests, we can gather important data for pinpointing bugs. Whether using Python or bash scripts (or any other programming language), our focus on capturing detailed logs enables us to identify the root causes of errors and troubleshoot effectively. This highlights the importance of detailed logging in reproducing complex technical bugs and issues.
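As a hypothetical follow-up for the analysis step, the short script below makes a single pass over the combined log and counts errors per proxy so a 24-hour run can be summarized at a glance. The log line format (uuid -- proxy -- message) is an assumption; adjust the regular expression to match whatever your write() calls actually produce.

Python

import re
from collections import Counter

LOG_LINE = re.compile(r"^(?P<uuid>\S+) -- (?P<proxy>\S+) -- (?P<message>.*)$")

def error_counts(path):
    counts = Counter()
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            match = LOG_LINE.match(line.strip())
            if match and "error" in match.group("message").lower():
                counts[match.group("proxy")] += 1
    return counts

if __name__ == "__main__":
    for proxy, n in error_counts("combined_log.txt").most_common(10):
        print(f"{proxy}: {n} errors")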
User-defined functions (UDFs) are a very useful feature supported in SQL++ (UDF documentation). Couchbase 7.6 introduces improvements that allow for more debuggability and visibility into UDF execution. This blog will explore two new features in Couchbase 7.6 in the world of UDFs: Profiling for SQL++ statements executed in JavaScript UDFs EXPLAIN FUNCTION to access query plans of SQL++ statements within UDFs The examples in this blog require the travel-sample dataset to be installed. Documentation to install sample buckets Profiling SQL++ Executed in JavaScript UDFs Query profiling is a debuggability feature that SQL++ offers. When profiling is enabled for a statement’s execution, the result of the request includes a detailed execution tree with timing and metrics of each step of the statement’s execution. In addition to the profiling information being returned in the results of the statement, it can also be accessed for the request in the system:active_requests and system:completed_requests system keyspaces. To dive deeper into request profiling, see request profiling in SQL++. In Couchbase 7.0, profiling was included for subqueries. This included profiling subqueries that were within Inline UDFs. However, in versions before Couchbase 7.6, profiling was not extended to SQL++ statements within JavaScript UDFs. In earlier versions, to profile statements within a JavaScript UDF, the user would be required to open up the function’s definition, individually run each statement within the UDF, and collect their profiles. This additional step will no longer be needed in 7.6.0! Now, when profiling is enabled, if the statement contains JavaScript UDF execution, profiles for all SQL++ statements executed in the UDF will also be collected. This UDF-related profiling information will be available in the request output, system:active_requests and system:completed_requests system keyspaces as well. Example 1 Create a JavaScript UDF “js1” in a global library “lib1” via the REST endpoint or via the UI. JavaScript function js1() { var query = SELECT * FROM default:`travel-sample`.inventory.airline LIMIT 1; var res = []; for (const row of query) { res.push(row); } query.close() return res; } Create the corresponding SQL++ function. SQL CREATE FUNCTION js1() LANGUAGE JAVASCRIPT AS "js1" AT "lib1"; Execute the UDF with profiling enabled. SQL EXECUTE FUNCTION js1(); The response to the statement above will contain the following: In the profile section of the returned response, the executionTimings subsection contains a field ~udfStatements. ~udfStatements: An array of profiling information that contains an entry for every SQL++ statement within the JavaScript UDF Every entry within the ~udfStatements section contains: executionTimings: This is the execution tree for the statement. It has metrics and timing information for every step of the statement’s execution. statement: The statement string function: This is the name of the function where the statement was executed and is helpful to identify the UDF that executed the statement when there are nested UDF executions. 
JavaScript { "requestID": "2c5576b5-f01d-445f-a35b-2213c606f394", "signature": null, "results": [ [ { "airline": { "callsign": "MILE-AIR", "country": "United States", "iata": "Q5", "icao": "MLA", "id": 10, "name": "40-Mile Air", "type": "airline" } } ] ], "status": "success", "metrics": { "elapsedTime": "20.757583ms", "executionTime": "20.636792ms", "resultCount": 1, "resultSize": 310, "serviceLoad": 2 }, "profile": { "phaseTimes": { "authorize": "12.835µs", "fetch": "374.667µs", "instantiate": "27.75µs", "parse": "251.708µs", "plan": "9.125µs", "primaryScan": "813.249µs", "primaryScan.GSI": "813.249µs", "project": "5.541µs", "run": "27.925833ms", "stream": "26.375µs" }, "phaseCounts": { "fetch": 1, "primaryScan": 1, "primaryScan.GSI": 1 }, "phaseOperators": { "authorize": 2, "fetch": 1, "primaryScan": 1, "primaryScan.GSI": 1, "project": 1, "stream": 1 }, "cpuTime": "468.626µs", "requestTime": "2023-12-04T20:30:00.369+05:30", "servicingHost": "127.0.0.1:8091", "executionTimings": { "#operator": "Authorize", "#planPreparedTime": "2023-12-04T20:30:00.369+05:30", "#stats": { "#phaseSwitches": 4, "execTime": "1.918µs", "servTime": "1.125µs" }, "privileges": { "List": [] }, "~child": { "#operator": "Sequence", "#stats": { "#phaseSwitches": 2, "execTime": "2.208µs" }, "~children": [ { "#operator": "ExecuteFunction", "#stats": { "#itemsOut": 1, "#phaseSwitches": 4, "execTime": "22.375µs", "kernTime": "20.271708ms" }, "identity": { "name": "js1", "namespace": "default", "type": "global" } }, { "#operator": "Stream", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 2, "execTime": "26.375µs" }, "serializable": true } ] }, "~udfStatements": [ { "executionTimings": { "#operator": "Authorize", "#stats": { "#phaseSwitches": 4, "execTime": "2.626µs", "servTime": "7.166µs" }, "privileges": { "List": [ { "Priv": 7, "Props": 0, "Target": "default:travel-sample.inventory.airline" } ] }, "~child": { "#operator": "Sequence", "#stats": { "#phaseSwitches": 2, "execTime": "4.375µs" }, "~children": [ { "#operator": "PrimaryScan3", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 7, "execTime": "22.082µs", "kernTime": "1.584µs", "servTime": "791.167µs" }, "bucket": "travel-sample", "index": "def_inventory_airline_primary", "index_projection": { "primary_key": true }, "keyspace": "airline", "limit": "1", "namespace": "default", "optimizer_estimates": { "cardinality": 187, "cost": 45.28617059639748, "fr_cost": 12.1780009122802, "size": 12 }, "scope": "inventory", "using": "gsi" }, { "#operator": "Fetch", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 10, "execTime": "18.376µs", "kernTime": "797.542µs", "servTime": "356.291µs" }, "bucket": "travel-sample", "keyspace": "airline", "namespace": "default", "optimizer_estimates": { "cardinality": 187, "cost": 192.01699202888378, "fr_cost": 24.89848658838975, "size": 204 }, "scope": "inventory" }, { "#operator": "InitialProject", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 7, "execTime": "5.541µs", "kernTime": "1.1795ms" }, "discard_original": true, "optimizer_estimates": { "cardinality": 187, "cost": 194.6878862611588, "fr_cost": 24.912769445246838, "size": 204 }, "preserve_order": true, "result_terms": [ { "expr": "self", "star": true } ] }, { "#operator": "Limit", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 4, "execTime": "6.25µs", "kernTime": "333ns" }, "expr": "1", "optimizer_estimates": { "cardinality": 1, "cost": 24.927052302103924, "fr_cost": 24.927052302103924, "size": 204 } }, { "#operator": 
"Receive", "#stats": { "#phaseSwitches": 3, "execTime": "10.324833ms", "kernTime": "792ns", "state": "running" } } ] } }, "statement": "SELECT * FROM default:`travel-sample`.inventory.airline LIMIT 1;", "function": "default:js1" } ], "~versions": [ "7.6.0-N1QL", "7.6.0-1847-enterprise" ] } } } Query Plans With EXPLAIN FUNCTION SQL++ offers another wonderful capability to access the plan of a statement with the EXPLAIN statement. However, the EXPLAIN statement does not extend to plans of statements within UDFs, neither inline nor JavaScript UDFs. In earlier versions, to analyze the query plans for SQL++ within a UDF, it would require the user to open the function’s definition and individually run an EXPLAIN on all the statements within the UDF. These extra steps will be minimized in Couchbase 7.6 with the introduction of a new statement: EXPLAIN FUNCTION. This statement does exactly what EXPLAIN does, but for SQL++ statements within a UDF. Let’s explore how to use the EXPLAIN FUNCTION statement! Syntax explain_function ::= 'EXPLAIN' 'FUNCTION' function function refers to the name of the function. For more detailed information on syntax, please check out the documentation. Prerequisites To execute EXPLAIN FUNCTION, the user requires the correct RBAC permissions. To run EXPLAIN FUNCTION on a UDF, the user must have sufficient RBAC permissions to execute the function. The user must also have the necessary RBAC permissions to execute the SQL++ statements within the UDF function body as well. For more information, refer to the documentation regarding roles supported in Couchbase. Inline UDF EXPLAIN FUNCTION on an inline UDF will return the query plans of all the subqueries within its definition (see inline function documentation). Example 2: EXPLAIN FUNCTION on an Inline Function Create an inline UDF and run EXPLAIN FUNCTION on it. 
SQL CREATE FUNCTION inline1() { ( SELECT * FROM default:`travel-sample`.inventory.airport WHERE city = "Zachar Bay" ) }; SQL EXPLAIN FUNCTION inline1(); The results of the above statement will contain: function: The name of the function on which EXPLAIN FUNCTION was run plans: An array of plan information that contains an entry for every subquery within the inline UDF JavaScript { "function": "default:inline1", "plans": [ { "cardinality": 1.1176470588235294, "cost": 25.117642854609013, "plan": { "#operator": "Sequence", "~children": [ { "#operator": "IndexScan3", "bucket": "travel-sample", "index": "def_inventory_airport_city", "index_id": "2605c88c115dd3a2", "index_projection": { "primary_key": true }, "keyspace": "airport", "namespace": "default", "optimizer_estimates": { "cardinality": 1.1176470588235294, "cost": 12.200561852726496, "fr_cost": 12.179450078755286, "size": 12 }, "scope": "inventory", "spans": [ { "exact": true, "range": [ { "high": "\\"Zachar Bay\\"", "inclusion": 3, "index_key": "`city`", "low": "\\"Zachar Bay\\"" } ] } ], "using": "gsi" }, { "#operator": "Fetch", "bucket": "travel-sample", "keyspace": "airport", "namespace": "default", "optimizer_estimates": { "cardinality": 1.1176470588235294, "cost": 25.082370508382763, "fr_cost": 24.96843677065826, "size": 249 }, "scope": "inventory" }, { "#operator": "Parallel", "~child": { "#operator": "Sequence", "~children": [ { "#operator": "Filter", "condition": "((`airport`.`city`) = \\"Zachar Bay\\")", "optimizer_estimates": { "cardinality": 1.1176470588235294, "cost": 25.100006681495888, "fr_cost": 24.98421650449632, "size": 249 } }, { "#operator": "InitialProject", "discard_original": true, "optimizer_estimates": { "cardinality": 1.1176470588235294, "cost": 25.117642854609013, "fr_cost": 24.99999623833438, "size": 249 }, "result_terms": [ { "expr": "self", "star": true } ] } ] } } ] }, "statement": "select self.* from `default`:`travel-sample`.`inventory`.`airport` where ((`airport`.`city`) = \\"Zachar Bay\\")" } ] } JavaScript UDF SQL++ statements within JavaScript UDFs can be of two types as listed below. EXPLAIN FUNCTION works differently based on the way the SQL++ statement is called. Refer to the documentation to learn more about calling SQL++ in JavaScript functions. 1. Embedded SQL++ Embedded SQL++ is “embedded” in the function body and its detection is handled by the JavaScript transpiler. EXPLAIN FUNCTION can return query plans for embedded SQL++ statements. 2. SQL++ Executed by the N1QL() Function Call SQL++ can also be executed by passing a statement in the form of a string as an argument to the N1QL() function. When parsing the function for potential SQL++ statements to run the EXPLAIN on, it is difficult to get the dynamic string in the function argument. This can only be reliably resolved at runtime. With this reasoning, EXPLAIN FUNCTION does not return the query plans for SQL++ statements executed via N1QL() calls, but instead, returns the line numbers where the N1QL() function calls have been made. This line number is calculated from the beginning of the function definition. The user can then map the line numbers in the actual function definition and investigate further. Example 3: EXPLAIN FUNCTION on an External JavaScript Function Create a JavaScript UDF “js2” in a global library “lib1” via the REST endpoint or via the UI. 
JavaScript function js2() { // SQL++ executed by a N1QL() function call var query1 = N1QL("UPDATE default:`travel-sample` SET test = 1 LIMIT 1"); // Embedded SQL++ var query2 = SELECT * FROM default:`travel-sample` LIMIT 1; var res = []; for (const row of query2) { res.push(row); } query2.close() return res; } Create the corresponding SQL++ function. SQL CREATE FUNCTION js2() LANGUAGE JAVASCRIPT AS "js2" AT "lib1"; Run EXPLAIN FUNCTION on the SQL++ function. SQL EXPLAIN FUNCTION js2; The results of the statement above will contain: function: The name of the function on which EXPLAIN FUNCTION was run line_numbers: An array of line numbers calculated from the beginning of the JavaScript function definition where there are N1QL() function calls plans: An array of plan information that contains an entry for every embedded SQL++ statement within the JavaScript UDF JavaScript { "function": "default:js2", "line_numbers": [ 4 ], "plans": [ { "cardinality": 1, "cost": 25.51560885530435, "plan": { "#operator": "Authorize", "privileges": { "List": [ { "Target": "default:travel-sample", "Priv": 7, "Props": 0 } ] }, "~child": { "#operator": "Sequence", "~children": [ { "#operator": "Sequence", "~children": [ { "#operator": "Sequence", "~children": [ { "#operator": "PrimaryScan3", "index": "def_primary", "index_projection": { "primary_key": true }, "keyspace": "travel-sample", "limit": "1", "namespace": "default", "optimizer_estimates": { "cardinality": 31591, "cost": 5402.279801258844, "fr_cost": 12.170627071041082, "size": 11 }, "using": "gsi" }, { "#operator": "Fetch", "keyspace": "travel-sample", "namespace": "default", "optimizer_estimates": { "cardinality": 31591, "cost": 46269.39474997121, "fr_cost": 25.46387878667884, "size": 669 } }, { "#operator": "Parallel", "~child": { "#operator": "Sequence", "~children": [ { "#operator": "InitialProject", "discard_original": true, "optimizer_estimates": { "cardinality": 31591, "cost": 47086.49704894546, "fr_cost": 25.489743820991595, "size": 669 }, "preserve_order": true, "result_terms": [ { "expr": "self", "star": true } ] } ] } } ] }, { "#operator": "Limit", "expr": "1", "optimizer_estimates": { "cardinality": 1, "cost": 25.51560885530435, "fr_cost": 25.51560885530435, "size": 669 } } ] }, { "#operator": "Stream", "optimizer_estimates": { "cardinality": 1, "cost": 25.51560885530435, "fr_cost": 25.51560885530435, "size": 669 }, "serializable": true } ] } }, "statement": "SELECT * FROM default:`travel-sample` LIMIT 1 ;" } ] } Constraints If the N1QL() function has been aliased in a JavaScript function definition, EXPLAIN FUNCTION will not be able to return the line numbers where this aliased function was called.Example of such a function definition: JavaScript function js3() { var alias = N1QL; var q = alias("SELECT 1"); } If the UDF contains nested UDF executions, EXPLAIN FUNCTION does not support generating the query plans of SQL++ statements within these nested UDFs. Summary Couchbase 7.6 introduces new features to debug UDFs which will help users peek into UDF execution easily. Helpful References 1. Javascript UDFs: A guide to JavaScript UDFs Creating an external UDF 2. EXPLAIN statement