A developer's work is never truly finished once a feature or change is deployed. There is always a need for constant maintenance to ensure that a product or application continues to run as it should and is configured to scale. This Zone focuses on all your maintenance must-haves — from ensuring that your infrastructure is set up to manage various loads and improving software and data quality to tackling incident management, quality assurance, and more.
Software maintenance may require different approaches based on your business goals, the industry you operate in, the expertise of your tech team, and predicted market trends. Therefore, along with understanding the different types of software maintenance, you also have to explore the various software maintenance models. Based on the kind of problem you are trying to solve, your team can choose the right model from the following options:

1. Quick-Fix Model

A quick-fix model in software maintenance is a method for addressing bugs or issues in the software by prioritizing a fast resolution over a more comprehensive solution. This approach typically involves making a small, localized change to the software to address the immediate problem rather than fully understanding and addressing the underlying cause. Organizations adopt this maintenance approach only in emergency situations that call for quick resolutions. Under the quick-fix model, tech teams carry out the following software maintenance activities:

Annotate software changes with change IDs and code comments
Enter each change into a maintenance history detailing why the change was made and the techniques used
Note each location of a change and, if a change touches multiple points in the code, link them via the change ID

2. Iterative Enhancement Model

The iterative model is used for small-scale application modernization and scheduled maintenance. Generally, the business justification for changes is ignored in this approach because it involves only the software development team, not the business stakeholders. As a result, the software team will not know whether more significant changes will be required in the future, which is quite risky. The iterative enhancement model treats the application target as a known quantity and incorporates changes in the software based on an analysis of the existing system. The iterative model best suits changes made to confined application targets, with little cross-impact on other applications or organizations.

3. Reuse-Oriented Model

The reuse-oriented model identifies components of the existing system that are suitable for reuse in multiple places. In recent years, this model has also come to include creating components that can be reused in multiple applications of a system. There are three ways to apply the reuse-oriented model: object and function reuse, application system reuse, and component reuse.

Object and function reuse: This model reuses the software elements that implement a single, well-defined object.
Application system reuse: Under this model, developers can integrate new components into an application without making changes to the system or re-configuring it for a specific user.
Component reuse: Component reuse refers to using a pre-existing component rather than creating a new one in software development. This can include using pre-built code libraries, frameworks, or entire software applications.

4. Boehm's Model

Introduced in 1978, Boehm's model focuses on measuring characteristics to get non-tech stakeholders involved with the life cycle of software. The model represents a hierarchical structure of high-level, intermediate, and primitive characteristics of software that define its overall quality.

The high-level characteristics of quality software are:

Maintainability: It should be easy to understand, evaluate, and modify the processes in a system.
Portability: It should be easy to determine the most effective way to adapt the software to environmental changes.
As-is utility: It should be easy and effective to use the system as it is.

The intermediate-level characteristics represented by the model are the factors that validate the expected quality of a software system. These characteristics are:

Reliability: Software performs as expected, with zero defects.
Portability: The software can run in various environments and on different platforms.
Efficiency: The system makes optimum use of code, applications, and hardware resources.
Testability: The software can be tested easily, and users can trust the results.
Understandability: End users should be able to understand the functionality of the software easily and thus use it effectively.
Usability: The effort needed to learn, use, and comprehend different software functions should be minimal.

The primitive characteristics of quality software include basic features like device independence, accessibility, and accuracy.

5. Taute Maintenance Model

Developed by B.J. Taute in 1983, the Taute maintenance model helps development teams update and perform necessary modifications after the software is in production. The Taute model for software maintenance is carried out in the following phases:

Change request phase: The client sends a request to make changes to the software in a prescribed format.
Estimate phase: Developers conduct an impact analysis on the existing system to estimate the time and effort required to make the requested changes.
Schedule phase: The team aggregates the change requests for the upcoming scheduled release and creates the planning documents accordingly.
Programming phase: Requested changes are implemented in the source code, and all the relevant documents, like design documents and manuals, are updated accordingly.
Test phase: The software modifications are carefully analyzed. The code is tested using existing and new test cases, along with regression testing.
Documentation phase: Before the release, system and user documentation are prepared and updated based on the regression testing results, so developers can maintain the coherence of documents and code.
Release phase: The customer receives the new software product and updated documentation, and the system's end users perform acceptance testing.

Conclusion

Software maintenance is not just a necessary chore but an essential aspect of any successful software development project. By investing in ongoing maintenance and addressing issues as they arise, organizations can ensure that their software remains reliable, secure, and up to date. From bug fixes to performance optimizations, software maintenance is a crucial step in maximizing the value and longevity of your software. So don't overlook this critical aspect of software development — prioritize maintenance and keep your software running smoothly for years to come.
In the cloud-native era, we often hear that "security is job zero," which means it's even more important than any number one priority. Modern infrastructure and methodologies bring us enormous benefits, but, at the same time, since there are more moving parts, there are more things to worry about: How do you control access to your infrastructure? Between services? Who can access what? There are many questions to be answered, and many of them come down to policies: a set of security rules, criteria, and conditions. Examples:

Who can access this resource?
Which subnet is egress traffic allowed from?
Which clusters must a workload be deployed to?
Which protocols are not allowed for servers reachable from the Internet?
Which registries can binaries be downloaded from?
Which OS capabilities can a container execute with?
Which times of day can the system be accessed?

All organizations have policies since they encode important knowledge about complying with legal requirements, working within technical constraints, avoiding repeated mistakes, and so on. Since policies are so important today, let's dive deeper into how to best handle them in the cloud-native era.

Why Policy-as-Code?

Policies are based on written or unwritten rules that permeate an organization's culture. For example, there might be a written rule in our organizations explicitly saying: For servers accessible from the Internet on a public subnet, it's not a good practice to expose a port using the non-secure "HTTP" protocol. How do we enforce it? If we create infrastructure manually, the four-eyes principle may help: always have a second person review what you do when doing something critical. If we do Infrastructure as Code and create our infrastructure automatically with tools like Terraform, a code review could help. However, the traditional policy enforcement process has a few significant drawbacks:

You have no guarantee that a policy will never be broken. People can't be aware of all the policies at all times, and it's not practical to manually check against a list of policies. Even in code reviews, senior engineers will not catch every potential issue every single time.
Even if we had the best teams in the world enforcing policies with no exceptions, it's difficult, if not impossible, to scale. Modern organizations are more likely to be agile, which means the number of employees, services, and teams continues to grow. There is no way to physically staff a security team to protect all of those assets using traditional techniques.

Policies could be (and will be) breached sooner or later because of human error. It's not a question of "if" but "when." And that's precisely why most organizations (if not all) do regular security checks and compliance reviews before a major release, for example. We violate policies first and then create ex post facto fixes. I know, this doesn't sound right. What's the proper way of managing and enforcing policies, then? You've probably already guessed the answer, and you are right. Read on.

What Is Policy-as-Code (PaC)?

As business, teams, and maturity progress, we'll want to shift from manual policy definition to something more manageable and repeatable at the enterprise scale. How do we do that? First, we can learn from successful experiments in managing systems at scale:

Infrastructure-as-Code (IaC): treat the content that defines your environments and infrastructure as source code.
DevOps: the combination of people, process, and automation to achieve "continuous everything," continuously delivering value to end users.

Policy-as-Code (PaC) is born from these ideas. Policy as code uses code to define and manage policies, which are rules and conditions. Policies are defined, updated, shared, and enforced using code, leveraging Source Code Management (SCM) tools. By keeping policy definitions in source code control, whenever a change is made, it can be tested, validated, and then executed. The goal of PaC is not to detect policy violations but to prevent them. This leverages DevOps automation capabilities instead of relying on manual processes, allowing teams to move more quickly and reducing the potential for mistakes due to human error.

Policy-as-Code vs. Infrastructure-as-Code

The "as code" movement isn't new anymore; it aims at "continuous everything." The concept of PaC may sound similar to Infrastructure as Code (IaC), but while IaC focuses on infrastructure and provisioning, PaC improves security operations, compliance management, data management, and beyond. PaC can be integrated with IaC to automatically enforce infrastructural policies. Now that we've got the PaC vs. IaC question sorted out, let's look at the tools for implementing PaC.

Introduction to Open Policy Agent (OPA)

The Open Policy Agent (OPA, pronounced "oh-pa") is a Cloud Native Computing Foundation incubating project. It is an open-source, general-purpose policy engine that aims to provide a common framework for applying policy-as-code to any domain. OPA provides a high-level declarative language (Rego, pronounced "ray-go," purpose-built for policies) that lets you specify policy as code. As a result, you can define, implement, and enforce policies in microservices, Kubernetes, CI/CD pipelines, API gateways, and more. In short, OPA works in a way that decouples decision-making from policy enforcement. When a policy decision needs to be made, you query OPA with structured data (e.g., JSON) as input, then OPA returns the decision.

OK, less talk, more work: show me the code.

Simple Demo: Open Policy Agent Example

Prerequisites

To get started, download an OPA binary for your platform from GitHub releases. On macOS (64-bit):

Shell
curl -L -o opa https://openpolicyagent.org/downloads/v0.46.1/opa_darwin_amd64
chmod 755 ./opa

Tested on an M1 Mac, it works as well.

Spec

Let's start with a simple example that achieves attribute-based access control (ABAC) for a fictional Payroll microservice. The rule is simple: you can only access your salary information or your subordinates', not anyone else's. So, if you are bob, and john is your subordinate, then you can access the following:

/getSalary/bob
/getSalary/john

But accessing /getSalary/alice as user bob would not be possible.

Input Data and Rego File

Let's say we have the structured input data (input.json file):

JSON
{
  "user": "bob",
  "method": "GET",
  "path": ["getSalary", "bob"],
  "managers": {
    "bob": ["john"]
  }
}

And let's create a Rego file.
Here we won't bother too much with the syntax of Rego, but the comments will give you a good understanding of what this piece of code does. File example.rego:

Rego
package example

default allow = false                         # default: not allow

allow = true {                                # allow if:
    input.method == "GET"                     # method is GET
    input.path = ["getSalary", person]
    input.user == person                      # input user is the person
}

allow = true {                                # allow if:
    input.method == "GET"                     # method is GET
    input.path = ["getSalary", person]
    managers := input.managers[input.user][_]
    contains(managers, person)                # input user is the person's manager
}

Run

The following should evaluate to true:

Shell
./opa eval -i input.json -d example.rego "data.example"

If we change the path in the input.json file to "path": ["getSalary", "john"], it still evaluates to true, since the second rule allows a manager to check their subordinates' salary. However, if we change the path in the input.json file to "path": ["getSalary", "alice"], it evaluates to false. Here we go. Now we have a simple working solution of ABAC for microservices!

Policy as Code Integrations

The example above is very simple and only useful for grasping the basics of how OPA works. But OPA is much more powerful and can be integrated with many of today's mainstream tools and platforms, like:

Kubernetes
Envoy
AWS CloudFormation
Docker
Terraform
Kafka
Ceph

And more. To quickly demonstrate OPA's capabilities, consider Terraform code defining an auto-scaling group and a server on AWS: with a suitable Rego policy, we can calculate a score based on the Terraform plan and return a decision according to the policy. It's super easy to automate the process:

terraform plan -out tfplan to create the Terraform plan
terraform show -json tfplan | jq > tfplan.json to convert the plan into JSON format
opa exec --decision terraform/analysis/authz --bundle policy/ tfplan.json to get the result
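Beyond one-off opa eval runs, OPA can also serve decisions over its REST API, which is how most of the integrations listed above query it in practice. The following is a minimal sketch of that flow for the payroll example, assuming OPA runs locally on its default port 8181 and that the example.rego and input.json files from the demo are in the current directory; the URL path simply mirrors the package name (example) and rule name (allow).

Shell
# Start OPA as a server and load the example policy (runs in the foreground).
./opa run --server example.rego

# In another terminal: wrap the demo input in an {"input": ...} envelope
# and ask OPA for the data.example.allow decision.
jq '{input: .}' input.json | \
  curl -s -X POST http://localhost:8181/v1/data/example/allow \
    -H 'Content-Type: application/json' \
    -d @-
# Expected response shape: {"result": true}

This is the same endpoint that sidecar and library integrations typically call, so the policy logic stays identical whether it is evaluated from the CLI, a microservice, or an admission controller.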
GitOps is a software development and operations methodology that uses Git as the source of truth for deployment configurations. It involves keeping the desired state of an application or infrastructure in a Git repository and using Git-based workflows to manage and deploy changes. Two popular open-source tools that help organizations implement GitOps for managing their Kubernetes applications are Flux and Argo CD. In this article, we'll take a closer look at these tools, their pros and cons, and how to set them up.

Common Use Cases for Flux and Argo CD

Flux

Continuous delivery: Flux can be used to automate the deployment pipeline and ensure that changes are automatically deployed as soon as they are pushed to the Git repository.
Configuration management: Flux allows you to store and manage your application's configuration as code, making it easier to version control and track changes.
Immutable infrastructure: Flux helps enforce an immutable infrastructure approach—where changes are made only through the Git repository and not through manual intervention on the cluster.
Blue-green deployments: Flux supports blue-green deployments—where a new version of an application is deployed alongside the existing version, and traffic is gradually shifted to the new version.

Argo CD

Continuous deployment: Argo CD can be used to automate the deployment process, ensuring that applications are always up-to-date with the latest changes from the Git repository.
Application promotion: Argo CD supports application promotion—where applications can be promoted from one environment to another, for example, from development to production.
Multi-cluster management: Argo CD can be used to manage applications across multiple clusters, ensuring the desired state of the applications is consistent across all clusters.
Rollback management: Argo CD provides rollback capabilities, making it easier to revert changes in case of failures.

The choice between the two tools depends on the specific requirements of the organization and application, but both tools provide a GitOps approach to simplify the deployment process and reduce the risk of manual errors. They both have their own pros and cons, and in this article, we'll take a look at what they are and how to set them up.

What Is Flux?

Flux is a GitOps tool that automates the deployment of applications on Kubernetes. It works by continuously monitoring the state of a Git repository and applying any changes to a cluster. Flux integrates with various Git providers such as GitHub, GitLab, and Bitbucket. When changes are made to the repository, Flux automatically detects them and updates the cluster accordingly.

Pros of Flux

Automated deployments: Flux automates the deployment process, reducing manual errors and freeing up developers to focus on other tasks.
Git-based workflow: Flux leverages Git as a source of truth, which makes it easier to track and revert changes.
Declarative configuration: Flux uses Kubernetes manifests to define the desired state of a cluster, making it easier to manage and track changes.

Cons of Flux

Limited customization: Flux only supports a limited set of customizations, which may not be suitable for all use cases.
Steep learning curve: Flux has a steep learning curve for new users and requires a deep understanding of Kubernetes and Git.

How To Set Up Flux

Prerequisites

A running Kubernetes cluster.
Helm installed on your local machine.
A Git repository for your application's source code and Kubernetes manifests.
The repository URL and an SSH key for the Git repository.

Step 1: Add the Flux Helm Repository

The first step is to add the Flux Helm repository to your local machine. Run the following command to add the repository:

Shell
helm repo add fluxcd https://charts.fluxcd.io

Step 2: Install Flux

Now that the Flux Helm repository is added, you can install Flux on the cluster. Run the following command to install Flux:

Shell
helm upgrade -i flux fluxcd/flux \
  --set git.url=git@github.com:<your-org>/<your-repo>.git \
  --set git.path=<path-to-manifests> \
  --set git.pollInterval=1m \
  --set git.ssh.secretName=flux-git-ssh

In the above command, replace the placeholder values with your own Git repository information. The git.url parameter is the URL of the Git repository, the git.path parameter is the path to the directory containing the Kubernetes manifests, and the git.ssh.secretName parameter is the name of the SSH secret containing the SSH key for the repository.

Step 3: Verify the Installation

After running the above command, you can verify the installation by checking the status of the Flux pods. Run the following command to view the pods:

Shell
kubectl get pods -n <flux-namespace>

If the pods are running, Flux has been installed successfully.

Step 4: Connect Flux to Your Git Repository

The final step is to connect Flux to your Git repository. Run the following command to generate an SSH key and create a secret:

Shell
ssh-keygen -t rsa -b 4096 -f id_rsa
kubectl create secret generic flux-git-ssh \
  --from-file=id_rsa=./id_rsa --namespace=<flux-namespace>

In the above command, replace the <flux-namespace> placeholder with the namespace where Flux is installed. Now, add the generated public key as a deploy key in your Git repository. You have successfully set up Flux using Helm. Whenever changes are made to the Git repository, Flux will detect them and update the cluster accordingly. In conclusion, setting up Flux using Helm is quite a simple process. By using Git as a source of truth and continuously monitoring the state of the cluster, Flux helps simplify the deployment process and reduce the risk of manual errors.

What Is Argo CD?

Argo CD is an open-source GitOps tool that automates the deployment of applications on Kubernetes. It allows developers to declaratively manage their applications and keeps the desired state of the applications in sync with the live state. Argo CD integrates with Git repositories and continuously monitors them for changes. Whenever changes are detected, Argo CD applies them to the cluster, ensuring the application is always up-to-date. With Argo CD, organizations can automate their deployment process, reduce the risk of manual errors, and benefit from Git's version control capabilities. Argo CD provides a graphical user interface and a command-line interface, making it easy to use and manage applications at scale.

Pros of Argo CD

Advanced deployment features: Argo CD provides advanced deployment features, such as rolling updates and canary deployments, making it easier to manage complex deployments.
User-friendly interface: Argo CD provides a user-friendly interface that makes it easier to manage deployments, especially for non-technical users.
Customizable: Argo CD allows for greater customization, making it easier to fit the tool to specific use cases.

Cons of Argo CD

Steep learning curve: Argo CD has a steep learning curve for new users and requires a deep understanding of Kubernetes and Git.
Complexity: Argo CD has a more complex architecture than Flux, which can make it more difficult to manage and troubleshoot.

How To Set Up Argo CD

Argo CD can be installed on a Kubernetes cluster using Helm, a package manager for Kubernetes. In this section, we'll go through the steps to set up Argo CD using Helm.

Prerequisites

A running Kubernetes cluster.
Helm installed on your local machine.
A Git repository for your application's source code and Kubernetes manifests.

Step 1: Add the Argo CD Helm Repository

The first step is to add the Argo CD Helm repository to your local machine. Run the following command to add the repository:

Shell
helm repo add argo https://argoproj.github.io/argo-cd

Step 2: Install Argo CD

Now that the Argo CD Helm repository is added, you can install Argo CD on the cluster. Run the following command to install Argo CD:

Shell
helm upgrade -i argocd argo/argo-cd --set server.route.enabled=true

Step 3: Verify the Installation

After running the above command, you can verify the installation by checking the status of the Argo CD pods. Run the following command to view the pods:

Shell
kubectl get pods -n argocd

If the pods are running, Argo CD has been installed successfully.

Step 4: Connect Argo CD to Your Git Repository

The final step is to connect Argo CD to your Git repository. Argo CD provides a graphical user interface that you can use to create applications and connect to your Git repository. To access the Argo CD interface, run the following command to get the URL:

Shell
kubectl get routes -n argocd

Use the URL in a web browser to access the Argo CD interface. Once you're in the interface, you can create a new application by providing the Git repository URL and the path to the Kubernetes manifests (the same step can also be done from the command line; see the sketch at the end of this article). Argo CD will continuously monitor the repository for changes and apply them to the cluster. You have now successfully set up Argo CD using Helm.

Conclusion

GitOps is a valuable approach for automating the deployment and management of applications on Kubernetes. Flux and Argo CD are two popular GitOps tools that provide a simple and efficient way to automate the deployment process, enforce an immutable infrastructure, and manage applications in a consistent and predictable way. Flux focuses on automating the deployment pipeline and providing configuration management as code, while Argo CD provides a more complete GitOps solution, including features such as multi-cluster management, application promotion, and rollback management. Both tools have their own strengths and weaknesses, and the choice between the two will depend on the specific requirements of the organization and the application. Regardless of the tool chosen, GitOps provides a valuable approach for simplifying the deployment process and reducing the risk of manual errors. By keeping the desired state of the applications in sync with the Git repository, GitOps ensures that changes are made in a consistent and predictable way, resulting in a more reliable and efficient deployment process.
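Here is the command-line sketch mentioned in Step 4 above: a hedged example of registering the same kind of application with the argocd CLI instead of the web interface. The application name (guestbook), repository URL, manifest path, and namespace are placeholders to replace with your own values, and the commands assume the argocd CLI is installed and can reach the Argo CD API server (the initial admin password is stored in a secret created at install time, whose exact name can vary by chart version).

Shell
# Log in to the Argo CD API server.
argocd login <argocd-server-host> --username admin --password <admin-password>

# Register an application: point Argo CD at a Git repo, a path containing
# manifests, and the destination cluster/namespace to sync into.
argocd app create guestbook \
  --repo https://github.com/<your-org>/<your-repo>.git \
  --path <path-to-manifests> \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace default

# Trigger an initial sync and check the application's health and status.
argocd app sync guestbook
argocd app get guestbook

From that point on, Argo CD keeps monitoring the repository and reconciling the live state, exactly as described above.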
Modern organizations need complex IT infrastructures functioning properly to provide goods and services at the expected level of performance. Therefore, losing critical parts of the infrastructure, or the whole of it, can put the organization on the edge of disappearance. Disasters remain a threat to production processes.

What Is a Disaster?

A disaster is a severe event that instantly overwhelms the capacity of available human, IT, financial, and other resources and results in significant losses of valuable assets (for example, documents, intellectual property, data, or hardware). In most cases, a disaster is a sudden chain of events causing non-typical threats that are difficult or impossible to stop once the disaster starts. Depending on the type of disaster, an organization needs to react in specific ways. There are three main types of disasters:

Natural disasters
Technological and human-made disasters
Hybrid disasters

A natural disaster is the first thing that probably comes to your mind when you hear the word "disaster". Different types of natural disasters include floods, earthquakes, forest fires, abnormal heat, intense snowfalls, heavy rains, hurricanes and tornadoes, and sea and ocean storms.

A technological disaster is the consequence of anything connected with malfunctions of tech infrastructure, human error, or malicious intent. The list can include any issue from a software disruption in an organization to a power plant problem causing difficulties in a whole city, region, or even country. These are disasters such as global software disruptions, critical hardware malfunctions, power outages and electricity supply problems, malware infiltration (including ransomware attacks), telecommunication issues (including network isolation), military conflicts, terrorism incidents, dam failures, and chemical incidents.

The third category describes mixed disasters that unite the features of natural and technological factors. For example, a dam failure can cause a flood resulting in a power outage and communication issues across an entire region or country.

What Is Disaster Recovery?

Disaster recovery (DR) is a set of actions (a methodology) that an organization should take to recover and restore operations after a global disruptive event. Major disaster recovery activities focus on regaining access to data, hardware, software, network devices, connectivity, and power supply. DR actions can also cover rebuilding logistics and relocating staff members and office equipment in case of damaged or destroyed assets. To create a disaster recovery plan, you need to think over the action sequences to complete during these periods:

Before the disaster (building, maintaining, and testing the DR system and policies)
During the disaster (applying the immediate response measures to avoid or mitigate asset losses)
After the disaster (applying the DR system to restore operations, contacting clients, partners, and officials, and analyzing losses and recovery efficiency)

Here are the points to include in your disaster recovery plan.

Business Impact Analysis and Risk Assessment Data

At this step, you study the threats and vulnerabilities that are typical and most dangerous for your organization. With that knowledge, you can also calculate the probability of a particular disaster occurring, measure potential impacts on your production, and implement suitable disaster recovery solutions more easily.
Recovery Objectives: Defined RPO and RTO

RPO is the recovery point objective: the parameter that defines the amount of data you can lose without a significant impact on production. RTO is the recovery time objective: the longest downtime your organization can tolerate and, thus, the maximum time you have to complete recovery workflows.

Distribution of Responsibilities

A team that is aware of every member's duties in case of disaster is a must-have component of an efficient DR plan. Assemble a special DR team, assign specific roles to every employee, and train them to fulfill their roles before an actual disaster strikes. This is the way to avoid confusion and missing links when real action is required to save an organization's assets and production.

DR Site Creation

A disaster of any scale or nature can critically damage your main server and production office, making resuming operations there impossible or extraordinarily time-consuming. In this situation, a prepared DR site with replicas of critical workloads is the best choice to minimize RTO and continue providing services to the organization's clients during and after an emergency.

Failback Preparations

Failback, which is the process of returning workloads back to the main site when the main data center is operational again, can be overlooked when planning disaster recovery. Nevertheless, establishing failback sequences beforehand helps make the entire process smoother and avoid minor data losses that might happen otherwise. Additionally, keep in mind that a DR site is usually not designed to support your infrastructure's functioning for a prolonged period.

Remote Storage for Crucial Documents and Assets

Even small organizations produce and process a lot of crucial data nowadays. Losing hard copies or digital documents can make their recovery time-consuming, expensive, or even impossible. Thus, preparing remote storage (for example, VPS cloud storage for digital docs and protected physical storage for hard-copy assets) is a solid choice to ensure the accessibility of important data in case of disaster. You can also look into an all-in-one solution for VMware disaster recovery if you want.

Equipment Requirements Noted

This DR plan element requires auditing the nodes that enable the functioning of your organization's IT infrastructure. This includes computers, physical servers, network routers, hard drives, cloud-based server hosting equipment, etc. That knowledge enables you to view the elements required to restore the original state of the IT environment after a disaster. What's more, you can see the list of equipment required to support at least mission-critical workloads and ensure production continuity when the main resource is unavailable.

Communication Channels Defined

Ensure a stable and reliable internal communication system for your staff members, management, and DR team. Define the order in which communication channels are used to deal with the unavailability of the main server and internal network right after a disaster.

Response Procedures Outlined

In a DR plan, the first hours are critical. Create step-by-step instructions on how to execute DR activities, monitor and conduct processes, failover sequences, system recovery verification, etc. In case a disaster still hits the production center despite all the prevention measures applied, a concentrated and rapid response to a particular event can help mitigate the damage.
Incident Reporting to Stakeholders

After a disaster strikes and disrupts your production, it is not only the DR team members who should be informed. You also need to notify key stakeholders, including your marketing team, third-party suppliers, partners, and clients. As part of your disaster recovery plan, create outlines and scripts showing your staff how to inform each critical group regarding its concerns. Additionally, a basic press release created beforehand can help you not waste time during an actual incident.

DR Plan Testing and Adjustment

Successful organizations change and expand with time, and their DR plans should be adjusted according to the relevant needs and recovery objectives. Test your plan right after you finish it, and perform additional testing every time you introduce changes. Thus, you can measure the efficiency of your disaster recovery plan and ensure the recoverability of your assets.

Optimal DR Strategy Applied

A DR strategy can be implemented on a DIY (do it yourself) basis or delegated to a third-party vendor. The former choice sacrifices reliability in favor of economy, while the latter can be more expensive but more efficient. The choice of a DR strategy fully depends on your organization's features, including the team size, IT infrastructure complexity, budget, risk factors, and desired reliability, among others.

Summary

A disaster is a sudden destructive event that can render an organization inoperable. Natural, human-made, and hybrid disasters have different levels of predictability, but they are barely preventable at an organization's level. The only way to ensure the safety of an organization is to create a reliable disaster recovery plan based on the organization's specific needs. The key elements of a DR plan are:

Risk assessment and impact analysis
Defined RPO and RTO
Distributed DR team responsibilities
DR site creation
Preparations for failback
Remote storage
Equipment list
Established communication channels
Immediate response sequences
Incident reporting instructions
Disaster recovery testing and adjustment
Optimal DR strategy choice
You know that documentation is crucial in any software development effort, mainly because it increases quality, decreases the number of meetings, and makes the team more scalable. The question is how to start documenting a new repository or a project that has no documentation yet. This article will show how to start documentation in a regular source code repository. The documentation in the source repository counts as tactical documentation. There is also strategic documentation that covers the architecture, such as the C4 model, a tech radar, and architecture decision records (ADRs), which we won't cover in this tutorial. Before starting: we'll use AsciiDoc instead of Markdown. AsciiDoc has several features that reduce boilerplate and has more capabilities than Markdown. Furthermore, AsciiDoc supports most Markdown syntax; thus, it will be smooth to use and to migrate to AsciiDoc.

README

The README file is found in virtually any Git repository. It is a developer's first contact with the source code. This file must give brief context, such as:

An introductory paragraph explaining why the repository exists
The goals, as bullets
A Getting Started section covering how to use and install it (e.g., in a Maven-based project, how to add the Maven dependency)
A highlight of the API

AsciiDoc
= Project Name
:toc: auto

== Introduction

A paragraph that explains the "why" or reason for this project exists.

== Goals

* The goals on bullets
* The second goal

== Getting Started

Your reader gets into here; they need to know how to use and install it.

== The API overview

The coolest features here

== To know more

More references such as books, articles, videos, and so on.

== Samples

* https://github.com/spring-projects/spring-data-commons[Spring data commons]
* https://github.com/eclipse/jnosql[JNoSQL]
* https://github.com/xmolecules/jmolecules[jmolecules]

We have our first file and the overview of the project: what it can and cannot do. The next step is the historical record of the project and what was released in each version, in our subsequent documentation.

Changelog

The changelog holds all the notable changes in a single file, starting with the latest version; thus, the developer knows what has changed in each version. The main goal is to be easier to read than the Git history. Accordingly, it should record the crucial moments of each release. Briefly, each version has the date and the version number, plus categories of changes such as added, changed, fixed, and removed. The source code below shows a case.

AsciiDoc
= Changelog
:toc: auto

All notable changes to this project will be documented in this file.

The format is based on https://keepachangelog.com/en/1.0.0/[Keep a Changelog],
and this project adheres to https://semver.org/spec/v2.0.0.html[Semantic Versioning].

== [Unreleased]

=== Added
- Create

=== Changed
- Changed

=== Removed
- Remove

=== Fixed
- Ops, fixed

== [old-version] - 2022-08-04

Great! Now, anyone on the team can see the changes by version and what the source repository does without going to several meetings or spending time figuring out the purpose of this repository. Whereas those two documents live outside the code, let's now go deep inside the code.

Code Documentation

Yes, documentation inside the code helps the maintainability of any software. It is worth repeating as often as possible, mainly to destroy the idea of self-documenting code, which is a utopia in the IT area. You can have good code, good documentation, and tests that help with documentation.
Code documentation should be complementary to the source code: explain the "why" of the code design and bring the business context into the code. Remember, the tests also help with documentation. Please don't use documentation to restate what the source code already says; that is a waste of documentation and a bad practice. In Java, you can explore the Javadoc capabilities; if you are not using Java, don't worry: most languages have a specific tool for it.

Conclusion

This article explains three documentation types with which to start a regular source code repository. Please be aware that this is only a start, and documentation may vary with the project, such as OpenAPI with Swagger when we talk about REST APIs. Documentation is crucial, and we can see that several colossal pieces of software, such as open-source projects like Java, Linux, Go, etc., have good documentation. Indeed, across many projects, proposals, languages, and architectural styles, treating documentation as a first-class citizen is the expected behavior. I hope you understand how meaningful documentation and good code are, and I hope you make your life and your company's simpler with more documentation in the source code repository.
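To keep this tactical documentation from drifting, it helps to generate and check it as part of the regular build. The commands below are only a sketch, assuming a Maven-based Java project, AsciiDoc files named README.adoc and CHANGELOG.adoc, and a locally installed asciidoctor CLI:

Shell
# Render the repository documentation written in AsciiDoc to HTML.
asciidoctor README.adoc CHANGELOG.adoc

# Generate the Javadoc for a Maven project; warnings about missing or
# malformed comments surface in the build output.
mvn javadoc:javadoc

Running these in CI keeps the README, changelog, and code-level documentation from silently falling out of step with the code they describe.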
Tech moves fast. As deadlines get closer, it can be tempting to code fast to get it shipped as quickly as possible. But fast and good don't always equate. And worse, where there's dud code, there's a chump stuck refactoring their predecessors' dirty code and dealing with a mountain of technical debt. And no one wants that to be you. Fortunately, engineering leads can set the standard with practices prioritizing code quality metrics and health. In this article, I'll cover:

Why code quality matters
The most important code quality metrics
How to handle code quality issues

Why Code Quality Matters

Code quality is crucial for software development as it determines how well the codebase functions, scales, and is maintained over time. Good-quality code ensures the software is stable, secure, and efficient. On the other hand, poor-quality code can lead to bugs, security vulnerabilities, performance issues, and scalability problems. Code quality impacts the amount of technical debt that accumulates over time. Technical debt refers to the cost of maintaining a codebase developed with suboptimal practices or shortcuts. It accumulates over time, and if left unchecked, it can make it difficult or impossible to maintain or improve the codebase without significant effort or cost. Due to deadlines, budgets, and other constraints, developers are often forced to make tradeoffs between quality and speed, and accumulating technical debt is inevitable. However, accumulating technical debt imprudently can increase the risk of bugs and security vulnerabilities, make adding features harder, slow development, and even require a complete codebase rewrite in extreme cases. But how do we get to this place? No single metric or focus point can fully capture code quality. Rather, it's the combined efforts that help build and maintain healthy codebases. Here are some of the mission-critical metrics you need to prioritize:

Code Churn

Let's be clear. Code churn is a normal part of the development process. But the amount of code churn also helps you measure your development team's efficiency and productivity. If the bulk of a dev's time is spent writing and rewriting the same bits of code, it can suggest one of two things:

The problem code is being tested, workshopped, and refined. This is a good thing, especially when dealing with a sticky problem or challenge that requires regular reworking as part of an evolutionary process. Time spent learning that results in a win is always well-spent. Code churn in response to customer feedback is a plus.
There's a bigger problem. Code churn that consistently results in low work output can be a sign of poor programming practices, lack of experience or skill at self-editing, low morale or burnout, or external woes like ever-changing demands from other departments or external stakeholders. (Yep, this is why they pay you the big bucks.)

Besides talking regularly with your team about their process and holding regular code reviews, you can also measure code churn using various tools such as:

Azure DevOps Server
NCover
Minware
Git

Coding Conventions

In most companies, marketing and editorial departments have a style guide, a living document that outlines the rules for written content — think everything from grammar to whether you write 1M or 1 million. Dev teams should also have coding conventions — they might be specific to particular projects or shared across the whole team. After all, code is also about interpersonal communication, not just machines.
And a shared language helps create quality, clean code by improving readability and maintainability. It also helps team members understand each other's code and easily find files or classes. Some common coding conventions:

Naming conventions
Comments
Spaces
Operators
Declarations
Indentation
File organization
Architectural patterns

You might think of using a code linter to highlight code smells or security gaps, but you can also use it to monitor code conventions. Tools include:

Prettier (JavaScript)
RuboCop (Ruby)
StyleCop (C#)

And a whole lot of open-source options. Get at it! While at it, spend some time on an agreed-upon approach to documentation. Think of it as passing a torch or baton of knowledge from past you to future you.

Code Complexity

OK, some code is just an intricate maze born of pain. But it may also be an impenetrable jungle that's impossible for anyone to understand except the original person who wrote it. It gets shuffled to the bottom of the pile every time the team talks about refactoring to reduce technical debt. And remember, many devs only stay in one role for a couple of years, so that person could spread their jungle of complex code like a virus across multiple organizations if left unchecked. Terrifying. But back to business. Simply put, code complexity refers to how difficult code is to understand, maintain, and modify. Overly complex code is also at greater risk of bugs and may resist new additions. There are two main methods of measuring code complexity:

Cyclomatic complexity measures the complexity of a program based on the number of independent paths through the code. It is a quantitative measure of the number of decision points in the code, such as if statements and for/while loops. The more paths through your code, the more potential problem areas that may require further testing or refactoring.
Halstead complexity metrics measure the size and difficulty of a program based on the program length, vocabulary size, the number of distinct operators and operands, the number of statements, and the number of bugs per thousand lines of code. These metrics evaluate software quality and maintainability and predict the effort required to develop, debug, and maintain software systems. That said, Halstead complexity metrics don't factor in other important aspects like code readability, performance, security, and usability, so they're not something to use as a stand-alone tool.

Check out Verifysoft if you want to dig into the metrics more. (A quick command-line sketch for measuring cyclomatic complexity appears at the end of this article.)

Code Maintainability

Code maintainability is pretty much what it sounds like – how easy it is to maintain and modify a codebase over time. Poorly maintained code causes delays as it takes longer to update and can stand out as a critical flaw against your competitors. But done well, it pulls together all the good stuff, including code clarity, complexity, naming conventions, code refactoring, documentation, version control, code reviews, and unit testing. So, it's about a broader commitment to code quality rather than a single task or tool. And it's always worth remembering that, like refactoring, you'll maintain your own and other people's code long after the first write. Fortunately, there are automation tools aplenty from the get-go when it comes to maintainability. For example, static code analysis tools identify bug risks, anti-patterns, performance issues, and unused code.

New Bugs vs. Closed Bugs

Every bug in your software is a small piece of technical debt that accumulates over time.
To keep track of your technical debt, your engineers must track and document every bug, including when it is fixed. One way to measure your code quality is to compare the number of new bugs discovered to the number of bugs that have been closed. If new bugs consistently outnumber closed bugs, it's a sign that your technical debt is increasing and that you need to take action to manage it effectively. Tracking new bugs versus closed bugs helps identify potential issues in the development process, such as insufficient testing, poor quality control, or lack of resources for fixing bugs. By comparing the two, you can make informed decisions about allocating resources, prioritizing bug fixes, and improving your development practices to reduce technical debt and improve code quality over time.

I co-founded Stepsize to fix this problem. Our tool helps modern engineering teams improve code quality by making it easy to identify, track, prioritize, and fix technical debt or code quality issues:

Create, view, and manage issues directly from your codebase with the Stepsize tool.
Issues are linked to code, making them easy to understand and prioritize.
Use the codebase visualization tool to understand the distribution of tech debt and code quality issues in the codebase.
Use powerful filters to understand the impact on your codebase, product, team, and business priorities.

Rounding Up

It's important to remember that effective code quality management involves more than just relying on a single metric or tool. Instead, engineering leads need to prioritize embedding a commitment to code quality tasks and tools into the daily workflow to ensure consistent improvement over time. This includes helping team members develop good code hygiene through habit stacking and skill development, which can significantly benefit their careers. Tracking and prioritizing technical debt is a critical aspect of increasing code quality. By doing so, teams can make a strong business case for refactoring the essential parts of their codebase, leading to more efficient and maintainable software in the long run.
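To make the cyclomatic complexity metric discussed above a bit more tangible, here is a small, hedged example of measuring it from the command line with radon, an open-source analyzer for Python codebases (most languages have an equivalent tool); the src/ path and the rank threshold are placeholders:

Shell
# Report cyclomatic complexity per function, with scores (-s) and an average (-a).
pip install radon
radon cc -s -a src/

# The same idea as a CI gate using xenon (built on radon):
# fail the build if any block scores worse than rank B.
pip install xenon
xenon --max-absolute B src/

Wiring a threshold like this into CI turns "keep complexity down" from advice into an enforced convention.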
One of my favorite things about QuestDB is the ability to write queries in SQL against a high-performance time series database. Since I've been using SQL as my primary query language for basically my entire professional career, it feels natural for me to interact with data using SQL instead of other newer proprietary query languages. Combined with QuestDB's custom SQL extensions, its built-in SQL support makes writing complex queries a breeze.

In my life as a cloud engineer, I deal with time series metrics all the time. Unfortunately, many of today's popular metrics databases don't support the SQL query language. As a result, I've become more dependent on pre-built dashboards, and it takes me longer to write my own queries with JOINs, transformations, and temporal aggregations. QuestDB can be a great choice for ingesting application and infrastructure metrics; it just requires a little more work on the initial setup than the Kubernetes tooling du jour. Despite this extra upfront time investment (which is fairly minimal in the grand scheme of things), I think the benefits of using QuestDB for infrastructure metrics are worth it. In this article, I will demonstrate how we use QuestDB as the main component in this new feature. This should provide enough information for you to also use QuestDB for ingesting, storing, and querying infrastructure metrics in your own clusters.

Architecture

Prometheus is a common time series database that is already installed in many Kubernetes clusters. We will be leveraging its remote write functionality to pipe data into QuestDB for querying and storage. However, since Prometheus remote write does not support the QuestDB-recommended InfluxDB Line Protocol (ILP) as a serialization format, we need a proxy to translate Prometheus-formatted metrics into ILP messages. We will use InfluxData's Telegraf as this translation component. Now, with our data in QuestDB, we can use SQL to query our metrics using any one of the supported methods: the Web Console, PostgreSQL wire protocol, or HTTP REST API. Here's a quick overview of the architecture:

Prometheus Remote Write

While Prometheus operates on an interval-based pull model, it also has the ability to push metrics to remote sources. This is known as "remote write" capability, and it is easily configurable in a YAML file. Here's an example of a basic remote write configuration:

YAML
remoteWrite:
  - url: http://default.telegraf.svc:9999/write
    name: questdb-telegraf
    remote_timeout: 10s

This YAML will configure Prometheus to send samples to the specified URL with a 10-second timeout. In this case, we will be forwarding our metrics on to Telegraf, with a custom port and endpoint that we can specify in the Telegraf config (see below for more details). There are also a variety of other remote write options, allowing users to customize timeouts, headers, authentication, and additional relabeling configs before writing to the remote data store. All of the possible options can be found on the Prometheus website.

QuestDB ILP and Telegraf

Now that we have our remote write configured, we need to set up its destination. Installing Telegraf into a cluster is straightforward: just helm install its Helm chart. We do need to configure Telegraf to read from a web socket (where Prometheus is configured to write to) and send the data to QuestDB for long-term storage. In a Kubernetes deployment, these options can be set in the config section of the Telegraf Helm chart's values.yaml file.
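Before editing those values, Telegraf itself has to be installed. Here is a hedged sketch of that step, assuming the official InfluxData chart repository and a values file named telegraf-values.yaml that holds the config shown in the next sections:

Shell
# Add the InfluxData chart repository and install Telegraf with our values.
helm repo add influxdata https://helm.influxdata.com/
helm repo update
helm upgrade -i telegraf influxdata/telegraf \
  --namespace default \
  -f telegraf-values.yaml

The release and namespace names here are assumptions; they just need to line up with the default.telegraf.svc address that the Prometheus remote write config above points at.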
Input Configuration

Since Telegraf will be receiving metrics from Prometheus, we need to open a port that enables communication between the two services. Telegraf has an HTTP listener plugin that allows it to listen for traffic on a specified port. We also need to configure the path of the listener to match our Prometheus remote write URL. The HTTP listener (v2) supports multiple data formats to consume via its plugin architecture. A full list of options can be found in the Telegraf docs. We will be using the Prometheus Remote Write Parser Plugin to accept our Prometheus messages. Here is how this setup looks in the Telegraf config:

TOML
[[inputs.http_listener_v2]]
  ## Address and port to host HTTP listener on
  service_address = ":9999"

  ## Paths to listen to.
  paths = ["/write"]

  ## Data format to consume.
  data_format = "prometheusremotewrite"

When passing these values to the Helm chart, you can use this YAML specification:

YAML
config:
  inputs:
    - http_listener_v2:
        service_address: ":9999"
        path: "/write"
        data_format: prometheusremotewrite

Output Configuration

We recommend that you use the InfluxDB Line Protocol (ILP) over TCP to insert data into QuestDB. Luckily, Telegraf includes an ILP output plugin. But unfortunately, this is not a plug-and-play solution. By default, all metrics will be written to a single measurement, prometheus_remote_write, with the individual metric's key being sent over the wire as a field. In practice, this means all of your metrics will be written to a single QuestDB table called prometheus_remote_write. There will then be an additional column for every single metric and field you are capturing. This leads to a large table, with potentially thousands of columns, that's difficult to work with and full of sparse data, which could negatively impact performance. To fix this problem, Telegraf provides us with a sample Starlark script that transforms each measurement such that we have a table per metric in QuestDB. This script will run for every metric that Telegraf receives, so the output will be formatted correctly. This is what Telegraf's output config looks like:

TOML
[[outputs.socket_writer]]
  ## Address and port to write to
  address = "tcp://questdb.questdb.svc:9009"

[[processors.starlark]]
  source = '''
def apply(metric):
  if metric.name == "prometheus_remote_write":
    for k, v in metric.fields.items():
      metric.name = k
      metric.fields["value"] = v
      metric.fields.pop(k)
  return metric
'''

As an added benefit of using ILP with QuestDB, we don't have to worry about each metric's fieldset. Over ILP, QuestDB automatically creates tables for new metrics. It also adds new columns for fields that it hasn't seen before, and INSERTs nulls for any missing fields.

Helm Configuration

I've found that the easiest way to configure the values.yaml file is to mount the Starlark script as a volume and add a reference to it in the config. This way we don't need to deal with any whitespace handling or special indentation in our ConfigMap specification.
The output and Starlark Helm configuration would look like this:

YAML
# continued from above #
config:
  outputs:
    - socket_writer:
        address: tcp://questdb.questdb.svc:9009
  processors:
    - starlark:
        script: /opt/telegraf/remotewrite.star

We also need to add the volume and mount at the root level of the values.yaml:

YAML
volumes:
  - name: starlark-script
    configMap:
      name: starlark-script
mountPoints:
  - name: starlark-script
    mountPath: /opt/telegraf
    subpath: remotewrite.star

This volume references a ConfigMap that contains the Starlark script from the above example:

YAML
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: starlark-script
data:
  remotewrite.star: |
    def apply(metric):
      ...

Querying Metrics With SQL

QuestDB has some powerful SQL extensions that can simplify writing time series queries. For example, given the standard set of metrics that a typical Prometheus installation collects, we can use QuestDB to not only find pods with the highest memory usage in a cluster (over a six-month period) but also find the specific time period when the memory usage spiked. We can even access custom labels to help identify the pods with a human-readable name (instead of the long alphanumeric name assigned to pods by deployments or stateful sets). This is all performed with a simple SQL syntax using JOINs (enhanced by the ASOF keyword) and SAMPLE BY to bucket data into days with a simple line of SQL:

SQL
SELECT
  l.label_app_kubernetes_io_custom_name,
  w.timestamp,
  max(w.value / r.value) as mem_usage
FROM container_memory_working_set_bytes AS w
ASOF JOIN kube_pod_labels AS l ON (w.pod = l.pod)
ASOF JOIN kube_pod_container_resource_limits AS r ON (
  r.pod = w.pod AND
  r.container = w.container
)
WHERE label_app_kubernetes_io_custom_name IS NOT NULL
  AND r.resource = 'memory'
  AND w.timestamp > '2022-06-01'
SAMPLE BY 1d ALIGN TO CALENDAR TIME ZONE 'Europe/Berlin'
ORDER BY mem_usage DESC;

Here's a sample output of that query:

label_app_kubernetes_io_custom_name | timestamp                   | mem_usage
keen austin                         | 2022-07-04T16:18:00.000000Z | 0.999853875401
optimistic banzai                   | 2022-07-12T16:18:00.000000Z | 0.9763028946
compassionate taussig               | 2022-07-11T16:18:00.000000Z | 0.975367909527
cranky leakey                       | 2022-07-11T16:18:00.000000Z | 0.974941994418
quirky morse                        | 2022-07-05T16:18:00.000000Z | 0.95084235665
admiring panini                     | 2022-06-21T16:18:00.000000Z | 0.925567626953

This is only one of many ways that you can use QuestDB to write powerful time-series queries that you can use for one-off investigation or to power dashboards.

Metric Retention

Since databases storing infrastructure metrics can grow to extreme sizes over time, it is important to enforce a retention period to free up space by deleting old metrics. Even though QuestDB does not support the traditional DELETE SQL command, you can still implement metric retention by using the DROP PARTITION command. In QuestDB, data is stored by columns on disk and optionally partitioned by a time duration. By default, when a new table is automatically created during ILP ingestion, it is partitioned by DAY. This allows us to DROP PARTITIONs on a daily basis. If you need a different partitioning scheme, you can create the table with your desired partition period before ingesting any data over ILP, since ALTER TABLE does not support any changes to table partitioning. But since ILP automatically adds columns, the table specification can be very simple, with just the name and a timestamp column.
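As an illustration, here is a hedged sketch of what that up-front table creation, and a later retention sweep, could look like against QuestDB's HTTP /exec endpoint. The metric name (node_cpu_seconds_total), the MONTH partitioning, the 30-day window, and the in-cluster host questdb.questdb.svc:9000 are all assumptions to adapt to your own setup:

Shell
# Create the metric table ahead of ILP ingestion with a custom partition period.
curl -G http://questdb.questdb.svc:9000/exec --data-urlencode \
  "query=CREATE TABLE node_cpu_seconds_total (value DOUBLE, timestamp TIMESTAMP) TIMESTAMP(timestamp) PARTITION BY MONTH;"

# Retention sweep: drop any partition older than 30 days.
curl -G http://questdb.questdb.svc:9000/exec --data-urlencode \
  "query=ALTER TABLE node_cpu_seconds_total DROP PARTITION WHERE timestamp < dateadd('d', -30, now());"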
Once you've decided on your desired metric retention period, you can create a cron job that removes all partitions older than your oldest retention date. This will help keep your storage usage in check.

Working Example

I have created a working example of this setup in a repo, sklarsa/questdb-metrics-blog-post. The entire example runs in a local Kind cluster. To run the example, execute the following commands:

Shell
git clone https://github.com/sklarsa/questdb-metrics-blog-post.git
cd questdb-metrics-blog-post
./run.sh

After a few minutes, all pods should be ready, and you will see the following prompt:

Plain Text
You can now access QuestDB here: http://localhost:9000
Ctrl-C to exit
Forwarding from 127.0.0.1:9000 -> 9000
Forwarding from [::1]:9000 -> 9000

From here, you can navigate to http://localhost:9000 and explore the metrics that are being ingested into QuestDB. The default Prometheus scrape interval is thirty seconds, so there may not be much data yet, but you should see a list of tables, one per metric that we are collecting.

Once you're done, you can clean up the entire experiment by deleting the cluster:

Shell
./cleanup.sh

Conclusion

QuestDB can be a very powerful piece in the cloud engineer's toolkit. It grants you the ability to run complex time-series queries across multiple metrics with unparalleled speed, in the world's most ubiquitous query language: SQL. Every second counts when debugging an outage at 2 AM, and reducing the cognitive load of writing queries, as well as their execution time, is a game-changer for me.
Data lineage, an automated visualization of the relationships showing how data flows across tables and other data assets, is a must-have in the data engineering toolbox. Not only is it helpful for data governance and compliance use cases, but it also plays a starring role as one of the five pillars of data observability. Data lineage accelerates a data engineer's ability to understand the root cause of a data anomaly and the potential impact it may have on the business. As a result, data lineage's popularity as a must-have component of modern data tooling has skyrocketed faster than a high schooler with parents traveling out of town for the weekend.

Consequently, almost all data catalogs have introduced data lineage in the last few years. More recently, some big data cloud providers, such as Databricks and Google (as part of Dataplex), have announced data lineage capabilities. It's great to see that so many leaders in the space realize the value of lineage for use cases across the data stack, from data governance to discovery. But now that there are multiple solutions offering some flavor of data lineage, the question arises: does it still need to be a required feature within a data quality solution?

The answer is an unequivocal "yes." When it comes to tackling data reliability, vanilla lineage just doesn't cut it. Here's why.

1. Data Lineage Informs Incident Detection and Alerting

Data lineage powers better data quality incident detection and alerting when it's natively integrated within a data observability platform. For example, imagine you have an issue with a table upstream that cascades into multiple other tables across several downstream layers. Do you want your team to get one alert, or do you want to get 15, all for the same incident? The first option accurately depicts the full context along with a natural point to start your root cause analysis. The second option is akin to receiving 15 pages of a book out of order and hoping your on-call data engineer is able to piece together that they are all part of a single story. As a function of data observability, data lineage pieces together this story automatically, identifying which alert is the climax and which are just falling action.

Not to mention, too many superfluous alerts are the quickest route to alert fatigue, scientifically defined as the point where the data engineer rolls their eyes, shakes their head, and moves on to another task. So when your incident management channel in Slack has more than 25 unread messages, all corresponding to the same incident, are you really getting value from your data observability platform?

One way to help combat alert fatigue and improve incident detection is to set alert parameters to only notify you about anomalies in your most important tables. However, without native data lineage, it's difficult and time-consuming to understand which assets truly are important.

One of the keys to operationalizing data observability is to ensure alerts are routed to the right responders, those who best understand the domain and the particular systems in question. Data lineage can help surface and route alerts to the appropriate owners on both the data team and business stakeholder sides of the house.

2. Data Lineage Accelerates Incident Resolution

Data engineers are able to fix broken pipelines and anomalous data faster when data lineage is natively incorporated within the data observability platform.
Without it, you just have a list of incidents and a map of table/field dependencies, neither of which is particularly useful without the other. Without incidents embedded in lineage, those dots aren't connected, and they certainly aren't connected to how data is consumed within your organization.

For example, data lineage is essential to the incident triage process. To butcher a proverb, "If a table experiences an anomaly, but no one consumes data from it, do you care?" Tracing incidents upstream across two different tools is a disjointed process. You don't just want to swim up to the rawest upstream table; you want to swim up to the most upstream table where the issue is still present.

Of course, once we arrive at our most upstream table with an anomaly, our root cause analysis process has just begun. Data lineage gives you the where but not always the why. Data teams must now determine if it is:

A systems issue: Did an Airflow job not run? Were there issues with permissions in Snowflake?
A code issue: Did someone modify a SQL query or dbt model that mucked everything up?
A data issue: Did a third party send us garbage data filled with NULLs and other nonsense?

Data lineage is valuable, but it is not a silver bullet for incident resolution. It is at its best when it works within a larger ecosystem of incident resolution tools such as query change detection, high-correlation insights, and anomalous row detection.

3. A Single Pane of Glass

Sometimes vendors say their solution provides "a single pane of glass" with a bit too much robotic reverence and without enough critical thought toward the value provided: nice to look at but not very useful. In the case of data observability, however, a single pane of glass is integral to how efficient and effective your data team can be in its data reliability workflows.

I previously mentioned the disjointed nature of cross-referencing your list of incidents against your map of data dependencies. Still, it's important to remember that data pipelines extend beyond a single environment or solution. It's great to know data moved from point A to point B, but your integration points will paint the full story of what happened to it along the way. Not all data lineage is created equal; the integration points and how they are surfaced are among the biggest differentiators.

For example, are you curious how changes in dbt models may have impacted your data quality? Whether a failed Airflow job created a freshness issue? Whether a table feeds a particular dashboard? If you are leveraging lineage from Dataplex or Databricks to resolve incidents across your environment, you'll likely need to spend precious time piecing together that information. Does your team use both Databricks and Snowflake and need to understand how data flows across both platforms? Let's just say I wouldn't hold my breath for that integration anytime soon.

4. The Right Tool for the Right Job

Ultimately, this decision comes down to the advantages of using the right tool for the right job. Sure, your car has a CD player, but it would be pretty inconvenient to sit in your garage every time you'd like to hear some music. Not to mention the sound quality wouldn't be as high, and the integration with your Amazon Music account wouldn't work.

The parallel here is the overlap between data observability and data catalog solutions. Yes, both have data lineage features, but they are designed within very different contexts.
For instance, Google developed its lineage features with compliance and governance use cases in mind, while Databricks built its lineage for cataloging and quality across native Databricks environments. So while data lineage may appear similar at first glance (spoiler alert: every platform's graph will have boxes connected by lines), the real magic happens with the double click.

For example, with Databricks, you can start with a high-level overview of the lineage and drill into a workflow. (Note: this covers only internal Databricks workflows, not external orchestrators.) You could then see a failed run time, and another click would take you to the code (not shown). Dataplex data lineage is similar, with a depiction showing the relationships between datasets. The subsequent drill-down, which allows you to run an impact analysis, is helpful, but mainly for a "reporting and governance" use case.

A data observability solution should take these high-level lineage diagrams a step further, down to the BI level, which, as previously mentioned, is critical for incident impact analysis. On the drill-down, a data observability solution should provide all of the information shown across both tools, plus a full history of queries run on the table, their runtimes, and associated jobs from dbt. Key data insights such as reads/writes, schema changes, users, and the latest row count should be surfaced as well. Additionally, tables can be tagged (perhaps to denote their reliability level) and given descriptions (to include information on SLAs and other relevant details).

Taking a step beyond comparing lineage UIs for a moment, it's also important to realize that you need a high-level overview of your data health. A data reliability dashboard, fueled by lineage metadata, can help you optimize your data quality investments by revealing your hot spots, uptime/SLA adherence, total incidents, time-to-fixed by domain, and more.

Conclusion: Get the Sundae

As data has become more crucial to business operations, the data space has exploded with many awesome and diverse tools. There are now 31 flavors instead of your typical spread of vanilla, chocolate, and strawberry. This can be as challenging for data engineers as it is exciting. Our best advice is to not get overwhelmed and to let the use case drive the technology rather than vice versa. Ultimately, you will end up with an amazing, if sometimes messy, ice cream sundae with all of your favorite flavors perfectly balanced.
It's Monday morning, and your phone won't stop buzzing. You wake up to messages from your CEO saying, "The numbers in this report don't seem right...again." You and your team drop what you're doing and begin to troubleshoot the issue at hand. However, your team's response to the incident is a mess. Other team members across the organization are repeating efforts, and your CMO is left in the dark while no updates are being sent out to the rest of the organization. As all of this is going on, you get texted by John in Finance about an errant table in his spreadsheet and by Eleanor in Operations about a query that pulled interesting results. What is a data engineer to do?

If this situation sounds familiar to you, know that you are not alone. All too often, data engineers are saddled with the burden of not just fixing data issues, but prioritizing what to fix, how to fix it, and communicating status as the incident evolves. For many companies, the data team responsibilities underlying this firefighting are often ambiguous, particularly when it comes to answering the question: "Who is managing this incident?" Sure, data reliability SLAs should be managed by entire teams, but when the rubber hits the road, we need a dedicated persona to help call the shots and make sure these SLAs are met should data break.

In software engineering, this role is often defined as an incident commander, and its core responsibilities include:

Flagging incidents to the broader data team and stakeholders early and often.
Maintaining a working record of affected data assets or anomalies.
Coordinating efforts and assigning responsibilities for a given incident.
Circulating runbooks and playbooks as necessary.
Assessing the severity and impact of the incident.

Data teams should assign rotating incident commanders on a weekly or daily basis, or for specific data sets owned by specific functional teams. Establishing a good, repeatable practice of incident management (one that delegates clear incident commanders) is primarily a cultural process, but investing in automation and maintaining a constant pulse on data health gets you much of the way there. The rest is education. Here are four key steps every incident commander must take when triaging and assessing the severity of a data issue:

1. Route Notifications to the Appropriate Team Members

When responding to data incidents, the way your data organization is structured will impact your incident management workflow and, as a result, the incident commander process.

Image courtesy of Monte Carlo

If you sit on an embedded data team, it's much easier to delegate incident response (i.e., the marketing data and analytics team owns all marketing analytics pipelines).

Image courtesy of Monte Carlo

If you sit on a centralized data team, fielding and routing these incident alerts to the appropriate owners requires a bit more foresight and planning. Either way, we suggest you set up dedicated Slack channels for data pipelines owned and maintained by specific members of your data team, inviting relevant stakeholders so they're in the know if critical data they rely on is down. Many teams we work with set up PagerDuty or Opsgenie workflows to ensure that no bases are left uncovered.

2. Assess the Severity of the Incident

Image courtesy of Monte Carlo

Once the pipeline owner is notified that something is wrong with the data, the first step they should take is to assess the severity of the incident.
Because data ecosystems are constantly evolving, there is an abundance of changes that can be introduced into your data pipelines at any given time. While some are harmless (i.e., an expected schema change), others are much more serious, with real impact on downstream stakeholders (i.e., rows in a critical table dropping from 10,000 to 1,000).

Once your team starts troubleshooting the issue, it is a best practice to tag the issue based on its status: fixed, expected, investigating, no action needed, or false positive. Tagging the issue helps users assess the severity of the incident. It also plays a key role in communicating updates to relevant stakeholders, in channels specific to the data that was affected, so they can take appropriate action.

What if a data asset breaks that isn't important to your company? In fact, what if this data is deprecated? Phantom data haunts even the best data teams, and I can't tell you how many times I have been on the receiving end of an alert for a data issue that, after all of the incident resolution was said and done, just did not matter to the business. So, instead of tackling high-priority problems, I spent hours or even days firefighting broken data only to discover I was wasting my time. We have not used that table since 2019.

So, how do you determine what data matters most to your organization? One increasingly common way teams discover their most critical data sets is by utilizing tools that help them visualize their data lineage. This gives them visibility into how all of their data sets are related when an incident does arise and allows them to trace data ownership so they can alert the right people who might be affected by the issue.

Once your team figures out the severity of the impact, they will have a better understanding of what priority level the error is. If the data directly powers financial insights, or even reporting on how well your products are performing, it is likely a very high-priority issue and your team should stop what they are doing to fix it ASAP. And if it's not, time to move on.

3. Communicate Status Updates as Often as Possible

Good communication goes a long way in the heat of responding to a data incident, which is why we have already discussed how and why data teams should create a runbook that walks through, step by step, how to handle a given type of incident. Following a runbook is crucial to maintaining clear lines of responsibility and reducing duplication of effort. Once you have "who does what" down, your team can start updating a status page where stakeholders can follow along for updates in real time. A central status page also allows team members to see what others are working on and the current status of those incidents.

In talks with customers, I have seen incident command delegation handled in one of two ways:

Assign a team member to be on call to handle any incidents during a given time period: While on call, that person is responsible for handling all types of data incidents. Some teams have someone who does this full time for all incidents their team manages, while others have a schedule in place that rotates team members every week.
Team members responsible for covering certain tables: This is the most common structure we see. With this structure, team members handle all incidents related to their assigned tables or reports while doing their normal daily activities.
Table assignment is generally aligned with the data or pipelines a given team member works with most closely. One important thing to keep in mind is that there is no right or wrong way here. Ultimately, it is just a matter of making sure that you commit to a process and stick with it.

4. Define and Align on Data SLAs and SLIs to Prevent Future Incidents and Downtime

While the incident commander is not accountable for setting SLAs, they are often held responsible for meeting them. Simply put, service-level agreements (SLAs) are a method many companies use to define and measure the level of service a given vendor, product, or internal team will deliver, as well as potential remedies if they fail to deliver. For example, Slack's customer-facing SLA promises 99.99% uptime every fiscal quarter, and no more than 10 hours of scheduled downtime, for customers on Plus plans and above. If they fall short, affected customers will receive service credits on their accounts for future use.

Your service-level indicators (SLIs), the quantitative measures of your SLAs, will depend on your specific use case, but here are a few metrics used to quantify incident response and data quality:

The number of data incidents for a particular data asset (N): Although this may be beyond your control, given that you likely rely on external data sources, it's still an important driver of data downtime and usually worth measuring.
Time-to-detection (TTD): When an issue arises, this metric quantifies how quickly your team is alerted. If you don't have proper detection and alerting methods in place, this could be measured in weeks or even months. "Silent errors" made by bad data can result in costly decisions, with repercussions for both your company and your customers.
Time-to-resolution (TTR): When your team is alerted to an issue, this measures how quickly you were able to resolve it.

By keeping track of these, data teams can work to reduce TTD and TTR and, in turn, build more reliable data systems.

Why Data Incident Commanders Matter

When it comes to responding to data incidents, time is of the essence, and as the incident commander, time is both your enemy and your best friend. In an ideal world, companies want data issues to be resolved as quickly as possible. However, that is not always the case, and some teams find themselves investigating data issues more frequently than they would like. In fact, while data teams invest a large amount of their time writing and updating custom data tests, they still experience broken pipelines.

An incident commander, armed with the right processes, a pinch of automation, and organizational support, can work wonders for the reliability of your data pipelines. Your CEO will thank you later.
SQL (Structured Query Language) is a powerful and widely used language for managing and manipulating data stored in relational databases. However, it's important to be aware of common mistakes that can lead to bugs, security vulnerabilities, and poor performance in your SQL code. In this article, we'll explore some of the most common mistakes made when writing SQL code and how to avoid them.

1. Not Properly Sanitizing User Input

One common mistake made when writing SQL code is not properly sanitizing user input. This can lead to security vulnerabilities such as SQL injection attacks, where malicious users can inject harmful code into your database. To avoid this mistake, it's important to always sanitize and validate user input before using it in your SQL queries. This can be done using techniques such as prepared statements and parameterized queries, which allow you to pass parameters to your queries in a secure manner.

Here is an example of using a prepared statement with MySQL:

PHP
$mysqli = new mysqli("localhost", "username", "password", "database");

// Create a prepared statement
$stmt = $mysqli->prepare("SELECT * FROM users WHERE email = ? AND password = ?");

// Bind the parameters
$stmt->bind_param("ss", $email, $password);

// Execute the statement
$stmt->execute();

// Fetch the results
$result = $stmt->get_result();

By properly sanitizing and validating user input, you can help protect your database from security vulnerabilities and ensure that your SQL code is reliable and robust.

2. Not Using Proper Indexes

Proper indexing is important for optimizing the performance of your SQL queries. Without proper indexes, your queries may take longer to execute, especially if you have a large volume of data. To avoid this mistake, it's important to carefully consider which columns to index and how to index them. You should also consider the data distribution and query patterns of your tables when choosing which columns to index.

For example, if you have a table with a large number of rows and you frequently search for records based on a specific column, it may be beneficial to create an index on that column. On the other hand, if you have a small table with few rows and no specific search patterns, creating an index may not provide much benefit. It's also important to consider the trade-offs of using different index types, such as B-tree, hash, and full-text indexes. Each type of index has its own benefits and drawbacks, and it's important to choose the right index based on your needs.

3. Not Using Proper Data Types

Choosing the right data type for your columns is important for ensuring that your data is stored efficiently and accurately. Using the wrong data type can lead to issues such as data loss, incorrect sorting, and poor performance. For example, using a VARCHAR data type for a column that contains only numeric values may result in slower queries and increased storage requirements. On the other hand, using an INT data type for a column that contains large amounts of text data may result in data loss.

To avoid this mistake, it's important to carefully consider the data types of your columns and choose the right data type based on the type and size of the data you are storing. It's also a good idea to review the data types supported by your database system and choose the most appropriate data type for your needs.
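To make the indexing and data type points a bit more concrete, here is a minimal, hypothetical MySQL sketch (the users table and its columns are invented for illustration): numeric and temporal values get appropriately sized types rather than catch-all strings, and the column used for frequent lookups gets an index.

MySQL
-- Hypothetical table with data types matched to the data being stored
CREATE TABLE users (
  id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  email      VARCHAR(255) NOT NULL,
  age        TINYINT UNSIGNED,   -- small numeric range, so a small integer type
  created_at DATETIME NOT NULL
);

-- Lookups by email are frequent, so an index avoids full table scans
CREATE UNIQUE INDEX idx_users_email ON users (email);

-- The optimizer can now satisfy this query using the index
SELECT id, email FROM users WHERE email = 'john@example.com';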
4. Not Properly Normalizing Your Data

Proper data normalization is important for ensuring that your data is organized efficiently and with reduced redundancy. Without proper normalization, you may end up with data that is duplicated, difficult to update, or prone to inconsistencies. To avoid this mistake, it's important to follow proper normalization principles, such as breaking up large tables into smaller ones and creating relationships between them using foreign keys. You should also consider the needs of your application and the type of data you are storing when deciding how to normalize your data.

For example, if you have a table with a large number of columns and many of the columns are optional or only contain a few distinct values, it may be beneficial to break up the table into smaller ones and create relationships between them using foreign keys.

5. Not Using Proper SQL Syntax

SQL has a specific syntax that must be followed in order for your queries to execute correctly. Failing to use proper syntax can lead to syntax errors and incorrect query results. To avoid this mistake, it's important to carefully review the syntax of your SQL queries and ensure that you are using the correct syntax for the specific database system you are using. It's also a good idea to use a SQL linter or syntax checker to identify any issues with your syntax.

6. Not Properly Organizing and Formatting Your Code

Proper code organization and formatting are important for making your SQL code easier to read and understand. Without proper organization, your code may be difficult to maintain and debug. To avoid this mistake, it's a good idea to follow standard SQL coding practices, such as using proper indentation, using uppercase for SQL keywords, and using descriptive names for your tables and columns. It's also a good idea to use a code formatter to automatically format your code to follow these practices. By following proper code organization and formatting practices, you can make your SQL code easier to read and maintain.

7. Not Using Transactions Properly

Transactions are an important feature of SQL that allow you to group multiple queries together and either commit or roll back the entire group as a single unit. Failing to use transactions properly can lead to inconsistencies in your data and make it more difficult to recover from errors. To avoid this mistake, it's important to understand how transactions work and use them appropriately. This includes understanding the isolation levels of your database system and using the correct level for your needs. It's also a good idea to use savepoints within your transactions to allow for finer-grained control over the rollback of individual queries.

Here is an example of using transactions in MySQL:

PHP
$mysqli = new mysqli("localhost", "username", "password", "database");

// Start a transaction
$mysqli->begin_transaction();

// Execute some queries
$mysqli->query("INSERT INTO users (name, email) VALUES ('John', 'john@example.com')");
$mysqli->query("INSERT INTO orders (user_id, product_id) VALUES (LAST_INSERT_ID(), 123)");

// Commit the transaction
$mysqli->commit();

By using transactions properly, you can ensure the consistency and integrity of your data and make it easier to recover from errors.
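The savepoints mentioned above are worth a quick illustration. Here is a minimal sketch in plain MySQL (the order_notes table and the inserted values are hypothetical, included only to show the mechanics):

MySQL
START TRANSACTION;

INSERT INTO orders (user_id, product_id) VALUES (42, 123);

-- Mark a point we can roll back to without abandoning the whole transaction
SAVEPOINT after_order;

-- order_notes is a hypothetical table used only for this illustration
INSERT INTO order_notes (order_id, note) VALUES (LAST_INSERT_ID(), 'gift wrap');

-- Undo only the second insert; the first insert remains part of the transaction
ROLLBACK TO SAVEPOINT after_order;

COMMIT;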
8. Not Properly Grouping and Aggregating Data

Grouping and aggregating data is an important feature of SQL that allows you to perform calculations on large sets of data and retrieve the results in a summarized form. However, it's important to use the right grouping and aggregation techniques to ensure that you are getting the results you expect. To avoid this mistake, it's important to understand the different aggregation functions available in SQL and how to use them. Some common aggregation functions include COUNT, SUM, AVG, and MAX. It's also important to use proper grouping techniques, such as the GROUP BY and HAVING clauses, to ensure that you are grouping the data correctly.

Here is an example of using aggregation and grouping in MySQL:

MySQL
SELECT COUNT(*) AS num_orders, SUM(total_price) AS total_revenue
FROM orders
GROUP BY user_id
HAVING num_orders > 5;

By properly grouping and aggregating your data, you can perform powerful calculations on large sets of data and retrieve the results in a summarized form.

9. Not Optimizing Performance

Performance is important for ensuring that your SQL queries execute efficiently and do not impact the performance of your application. There are various techniques you can use to optimize the performance of your SQL queries, including proper indexing, query optimization, and caching. To avoid this mistake, it's important to carefully consider the performance of your SQL queries and use techniques such as EXPLAIN to analyze them. You should also consider using query optimization tools and techniques, such as covering indexes and query hints, to improve the performance of your queries.

Here is an example of using EXPLAIN to analyze the performance of a SELECT query in MySQL:

MySQL
EXPLAIN SELECT * FROM users WHERE name = 'John';

By optimizing the performance of your SQL queries, you can ensure that your database is performing efficiently and your application is providing a good user experience.

Conclusion

In this article, we've explored some of the most common mistakes made when writing SQL code and how to avoid them. By following best practices and being aware of potential pitfalls, you can write more reliable and efficient SQL code and avoid common mistakes.
Samir Behara, Senior Cloud Infrastructure Architect, AWS
Shai Almog, OSS Hacker, Developer Advocate and Entrepreneur, Codename One
JJ Tang, Co-Founder, Rootly
Sudip Sengupta, Technical Writer, Javelynn