
Maintenance

A developer's work is never truly finished once a feature or change is deployed. There is always a need for constant maintenance to ensure that a product or application continues to run as it should and is configured to scale. This Zone focuses on all your maintenance must-haves — from ensuring that your infrastructure is set up to manage various loads and improving software and data quality to tackling incident management, quality assurance, and more.

Latest Refcards and Trend Reports
Refcard #346
Microservices and Workflow Engines
Refcard #336
Value Stream Management Essentials
Refcard #332
Quality Assurance Patterns and Anti-Patterns

DZone's Featured Maintenance Resources

Dynatrace Perform: Day One


By Tom Smith CORE
I attended Dynatrace Perform 2023. This was my sixth "Perform" user conference, but the first in the last three years. Rick McConnell, CEO of Dynatrace, kicked off the event by sharing his thoughts on the company's momentum and vision. The company is focused on adding value to the IT ecosystem and the cloud environment. As the world continues to change rapidly, this creates breakout opportunities. Dynatrace strives to position its clients well for the upcoming post-Covid and post-recession recovery.

The cloud delivers undeniable benefits for companies and their customers. It enables companies to deliver software and infrastructure much faster, which is why we continue to see the growth of hyper-scale cloud providers. Companies rely on the cloud to take advantage of category and business growth opportunities. However, with that growth comes complexity: 71% of CIOs say it is increasingly difficult to manage all of the data being produced, and it is beyond human ability to manage and make sense of it all. This creates both a need and an opportunity for automation and observability.

There is an increased focus on cloud optimization on multiple fronts. The key areas of focus are reducing costs and driving more reliability and availability to ultimately deliver more value for customers. Observability is moving from an optional "nice to have" to a mandatory "must-have." The industry is at an inflection point, with an opportunity to drive change right now. Organizations need end-to-end observability. Dynatrace approaches the problem in a radically different way: data types need to be looked at collectively and holistically to be more powerful in the management of the ecosystem.

Observability + Security + Business Analytics

The right software intelligence platform can provide end-to-end observability and drive transformational change in businesses by delivering answers and intelligent automation from data. End users are no longer willing to accept poor performance from applications. If your application doesn't work or provides an inferior user experience, your customer will find another provider. As such, it is incumbent on businesses to deliver flawless, secure digital interactions that are performant and great all of the time.

New Product Announcements

Consistent with the vision of "a world where software works perfectly" (not having an incident in the first place), Dynatrace announced four new products today:

Grail data lakehouse expansion: The Dynatrace platform's central data lakehouse technology, which stores, contextualizes, and queries data, now extends beyond logs and business events to encompass metrics, distributed traces, and multi-cloud topology and dependencies. This enhances the platform's ability to store, process, and analyze the tremendous volume and variety of data from modern cloud environments while retaining full data context.

Enhanced user experience: New UX features, such as built-in dashboard functionality and a visual interface, help foster teamwork between technical and business personnel. These features also enable Dynatrace Notebooks, an interactive document capability that allows IT, development, security, and business users to work together using code, text, and multimedia to construct, analyze, and disseminate insights from exploratory, causal-AI analytics projects, ensuring better coordination and decision-making throughout the company.
Dynatrace AutomationEngine: Features an interactive user interface and no-code and low-code tools that let teams use Dynatrace's causal-AI analytics for observability and security insights to automate BizDevSecOps procedures across their multi-cloud environments. This automation platform enables IT teams to detect and solve issues proactively or direct them to the right personnel, saving time and allowing them to concentrate on complex matters that only humans can handle.

Dynatrace AppEngine: Provides IT, development, security, and business teams with the capability to design tailored, consistent, knowledge-informed applications with a user-friendly, minimal-code method. Clients and partners can build custom integrations to sync the Dynatrace platform with technologies across hybrid and multi-cloud environments, unify siloed solutions, and equip more people in their businesses with smart apps that rely on the observability, security, and business knowledge of their ecosystems.

Client Feedback

I had the opportunity to speak with Michael Cabrera, Site Reliability Engineering Leader at Vivint. Michael brought SRE to Vivint after bringing SRE to Home Depot and Delta. Vivint realized they were spending more time firefighting than optimizing, and SRE helps solve this problem. Michael evaluated more than a dozen solutions, comparing features, ease of use, and the comprehensiveness of the platform. Dynatrace was a clear winner. It enables SRE and provides a view into what customers are experiencing that was not available with any other tool. By seeing what customers feel, Michael and his team can be proactive rather than reactive.

The SRE team at Vivint has 12 engineers and 200 developers servicing thousands of employees. Field technicians are in customers' homes, helping them create and live in smarter homes. Technicians are key stakeholders since they are front-facing to end users. Dynatrace is providing Vivint with a tighter loop between what customers experience and what the team can see in the tech stack. It reduces the time spent troubleshooting and firefighting versus optimizing. Development teams can see how their code is performing. Engineers can see how the infrastructure is performing.

Michael feels Grail is a game changer. It allows Vivint to combine logs with business analytics to achieve full end-to-end observability into their entire business. Vivint was a beta tester of the new technology. The tighter feedback loops with deployment showed how the company's engineering policies could further improve. They were able to scale and review the performance of apps and infrastructure, and see more interconnected services and how things align with each other. Dynatrace is helping Vivint manage apps and software through SLOs, which they have been able to set up in a couple of minutes. It's easy to install with one agent, without enabling plug-ins or buying add-ons. SREs can sit with engineering and product teams and show the experience from the tech stack to the customer. It's great for engineering teams to have real-time feedback on performance: they can release code and see the performance before, during, and after. The biggest challenge is having so much more information than before; the team is training to help members know what to do with the information and how to drill down as needed.

Conclusion

I hope you have taken away some helpful information from my day one experience at Dynatrace Perform. To read more about my day two experience, read here.
Developers' Guide: How to Execute Lift and Shift Migration


By Tejas Kaneriya
Are you looking to move your workloads from your on-premises environment to the cloud, but don't know where to start? Migrating your business applications and data to a new environment can be a daunting task, but it doesn't have to be. With the right strategy, you can execute a successful lift and shift migration in no time. Whether you're migrating to a cloud environment or just updating your on-premises infrastructure, this comprehensive guide will cover everything from planning and preparation to ongoing maintenance and support. In this article, I have provided the essential steps to execute a smooth lift and shift migration and make the transition to your new environment as seamless as possible. Preparation for Lift and Shift Migration Assess the Workloads for Migration Before starting the lift and shift migration process, it is important to assess the workloads that need to be migrated. This involves identifying the applications, data, and resources that are required for the migration. This assessment will help in determining the migration strategy, resource requirements, and timeline for the migration. Identify Dependencies and Potential Roadblocks This involves understanding the relationships between the workloads and identifying any dependencies that might impact the migration process. Potential roadblocks could include compatibility issues, security and data privacy concerns, and network limitations. By identifying these dependencies and roadblocks, you can plan and ensure a smooth migration process. Planning for Network and Security Changes Lift and shift migration often involves changes to the network and security configurations. It is important to plan for these changes in advance to ensure the integrity and security of the data being migrated. This includes defining the network architecture, creating firewall rules, and configuring security groups to ensure secure data transfer during the migration process. Lift and Shift Migration Lift and shift migration is a method used to transfer applications and data from one infrastructure to another. The goal is to recreate the current environment with minimum changes, making it easier for users and reducing downtime. Migration Strategies There are several strategies to migrate applications and data. A common approach is to use a combination of tools to ensure accurate and efficient data transfer. One strategy is to utilize a data migration tool. These tools automate the process of transferring data from one environment to another, reducing the risk of data loss or corruption. Some popular data migration tools include AWS Database Migration Service, Azure Database Migration Service, and Google Cloud Data Transfer. Another strategy is to use a cloud migration platform. These platforms simplify the process of moving the entire infrastructure, including applications, data, and networks, to the cloud. Popular cloud migration platforms include AWS Migration Hub, Azure Migrate, and Google Cloud Migrate. Testing and Validation Testing and validation play a crucial role in any migration project, including lift and shift migrations. To ensure success, it's essential to test applications and data before, during, and after the migration process. Before migration, test applications and data in the current environment to identify potential issues. During migration, conduct ongoing testing and validation to ensure accurate data transfer. 
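For example (a minimal sketch, assuming a relational workload reachable with psql from both environments; the hosts, database, and table names below are hypothetical), ongoing validation can be as simple as comparing row counts between the source and target databases after each sync:

Shell
# Hypothetical hosts, database, and tables; adjust to your environment.
for table in customers orders invoices; do
  src=$(psql -h onprem-db -U app -d sales -tAc "SELECT count(*) FROM $table")
  dst=$(psql -h cloud-db -U app -d sales -tAc "SELECT count(*) FROM $table")
  if [ "$src" = "$dst" ]; then
    echo "$table: OK ($src rows)"
  else
    echo "$table: MISMATCH (source=$src target=$dst)"
  fi
done

Checksums over critical columns can catch subtler corruption that plain row counts would miss.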
After the migration is complete, final testing and validation should be done to confirm everything is functioning as expected. Managing and Monitoring Managing and monitoring the migration process is crucial for success. A project plan should be in place outlining the steps, timeline, budget, and resources needed. Understanding the tools and technologies used to manage and monitor the migration process is important, such as migration tools and platforms, and monitoring tools like AWS CloudTrail, Azure Monitor, and Google Cloud Stackdriver. Post-Migration Considerations Once your lift and shift migration is complete, it's important to turn your attention to the post-migration considerations. These considerations will help you optimize your migrated workloads, handle ongoing maintenance and updates, and address any lingering issues or challenges. Optimizing the Migrated Workloads for Performance One of the key post-migration considerations is optimizing the migrated workloads for performance. This is an important step because it ensures that your migrated applications and data are running smoothly and efficiently in the new environment. After a successful migration, it's crucial to ensure that your applications and data perform optimally in the new environment. To achieve this, you need to evaluate their performance in the new setup. This can be done through various performance monitoring tools like AWS CloudWatch, Azure Monitor, and Google Cloud Stackdriver. Upon examining the performance, you can identify areas that need improvement and make the necessary adjustments. This may include modifying the configuration of your applications and data or adding more resources to guarantee efficient performance. Handling Ongoing Maintenance and Updates Another important post-migration consideration is handling ongoing maintenance and updates. This is important because it ensures that your applications and data continue to run smoothly and efficiently, even after the migration is complete. To handle ongoing maintenance and updates, it's important to have a clear understanding of your infrastructure and the tools and technologies that you are using. You should also have a plan in place for how you will handle any updates or changes that may arise in the future. One of the key things to consider when it comes to maintenance and updates is having a regular schedule for updating your applications and data. This will help you stay on top of any changes that may need to be made, and will ensure that your workloads are running optimally at all times. Addressing Any Lingering Issues or Challenges It's crucial to resolve any unresolved problems or difficulties that occurred during the migration process. This guarantees that your applications and data run smoothly and efficiently and that any issues overlooked during migration are dealt with before they become bigger problems. To resolve lingering issues, it is necessary to have a good understanding of your infrastructure and the tools you use. Having a plan in place for handling future issues is also important. A key aspect of resolving lingering issues is to have a monitoring system in place for your applications and data. This helps to identify any problems and respond promptly. When Should You Consider the Lift and Shift Approach? The lift and shift approach allows you to convert capital expenses into operational ones by moving your applications and data to the cloud with little to no modification. 
This method can be beneficial in several scenarios, such as: When you need a complete cloud migration: The lift and shift method is ideal for transferring your existing applications to a more advanced and flexible cloud platform to manage future risks. When you want to save on costs: The lift and shift approach helps you save money by migrating your workloads to the cloud from on-premises with little modifications, avoiding the need for expensive licenses or hiring professionals. When you have limited expertise in cloud-native solutions: This approach is suitable when you need to move your data to the cloud quickly and with minimal investment and you have limited expertise in cloud-native solutions. When you don’t have proper documentation: The lift and shift method is also useful if you lack proper documentation, as it allows you to move your application to the cloud first, and optimize or replace it later. Conclusion Lift and shift migration is a critical step in modernizing legacy applications and taking advantage of the benefits of the cloud. The process can be complex and time-consuming, but careful planning and working with a knowledgeable vendor or using a reliable cloud migration tool can ensure a smooth and successful migration. Organizations can minimize downtime and risk of data loss, while increasing scalability, reliability, and reducing costs. Lift and shift migration is a smart choice for organizations looking to upgrade their technology and benefit from cloud computing. By following the best practices outlined in this article, organizations can achieve their goals and execute a successful lift and shift migration. More
Green Software and Carbon Hack
By Beste Bayhan
The Problem With MTTR: Learning From Incident Reports
By Dan Lines CORE
How to Engineer Your Technical Debt Response
By Jason Bloomberg
Software Maintenance Models

Software maintenance may require different approaches based on your business goals, the industry you operate in, the expertise of your tech team, and predicted market trends. Therefore, along with understanding the different types of software maintenance, you also have to explore the various models of software maintenance. Based on the kind of problem you are trying to solve, your team can choose the right model from the following options:

1. Quick-Fix Model

A quick-fix model in software maintenance is a method for addressing bugs or issues in the software by prioritizing a fast resolution over a more comprehensive solution. This approach typically involves making a small, localized change to the software to address the immediate problem rather than fully understanding and addressing the underlying cause. Organizations adopt this maintenance approach only in emergency situations that call for quick resolutions. Under the quick-fix model, tech teams carry out the following software maintenance activities:

- Annotate software changes by including change IDs and code comments
- Enter changes into a maintenance history detailing why the change was made and the techniques used
- Note each change location and merge the changes via the change ID if there are multiple points in the code change

2. Iterative Enhancement Model

The iterative model is used for small-scale application modernization and scheduled maintenance. Generally, the business justification for changes is ignored in this approach, as it involves only the software development team, not the business stakeholders. So the software team will not know if more significant changes are required in the future, which is quite risky. The iterative enhancement model treats the application target as a known quantity. It incorporates changes in the software based on an analysis of the existing system. The iterative model best suits changes made to confined application targets, with little cross-impact on other applications or organizations.

3. Reuse-Oriented Model

The reuse-oriented model identifies components of the existing system that are suitable for reuse in multiple places. In recent years, this model has also come to include creating components that can be reused in multiple applications of a system. There are three ways to apply the reuse-oriented model: object and function reuse, application system reuse, and component reuse.

- Object and function reuse: This model reuses the software elements that implement a single, well-defined object.
- Application system reuse: Under this model, developers can integrate new components in an application without making changes to the system or re-configuring it for a specific user.
- Component reuse: Component reuse refers to using a pre-existing component rather than creating a new one in software development. This can include using pre-built code libraries, frameworks, or entire software applications.

4. Boehm's Model

Introduced in 1978, Boehm's model focuses on measuring characteristics to get non-technical stakeholders involved with the life cycle of software. The model represents a hierarchical structure of high-level, intermediate, and primitive characteristics of software that define its overall quality.

The high-level characteristics of quality software are:

- Maintainability: It should be easy to understand, evaluate, and modify the processes in a system.
- Portability: Software systems should help in ascertaining the most effective way to make environmental changes.
- As-is utility: It should be easy and effective to use an as-is utility in the system.

The intermediate-level characteristics represented by the model are the factors that validate the expected quality of a software system:

- Reliability: Software performance is as expected, with zero defects.
- Portability: The software can run in various environments and on different platforms.
- Efficiency: The system makes optimum utilization of code, applications, and hardware resources.
- Testability: The software can be tested easily, and users can trust the results.
- Understandability: The end user should be able to understand the functionality of the software easily and thus use it effectively.
- Usability: The effort needed to learn, use, and comprehend different software functions should be minimal.

The primitive characteristics of quality software include basic features like device independence, accessibility, accuracy, etc.

5. Taute Maintenance Model

Developed by B.J. Taute in 1983, the Taute maintenance model helps development teams update and perform necessary modifications after executing the software. The Taute model for software maintenance is carried out in the following phases:

- Change request phase: The client sends the request to make changes to the software in a prescribed format.
- Estimate phase: Developers conduct an impact analysis on the existing system to estimate the time and effort required to make the requested changes.
- Schedule phase: The team aggregates the change requests for the upcoming scheduled release and creates the planning documents accordingly.
- Programming phase: The requested changes are implemented in the source code, and all the relevant documents, like design documents and manuals, are updated accordingly.
- Test phase: The software modifications are carefully analyzed. The code is tested using existing and new test cases, along with regression testing.
- Documentation phase: Before the release, system and user documentation are prepared and updated based on regression testing results. Thus, developers can maintain the coherence of documents and code.
- Release phase: The customer receives the new software product and updated documentation, and the system's end users perform acceptance testing.

Conclusion

Software maintenance is not just a necessary chore but an essential aspect of any successful software development project. By investing in ongoing maintenance and addressing issues as they arise, organizations can ensure that their software remains reliable, secure, and up to date. From bug fixes to performance optimizations, software maintenance is a crucial step in maximizing the value and longevity of your software. So don't overlook this critical aspect of software development: prioritize maintenance and keep your software running smoothly for years to come.

By Hiren Dhaduk
What Is Policy-as-Code? An Introduction to Open Policy Agent

In the cloud-native era, we often hear that "security is job zero," which means it's even more important than any number-one priority. Modern infrastructure and methodologies bring us enormous benefits, but, at the same time, since there are more moving parts, there are more things to worry about: How do you control access to your infrastructure? Between services? Who can access what? Etc. There are many questions to be answered, including policies: a set of security rules, criteria, and conditions. Examples:

- Who can access this resource?
- Which subnet is egress traffic allowed from?
- Which clusters must a workload be deployed to?
- Which protocols are not allowed for servers reachable from the Internet?
- Which registries can binaries be downloaded from?
- Which OS capabilities can a container execute with?
- Which times of day can the system be accessed?

All organizations have policies, since they encode important knowledge about complying with legal requirements, working within technical constraints, avoiding repeated mistakes, and so on. Since policies are so important today, let's dive deeper into how to best handle them in the cloud-native era.

Why Policy-as-Code?

Policies are based on written or unwritten rules that permeate an organization's culture. For example, there might be a written rule in our organizations explicitly saying: for servers accessible from the Internet on a public subnet, it's not good practice to expose a port using the non-secure "HTTP" protocol. How do we enforce it? If we create infrastructure manually, a four-eyes principle may help: always have a second person review anything critical. If we do Infrastructure as Code and create our infrastructure automatically with tools like Terraform, a code review could help. However, the traditional policy enforcement process has a few significant drawbacks:

- You can't guarantee this policy will never be broken. People can't be aware of all the policies at all times, and it's not practical to manually check against a list of policies. For code reviews, even senior engineers will not catch all potential issues every single time.
- Even with the best teams in the world enforcing policies with no exceptions, it's difficult, if not impossible, to scale. Modern organizations are more likely to be agile, which means employees, services, and teams continue to grow. There is no way to physically staff a security team to protect all of those assets using traditional techniques.

Policies could be (and will be) breached sooner or later because of human error. It's not a question of "if" but "when." And that's precisely why most organizations (if not all) do regular security checks and compliance reviews before a major release, for example. We violate policies first and then create ex post facto fixes. I know, this doesn't sound right. What's the proper way of managing and enforcing policies, then? You've probably already guessed the answer, and you are right. Read on.

What Is Policy-as-Code (PaC)?

As business, teams, and maturity progress, we'll want to shift from manual policy definition to something more manageable and repeatable at the enterprise scale. How do we do that? First, we can learn from successful experiments in managing systems at scale:

- Infrastructure-as-Code (IaC): treat the content that defines your environments and infrastructure as source code.
- DevOps: the combination of people, process, and automation to achieve "continuous everything," continuously delivering value to end users.

Policy-as-Code (PaC) is born from these ideas. Policy as code uses code to define and manage policies, which are rules and conditions. Policies are defined, updated, shared, and enforced using code, leveraging Source Code Management (SCM) tools. By keeping policy definitions in source code control, whenever a change is made, it can be tested, validated, and then executed. The goal of PaC is not to detect policy violations but to prevent them. This leverages DevOps automation capabilities instead of relying on manual processes, allowing teams to move more quickly and reducing the potential for mistakes due to human error.

Policy-as-Code vs. Infrastructure-as-Code

The "as code" movement isn't new anymore; it aims at "continuous everything." The concept of PaC may sound similar to Infrastructure as Code (IaC), but while IaC focuses on infrastructure and provisioning, PaC improves security operations, compliance management, data management, and beyond. PaC can be integrated with IaC to automatically enforce infrastructural policies. Now that we've got the PaC vs. IaC question sorted out, let's look at the tools for implementing PaC.

Introduction to Open Policy Agent (OPA)

The Open Policy Agent (OPA, pronounced "oh-pa") is a Cloud Native Computing Foundation incubating project. It is an open-source, general-purpose policy engine that aims to provide a common framework for applying policy-as-code to any domain. OPA provides a high-level declarative language (Rego, pronounced "ray-go," purpose-built for policies) that lets you specify policy as code. As a result, you can define, implement, and enforce policies in microservices, Kubernetes, CI/CD pipelines, API gateways, and more. In short, OPA decouples decision-making from policy enforcement. When a policy decision needs to be made, you query OPA with structured data (e.g., JSON) as input, and OPA returns the decision. (Figure: policy decoupling.)

OK, less talk, more work: show me the code.

Simple Demo: Open Policy Agent Example

Prerequisite

To get started, download an OPA binary for your platform from GitHub releases. On macOS (64-bit):

Shell
curl -L -o opa https://openpolicyagent.org/downloads/v0.46.1/opa_darwin_amd64
chmod 755 ./opa

Tested on an M1 Mac; it works there as well.

Spec

Let's start with a simple example to achieve attribute-based access control (ABAC) for a fictional Payroll microservice. The rule is simple: you can only access your salary information or your subordinates', not anyone else's. So, if you are bob, and john is your subordinate, then you can access the following:

- /getSalary/bob
- /getSalary/john

But accessing /getSalary/alice as user bob would not be possible.

Input Data and Rego File

Let's say we have the structured input data (input.json file):

JSON
{
    "user": "bob",
    "method": "GET",
    "path": ["getSalary", "bob"],
    "managers": {
        "bob": ["john"]
    }
}

And let's create a Rego file.
Here we won't bother too much with the syntax of Rego, but the comments will give you a good understanding of what this piece of code does. File example.rego:

Rego
package example

default allow = false                         # default: not allow

allow = true {                                # allow if:
    input.method == "GET"                     # method is GET
    input.path = ["getSalary", person]
    input.user == person                      # input user is the person
}

allow = true {                                # allow if:
    input.method == "GET"                     # method is GET
    input.path = ["getSalary", person]
    managers := input.managers[input.user][_]
    contains(managers, person)                # input user is the person's manager
}

Run

The following should evaluate to true:

Shell
./opa eval -i input.json -d example.rego "data.example"

Changing the path in the input.json file to "path": ["getSalary", "john"], it still evaluates to true, since the second rule allows a manager to check their subordinates' salary. However, if we change the path in the input.json file to "path": ["getSalary", "alice"], it evaluates to false. Here we go: now we have a simple working solution of ABAC for microservices!

Policy-as-Code Integrations

The example above is very simple and only useful to grasp the basics of how OPA works. But OPA is much more powerful and can be integrated with many of today's mainstream tools and platforms, like:

- Kubernetes
- Envoy
- AWS CloudFormation
- Docker
- Terraform
- Kafka
- Ceph

And more. To quickly demonstrate OPA's capabilities, here is an example of Terraform code defining an auto-scaling group and a server on AWS (the Terraform and Rego snippets are not reproduced in this excerpt). With this Rego code, we can calculate a score based on the Terraform plan and return a decision according to the policy. It's super easy to automate the process:

Shell
terraform plan -out tfplan                        # create the Terraform plan
terraform show -json tfplan | jq > tfplan.json    # convert the plan into JSON format
opa exec --decision terraform/analysis/authz --bundle policy/ tfplan.json    # get the result
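Beyond one-off opa eval runs, the same policy can be served over OPA's REST API so that applications query decisions at runtime. A minimal sketch, reusing example.rego and input.json from above (the port and endpoint follow OPA's documented defaults):

Shell
# Start OPA as a server with the example policy loaded
./opa run --server --addr :8181 example.rego &

# Query a decision: the Data API expects the document wrapped in an "input" field
curl -s -X POST localhost:8181/v1/data/example/allow \
  -H 'Content-Type: application/json' \
  -d "{\"input\": $(cat input.json)}"
# expected output: {"result":true}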

By Tiexin Guo
GitOps: Flux vs Argo CD

GitOps is a software development and operations methodology that uses Git as the source of truth for deployment configurations. It involves keeping the desired state of an application or infrastructure in a Git repository and using Git-based workflows to manage and deploy changes. Two popular open-source tools that help organizations implement GitOps for managing their Kubernetes applications are Flux and Argo CD. In this article, we’ll take a closer look at these tools, their pros and cons, and how to set them up. Common Use Cases for Flux and Argo CD Flux Continuous delivery: Flux can be used to automate the deployment pipeline and ensure that changes are automatically deployed as soon as they are pushed to the Git repository. Configuration management: Flux allows you to store and manage your application’s configuration as code, making it easier to version control and track changes. Immutable infrastructure: Flux helps enforce an immutable infrastructure approach—where changes are made only through the Git repository and not through manual intervention on the cluster. Blue-green deployments: Flux supports blue-green deployments—where a new version of an application is deployed alongside the existing version, and traffic is gradually shifted to the new version. Argo CD Continuous deployment: Argo CD can be used to automate the deployment process, ensuring that applications are always up-to-date with the latest changes from the Git repository. Application promotion: Argo CD supports application promotion—where applications can be promoted from one environment to another. For example, from development to production. Multi-cluster management: Argo CD can be used to manage applications across multiple clusters, ensuring the desired state of the applications is consistent across all clusters. Rollback management: Argo CD provides rollback capabilities, making it easier to revert changes in case of failures. The choice between the two tools depends on the specific requirements of the organization and application, but both tools provide a GitOps approach to simplify the deployment process and reduce the risk of manual errors. They both have their own pros and cons, and in this article, we’ll take a look at what they are and how to set them up. What Is Flux? Flux is a GitOps tool that automates the deployment of applications on Kubernetes. It works by continuously monitoring the state of a Git repository and applying any changes to a cluster. Flux integrates with various Git providers such as GitHub, GitLab, and Bitbucket. When changes are made to the repository, Flux automatically detects them and updates the cluster accordingly. Pros of Flux Automated deployments: Flux automates the deployment process, reducing manual errors and freeing up developers to focus on other tasks. Git-based workflow: Flux leverages Git as a source of truth, which makes it easier to track and revert changes. Declarative configuration: Flux uses Kubernetes manifests to define the desired state of a cluster, making it easier to manage and track changes. Cons of Flux Limited customization: Flux only supports a limited set of customizations, which may not be suitable for all use cases. Steep learning curve: Flux has a steep learning curve for new users and requires a deep understanding of Kubernetes and Git. How To Set Up Flux Prerequisites A running Kubernetes cluster. Helm installed on your local machine. A Git repository for your application's source code and Kubernetes manifests. 
The repository URL and an SSH key for the Git repository.

Step 1: Add the Flux Helm Repository

The first step is to add the Flux Helm repository to your local machine. Run the following command to add the repository:

Shell
helm repo add fluxcd https://charts.fluxcd.io

Step 2: Install Flux

Now that the Flux Helm repository is added, you can install Flux on the cluster. Run the following command to install Flux:

Shell
helm upgrade -i flux fluxcd/flux \
  --set git.url=git@github.com:<your-org>/<your-repo>.git \
  --set git.path=<path-to-manifests> \
  --set git.pollInterval=1m \
  --set git.ssh.secretName=flux-git-ssh

In the above command, replace the placeholder values with your own Git repository information. The git.url parameter is the URL of the Git repository, the git.path parameter is the path to the directory containing the Kubernetes manifests, and the git.ssh.secretName parameter is the name of the secret containing the SSH key for the repository.

Step 3: Verify the Installation

After running the above command, you can verify the installation by checking the status of the Flux pods. Run the following command to view the pods:

Shell
kubectl get pods -n <flux-namespace>

If the pods are running, Flux has been installed successfully.

Step 4: Connect Flux to Your Git Repository

The final step is to connect Flux to your Git repository. Run the following commands to generate an SSH key and create a secret:

Shell
ssh-keygen -t rsa -b 4096 -f id_rsa
kubectl create secret generic flux-git-ssh \
  --from-file=id_rsa=./id_rsa --namespace=<flux-namespace>

In the above command, replace the <flux-namespace> placeholder with the namespace where Flux is installed. Now, add the generated public key as a deploy key in your Git repository. You have successfully set up Flux using Helm. Whenever changes are made to the Git repository, Flux will detect them and update the cluster accordingly.

In conclusion, setting up Flux using Helm is quite a simple process. By using Git as the source of truth and continuously monitoring the state of the cluster, Flux helps simplify the deployment process and reduce the risk of manual errors.

What Is Argo CD?

Argo CD is an open-source GitOps tool that automates the deployment of applications on Kubernetes. It allows developers to declaratively manage their applications and keeps the desired state of the applications in sync with the live state. Argo CD integrates with Git repositories and continuously monitors them for changes. Whenever changes are detected, Argo CD applies them to the cluster, ensuring the application is always up to date. With Argo CD, organizations can automate their deployment process, reduce the risk of manual errors, and benefit from Git's version control capabilities. Argo CD provides a graphical user interface and a command-line interface, making it easy to use and manage applications at scale.

Pros of Argo CD

- Advanced deployment features: Argo CD provides advanced deployment features, such as rolling updates and canary deployments, making it easier to manage complex deployments.
- User-friendly interface: Argo CD provides a user-friendly interface that makes it easier to manage deployments, especially for non-technical users.
- Customizable: Argo CD allows for greater customization, making it easier to fit the tool to specific use cases.

Cons of Argo CD

- Steep learning curve: Argo CD has a steep learning curve for new users and requires a deep understanding of Kubernetes and Git.
- Complexity: Argo CD has a more complex architecture than Flux, which can make it more difficult to manage and troubleshoot.

How To Set Up Argo CD

Argo CD can be installed on a Kubernetes cluster using Helm, a package manager for Kubernetes. In this section, we'll go through the steps to set up Argo CD using Helm.

Prerequisites

- A running Kubernetes cluster.
- Helm installed on your local machine.
- A Git repository for your application's source code and Kubernetes manifests.

Step 1: Add the Argo CD Helm Repository

The first step is to add the Argo CD Helm repository to your local machine. Run the following command to add the repository:

Shell
helm repo add argo https://argoproj.github.io/argo-cd

Step 2: Install Argo CD

Now that the Argo CD Helm repository is added, you can install Argo CD on the cluster. Run the following command to install Argo CD:

Shell
helm upgrade -i argocd argo/argo-cd --set server.route.enabled=true

Step 3: Verify the Installation

After running the above command, you can verify the installation by checking the status of the Argo CD pods. Run the following command to view the pods:

Shell
kubectl get pods -n argocd

If the pods are running, Argo CD has been installed successfully.

Step 4: Connect Argo CD to Your Git Repository

The final step is to connect Argo CD to your Git repository. Argo CD provides a graphical user interface that you can use to create applications and connect to your Git repository. To access the Argo CD interface, run the following command to get the URL:

Shell
kubectl get routes -n argocd

Use the URL in a web browser to access the Argo CD interface. Once you're in the interface, you can create a new application by providing the Git repository URL and the path to the Kubernetes manifests (a command-line alternative is sketched after the conclusion below). Argo CD will continuously monitor the repository for changes and apply them to the cluster. You have now successfully set up Argo CD using Helm.

Conclusion

GitOps is a valuable approach for automating the deployment and management of applications on Kubernetes. Flux and Argo CD are two popular GitOps tools that provide a simple and efficient way to automate the deployment process, enforce an immutable infrastructure, and manage applications in a consistent and predictable way. Flux focuses on automating the deployment pipeline and providing configuration management as code, while Argo CD provides a more complete GitOps solution, including features such as multi-cluster management, application promotion, and rollback management. Both tools have their own strengths and weaknesses, and the choice between the two will depend on the specific requirements of the organization and the application. Regardless of the tool chosen, GitOps provides a valuable approach for simplifying the deployment process and reducing the risk of manual errors. By keeping the desired state of the applications in sync with the Git repository, GitOps ensures that changes are made in a consistent and predictable way, resulting in a more reliable and efficient deployment process.
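For readers who prefer the terminal over the web UI, the argocd CLI can register and sync the same application. A rough sketch (the server URL, repository, path, and app name are placeholders; flags as per the upstream CLI documentation):

Shell
# Log in to the Argo CD API server
argocd login <argocd-server-url> --username admin --password <password>

# Register an application that points at the Git repository and manifest path
argocd app create my-app \
  --repo https://github.com/<your-org>/<your-repo>.git \
  --path <path-to-manifests> \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace default

# Trigger a sync and inspect its status
argocd app sync my-app
argocd app get my-app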

By Kushagra Shandilya
Disaster Recovery Guide for IT Infrastructures

Modern organizations need complex IT infrastructures functioning properly to provide goods and services at the expected level of performance. Therefore, losing critical parts or the whole infrastructure can put the organization on the edge of disappearance. Disasters remain a threat to production processes. What Is a Disaster? A disaster is challenging trouble that instantly overwhelms the capacity of available human, IT, financial and other resources and results in significant losses of valuable assets (for example, documents, intellectual property objects, data, or hardware). In most cases, a disaster is a sudden chain of events causing non-typical threats that are difficult or impossible to stop once the disaster starts. Depending on the type of disaster, an organization needs to react in specific ways. There are three main types of disasters: Natural disasters Technological and human-made disasters Hybrid disasters A natural disaster is the first thing that probably comes to your mind when you hear the word “disaster”. Different types of natural disasters include floods, earthquakes, forest fires, abnormal heat, intense snowfalls, heavy rains, hurricanes and tornadoes, and sea and ocean storms. Technological disaster is the consequence of anything connected with the malfunctions of tech infrastructure, human error, or evil will. The list can include any issue from a software disruption in an organization to a power plant problem causing difficulties in the whole city, region, or even country. These are disasters such as global software disruption, critical hardware malfunction, power outages, and electricity supply problems, malware infiltration (including ransomware attacks), telecommunication issues (including network isolation), military conflicts, terrorism incidents, dam failures, chemical incidents. The third category to mention describes mixed disasters that unite the features of natural and technological factors. For example, a dam failure can cause a flood resulting in a power outage and communication issues across the entire region or country. What Is Disaster Recovery? Disaster recovery (DR) is a set of actions (methodology) that an organization should take to recover and restore operations after a global disruptive event. Major disaster recovery activities focus on regaining access to data, hardware, software, network devices, connectivity, and power supply. DR actions can also cover rebuilding logistics, and relocating staff members and office equipment, in case of damaged or destroyed assets. To create a disaster recovery plan, you need to think over the action sequences to complete during these periods: Before the disaster (building, maintaining, and testing the DR system and policies). During the disaster (applying the immediate response measures to avoid or mitigate asset losses). After the disaster (applying the DR system to restore operation, contacting clients, partners, and officials, and analyzing losses and recovery efficiency). Here are the points to include in your disaster recovery plan. Business Impact Analysis and Risk Assessment Data At this step, you study threats and vulnerabilities typical and most dangerous for your organization. With that knowledge, you can also calculate the probability of a particular disaster occurring, measure potential impacts on your production and implement suitable disaster recovery solutions easier. 
Recovery Objectives: Defined RPO and RTO RPO is the recovery point objective: the parameter defines the amount of data you can lose without a significant impact on production. RTO is the recovery time objective: the longest downtime your organization can tolerate and, thus, the maximum time you can have to complete recovery workflows. Distribution of Responsibilities A team that is aware of every member’s duties in case of disaster is a must-have component of an efficient DR plan. Assemble a special DR team, assign specific roles to every employee and train them to fulfill their roles before an actual disaster strikes. This is the way to avoid confusion and missing links when real action is required to save an organization’s assets and production. DR Site Creation A disaster of any scale or nature can critically damage your main server and production office, making resuming operations there impossible or extraordinarily time-consuming. In this situation, a prepared DR site with replicas of critical workloads is the best choice to minimize RTO and continue providing services to the organization’s clients during and after in an emergency. Failback Preparations Failback, which is the process of returning the workloads back to the main site when the main data center is operational again, can be overlooked when planning disaster recovery. Nevertheless, establishing failback sequences beforehand helps to make the entire process smoother and avoid minor data losses that might happen otherwise. Additionally, keep in mind that a DR site is usually not designed to support your infrastructure’s functioning for a prolonged period. Remote Storage for Crucial Documents and Assets Even small organizations produce and process a lot of crucial data nowadays. Losing hard copies or digital documents can make their recovery time-consuming, expensive, or even impossible. Thus, preparing remote storage (for example, VPS cloud storage for digital docs and protected physical storage for hard copy assets) is a solid choice to ensure the accessibility of important data in case of disaster. You can check the all-in-one solution for VMware disaster recovery at once if you want. Equipment Requirements Noted This DR plan element requires auditing the nodes that enable the functioning of your organization’s IT infrastructure. This includes computers, physical servers, network routers, hard drives, cloud-based server hosting equipment, etc. That knowledge enables you to view the elements required to restore the original state of the IT environment after a disaster. What’s more, you can see the list of equipment required to support at least mission-critical workloads and ensure production continuity when the main resource is unavailable. Communication Channels Defined Ensure enabling a stable and reliable internal communication system for your staff members, management, and DR team. Set the order of communication channels’ usage to deal with the unavailability of the main server and internal network right after a disaster. Response Procedures Outlined In a DR plan, the first hours are critical. Create step-by-step instructions on how to execute DR activities, monitor and conduct processes, failover sequences, system recovery verification, etc. In case a disaster still hits the production center despite all the prevention measures applied, a concentrated and rapid response to a particular event can help mitigate the damage. 
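To make the RPO defined above concrete: an RPO of one hour means backups must be taken at least hourly, and a simple check can raise an alert when the newest backup is older than that. A rough sketch, assuming backups land as timestamped dump files under /backups on a Linux host (the paths, threshold, and mail alert are illustrative):

Shell
#!/bin/sh
# Run from cron, e.g.: 0 * * * * /usr/local/bin/check_rpo.sh
RPO_SECONDS=3600    # a one-hour RPO
latest=$(ls -t /backups/*.dump 2>/dev/null | head -1)
if [ -z "$latest" ]; then
  echo "RPO breach: no backups found in /backups" | mail -s "DR alert" ops@example.com
  exit 1
fi
age=$(( $(date +%s) - $(stat -c %Y "$latest") ))    # seconds since the newest backup
if [ "$age" -gt "$RPO_SECONDS" ]; then
  echo "RPO breach: newest backup is ${age}s old" | mail -s "DR alert" ops@example.com
fi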
Incident Reporting to Stakeholders After a disaster strikes and disrupts your production, not only DR team members should be informed. You also need to notify key stakeholders, including your marketing team, third-party suppliers, partners, and clients. As a part of your disaster recovery plan, create outlines and scripts showing your staff how to inform every critical group regarding its concerns. Additionally, a basic press release created beforehand can help you not to waste time during an actual incident. DR Plan Testing and Adjustment Successful organizations change and expand with time, and their DR plans should be adjusted according to the relevant needs and recovery objectives. Test your plan right after you finish it, and perform additional testing every time you introduce changes. Thus, you can measure the efficiency of a disaster recovery plan and ensure the recoverability of your assets. Optimal DR Strategy Applied The DR strategy can be implemented on a DIY (do it yourself) basis or delegated to a third-party vendor. The former choice is the way to sacrifice reliability in favor of the economy, while the latter one can be more expensive but more efficient. The choice of a DR strategy fully depends on your organization’s features, including the team size, IT infrastructure complexity, budget, risk factors, and desired reliability, among others. Summary A disaster is a sudden destructive event that can render an organization inoperable. Natural, human-made, and hybrid disasters have different levels of predictability, but they are barely preventable at an organization’s level. The only way to ensure the safety of an organization is to create a reliable disaster recovery plan based on the organization’s specific needs. The key elements of a DR plan are: Risk assessment and impact analysis, Defined RPO and RTO DR team responsibilities distributed DR site creation Preparations for failback Remote storage Equipment list Established communication channels Immediate response sequences Incident reporting instructions Disaster recovery testing and adjustment. Optimal DR strategy choice.

By Alex Tray
Multiplying Software Quality Using Three Documentation Types

You know that documentation is crucial in any software development, mainly because it increases quality, decreases the number of meetings, and makes the team more scalable. The question is how to start on a new repository or a project with no documentation. This article will clarify how to start documentation in a regular source code repository. Documentation in the source repository serves as tactical documentation. There is also strategic documentation that covers the architecture, such as the C4 model, a tech radar, and architecture decision records (ADRs), which we won't cover in this tutorial. Before starting: we'll use AsciiDoc instead of Markdown. AsciiDoc has several features to reduce boilerplate and has more capabilities than Markdown. Furthermore, AsciiDoc supports most Markdown syntax; thus, it will be smooth to use and to migrate to AsciiDoc.

README

The README file is on board any Git repository; it is a developer's first contact with the source code. This file must give brief context, such as:

- An introductory paragraph explaining why the repository exists
- The goals, as bullets
- A Getting Started section covering how to use and install the repository (e.g., in a Maven project, the dependency to add)
- A highlight of the API

AsciiDoc
= Project Name
:toc: auto

== Introduction

A paragraph that explains the "why" or reason this project exists.

== Goals

* The goals on bullets
* The second goal

== Getting Started

Your reader gets in here; they need to know how to use and install it.

== The API overview

The coolest features here

== To know more

More references such as books, articles, videos, and so on.

== Samples

* https://github.com/spring-projects/spring-data-commons[Spring data commons]
* https://github.com/eclipse/jnosql[JNoSQL]
* https://github.com/xmolecules/jmolecules[jmolecules]

We have our first file and the overview of the project: what it can and cannot do. The next step is to capture the history of the project and what was released in each version in our subsequent documentation.

Changelog

We can make an analogy with release notes for each version. The changelog gathers all the notable changes in a single file, starting with the latest version; thus, the developer knows what has changed in each version. The main goal is to be easier to read than the Git history. Accordingly, it should record the crucial moments of each release. Briefly, each version has the date and the version number, plus categories of changes such as added, changed, fixed, and removed. The source below shows an example.

AsciiDoc
= Changelog
:toc: auto

All notable changes to this project will be documented in this file.

The format is based on https://keepachangelog.com/en/1.0.0/[Keep a Changelog],
and this project adheres to https://semver.org/spec/v2.0.0.html[Semantic Versioning].

== [Unreleased]

=== Added
- Create

=== Changed
- Changed

=== Removed
- Remove

=== Fixed
- Oops, fixed

== [old-version] - 2022-08-04

Great! Now anyone on the team can see the changes by version and what the source repository does, without going to several meetings or spending time figuring out the purpose of the repository. Whereas those two types live outside the code, let's now go deep inside the code.

Code Documentation

Yes, documentation inside the code helps the maintainability of any software, and it is worth adding as much of it as is useful. The main point is to dispel the idea of fully self-documenting code, which is a utopia in the IT area. You can have good code, good documentation, and tests that also help with documentation.
Use code documentation to complement the source: explain the why of the code design and bring the business context into the code. Remember, the tests also help with documentation. Please don't use documentation merely to restate what the source code already says; that is wasted documentation and a bad practice. In Java, you can explore Javadoc's capabilities (a sample Maven command follows at the end of this article); if you use another language, don't worry, most languages have a specific tool for this.

Conclusion

This article explains three documentation types with which to start a regular source code repository. Please be aware that this is only a start, and documentation may vary with the project, such as OpenAPI with Swagger when we talk about REST APIs. Documentation is crucial, and we can see that several colossal pieces of software, such as the open-source projects behind Java, Linux, Go, etc., have extensive documentation. Indeed, across many projects, proposals, languages, and architectural styles, documentation as a first-class citizen is the expected behavior. I hope you now understand how meaningful documentation and good code are, and I hope you make your work and your company simpler with more documentation in the source code repository.
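As promised above, here is the Javadoc follow-up: in a Maven-based Java project, the standard plugin turns those comments into browsable API documentation (a minimal sketch, assuming the project already builds with Maven):

Shell
# Generate HTML API docs from the Javadoc comments (typically under target/site/apidocs)
mvn javadoc:javadoc

# Or package them as a jar to publish alongside the artifact
mvn javadoc:jar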

By Otavio Santana CORE
The Engineering Leader’s Guide to Code Quality Metrics

Tech moves fast. As deadlines get closer, it can be tempting to code fast and ship as quickly as possible. But fast and good don't always equate. And worse, where there's dud code, there's a chump stuck refactoring their predecessors' dirty code and dealing with a mountain of technical debt. No one wants that to be you. Fortunately, engineering leads can set the standard with practices prioritizing code quality metrics and health. In this article, I'll cover:

Why code quality matters
The most important code quality metrics
How to handle code quality issues

Why Code Quality Matters

Code quality is crucial for software development as it determines how well the codebase functions, scales, and can be maintained over time. Good-quality code ensures the software is stable, secure, and efficient. On the other hand, poor-quality code can lead to bugs, security vulnerabilities, performance issues, and scalability problems.

Code quality impacts the amount of technical debt that accumulates over time. Technical debt refers to the cost of maintaining a codebase developed with suboptimal practices or shortcuts. It accumulates over time, and if left unchecked, it can make it difficult or impossible to maintain or improve the codebase without significant effort or cost.

Due to deadlines, budgets, and other constraints, developers are often forced to make tradeoffs between quality and speed, and accumulating technical debt is inevitable. However, accumulating technical debt imprudently can increase the risk of bugs and security vulnerabilities, make adding features harder, slow development, and even require a complete codebase rewrite in extreme cases.

But how do we avoid getting to this place? No single metric or focus point can fully capture code quality. Rather, it's the combined efforts that help build and maintain healthy codebases. Here are some of the mission-critical metrics you need to prioritize:

Code Churn

Let's be clear: code churn is a normal part of the development process. But the amount of code churn also helps you measure your development team's efficiency and productivity. If the bulk of a dev's time is spent writing and rewriting the same bits of code, it can suggest two things:

The problem code is being tested, workshopped, and refined. This is a good thing, especially when dealing with a sticky problem or challenge that requires regular reworking as part of an evolutionary process. Time spent learning that results in a win is always well spent. Code churn in response to customer feedback is a plus.
There's a bigger problem. Code churn that consistently results in low work output can be a sign of:
Poor programming practices
Lack of experience or skill at self-editing
Low morale or burnout
External woes like ever-changing demands from other departments or external stakeholders (yep, this is why they pay you the big bucks)

Besides talking regularly with your team about their process and holding regular code reviews, you can also measure code churn using tools such as:

Azure DevOps Server
NCover
Minware
Git

Coding Conventions

In most companies, marketing and editorial departments have a style guide, a living document that outlines the rules for written content — think everything from grammar to whether you write 1M or 1 million. Dev teams should also have coding conventions; they might be specific to particular projects or apply across the whole team. After all, code is also about interpersonal communication, not just machines.
And a shared language helps create quality, clean code by improving readability and maintainability. It also helps team members understand each other's code and easily find files or classes. Some common coding conventions:

Naming conventions
Comments
Spaces
Operators
Declarations
Indentation
File organization
Architectural patterns

You might think of using a code linter to highlight code smells or security gaps, but you can also use it to monitor coding conventions. Tools include:

Prettier (JavaScript)
RuboCop (Ruby)
StyleCop (C#)
And a whole lot of open-source options. Get at it!

While at it, spend some time on an agreed-upon approach to documentation. Think of it as passing a torch or baton of knowledge from past you to future you.

Code Complexity

OK, some code is just an intricate maze born of pain. But it may also be an impenetrable jungle that's impossible for anyone to understand except the original person who wrote it. It gets shuffled to the bottom of the pile every time the team talks about refactoring to reduce technical debt. And remember, many devs only stay in one role for a couple of years, so that person could spread their jungle of complex code like a virus across multiple organizations if left unchecked. Terrifying.

But back to business. Simply put, code complexity refers to how difficult code is to understand, maintain, and modify. Overly complex code is also at greater risk of bugs and may resist new additions. There are two main methods of measuring code complexity:

Cyclomatic complexity measures the complexity of a program based on the number of independent paths through the code. It is a quantitative measure of the number of decision points in the code, such as if statements and for/while loops. The more paths through your code, the more potential problem areas that may require further testing or refactoring.
Halstead complexity metrics measure the size and difficulty of a program based on the program length, vocabulary size, the number of distinct operators and operands, the number of statements, and the estimated number of bugs per thousand lines of code. These metrics evaluate software quality and maintainability and predict the effort required to develop, debug, and maintain software systems. That said, Halstead complexity metrics don't factor in other important qualities like code readability, performance, security, and usability, so they're not something to use as a stand-alone tool. Check out Verifysoft if you want to dig into the metrics more.

Code Maintainability

Code maintainability is pretty much what it sounds like: how easy it is to maintain and modify a codebase over time. Poorly maintained code causes delays as it takes longer to update and can stand out as a critical flaw against your competitors. Done well, it pulls together all the good stuff, including code clarity, complexity, naming conventions, code refactoring, documentation, version control, code reviews, and unit testing. So it's about a broader commitment to code quality rather than a single task or tool. And it's always worth remembering that, like refactoring, you'll be maintaining your own and other people's code long after the first write.

Fortunately, there are automation tools aplenty from the get-go when it comes to maintainability. For example, static code analysis tools identify bug risks, anti-patterns, performance issues, and unused code.

New Bugs vs. Closed Bugs

Every bug in your software is a small piece of technical debt that accumulates over time.
To keep track of your technical debt, your engineers must track and document every bug, including when it is fixed. One way to measure your code quality is to compare the number of new bugs discovered to the number of bugs that have been closed. If new bugs consistently outnumber closed bugs, it's a sign that your technical debt is increasing and that you need to take action to manage it effectively.

Tracking new bugs versus closed bugs helps identify potential issues in the development process, such as insufficient testing, poor quality control, or a lack of resources for fixing bugs. By comparing the two, you can make informed decisions about allocating resources, prioritizing bug fixes, and improving your development practices to reduce technical debt and improve code quality over time.

I co-founded Stepsize to fix this problem. Our tool helps modern engineering teams improve code quality by making it easy to identify, track, prioritize, and fix technical debt or code quality issues.

Create, view, and manage issues directly from your codebase with the Stepsize tool. Issues are linked to code, making them easy to understand and prioritize.
Use the codebase visualization tool to understand the distribution of tech debt and code quality issues in the codebase.
Use powerful filters to understand the impact on your codebase, product, team, and business priorities.

Rounding Up

It's important to remember that effective code quality management involves more than just relying on a single metric or tool. Instead, engineering leads need to prioritize embedding a commitment to code quality tasks and tools into the daily workflow to ensure consistent improvement over time. This includes helping team members develop good code hygiene through habit stacking and skill development, which can significantly benefit their careers.

Tracking and prioritizing technical debt is a critical aspect of increasing code quality. By doing so, teams can make a strong business case for refactoring the essential parts of their codebase, leading to more efficient and maintainable software in the long run.

By Cate Lawrence CORE
Using QuestDB to Collect Infrastructure Metrics

One of my favorite things about QuestDB is the ability to write queries in SQL against a high-performance time series database. Since I've been using SQL as my primary query language for basically my entire professional career, it feels natural for me to interact with data using SQL instead of newer proprietary query languages. Combined with QuestDB's custom SQL extensions, its built-in SQL support makes writing complex queries a breeze.

In my life as a cloud engineer, I deal with time series metrics all the time. Unfortunately, many of today's popular metrics databases don't support the SQL query language. As a result, I've become more dependent on pre-built dashboards, and it takes me longer to write my own queries with JOINs, transformations, and temporal aggregations.

QuestDB can be a great choice for ingesting application and infrastructure metrics; it just requires a little more work on the initial setup than the Kubernetes tooling du jour. Despite this extra upfront time investment (which is fairly minimal in the grand scheme of things), I think the benefits of using QuestDB for infrastructure metrics are worth it. In this article, I will demonstrate how we use QuestDB as the main component in this new feature. This should provide enough information for you to also use QuestDB for ingesting, storing, and querying infrastructure metrics in your own clusters.

Architecture

Prometheus is a common time series database that is already installed in many Kubernetes clusters. We will be leveraging its remote write functionality to pipe data into QuestDB for querying and storage. However, since Prometheus remote write does not support the QuestDB-recommended InfluxDB Line Protocol (ILP) as a serialization format, we need a proxy to translate Prometheus-formatted metrics into ILP messages. We will use InfluxData's Telegraf as this translation component. With our data in QuestDB, we can then use SQL to query our metrics using any of the supported methods: the Web Console, the PostgreSQL wire protocol, or the HTTP REST API. Here's a quick overview of the architecture:

(Architecture overview: Prometheus remote write → Telegraf → QuestDB)

Prometheus Remote Write

While Prometheus operates on an interval-based pull model, it also has the ability to push metrics to remote sources. This is known as "remote write" capability, and it is easily configurable in a YAML file. Here's an example of a basic remote write configuration:

YAML
remoteWrite:
  - url: http://default.telegraf.svc:9999/write
    name: questdb-telegraf
    remote_timeout: 10s

This YAML will configure Prometheus to send samples to the specified URL with a 10-second timeout. In this case, we will be forwarding our metrics on to Telegraf, with a custom port and endpoint that we can specify in the Telegraf config (see below for more details). There are also a variety of other remote write options, allowing users to customize timeouts, headers, authentication, and additional relabeling configs before writing to the remote data store. All of the possible options can be found on the Prometheus website.

QuestDB ILP and Telegraf

Now that we have our remote write configured, we need to set up its destination. Installing Telegraf into a cluster is straightforward: just helm install its Helm chart. We do need to configure Telegraf to read from a web socket (where Prometheus is configured to write to) and send the data on to QuestDB for long-term storage. In a Kubernetes deployment, these options can be set in the config section of the Telegraf Helm chart's values.yaml file.
Input Configuration

Since Telegraf will be receiving metrics from Prometheus, we need to open a port that enables communication between the two services. Telegraf has an HTTP listener plugin that allows it to listen for traffic on a specified port. We also need to configure the path of the listener to match our Prometheus remote write URL.

The HTTP listener (v2) supports multiple data formats to consume via its plugin architecture. A full list of options can be found in the Telegraf docs. We will be using the Prometheus Remote Write Parser Plugin to accept our Prometheus messages. Here is how this setup looks in the Telegraf config:

TOML
[[inputs.http_listener_v2]]
  ## Address and port to host HTTP listener on
  service_address = ":9999"

  ## Paths to listen to.
  paths = ["/write"]

  ## Data format to consume.
  data_format = "prometheusremotewrite"

When passing these values to the Helm chart, you can use this YAML specification:

YAML
config:
  inputs:
    - http_listener_v2:
        service_address: ":9999"
        path: "/write"
        data_format: prometheusremotewrite

Output Configuration

We recommend using the InfluxDB Line Protocol (ILP) over TCP to insert data into QuestDB. Luckily, Telegraf includes an ILP output plugin. Unfortunately, this is not a plug-and-play solution. By default, all metrics will be written to a single measurement, prometheus_remote_write, with each individual metric's key being sent over the wire as a field. In practice, this means all of your metrics will be written to a single QuestDB table called prometheus_remote_write. There will then be an additional column for every single metric and field you are capturing. This leads to a large table, with potentially thousands of columns, that's difficult to work with and contains only sparse data, which could negatively impact performance.

To fix this problem, Telegraf provides us with a sample Starlark script that transforms each measurement so that we get a table per metric in QuestDB. This script runs for every metric that Telegraf receives, so the output will be formatted correctly. This is what Telegraf's output config looks like:

TOML
[[outputs.socket_writer]]
  ## Address and port to write to
  address = "tcp://questdb.questdb.svc:9009"

[[processors.starlark]]
  source = '''
def apply(metric):
  if metric.name == "prometheus_remote_write":
    for k, v in metric.fields.items():
      metric.name = k
      metric.fields["value"] = v
      metric.fields.pop(k)
  return metric
'''

As an added benefit of using ILP with QuestDB, we don't have to worry about each metric's fieldset. Over ILP, QuestDB automatically creates tables for new metrics. It also adds new columns for fields that it hasn't seen before and inserts nulls for any missing fields.

Helm Configuration

I've found that the easiest way to configure the values.yaml file is to mount the Starlark script as a volume and add a reference to it in the config. This way we don't need to deal with any whitespace handling or special indentation in our ConfigMap specification.
The output and Starlark Helm configuration would look like this:

YAML
# continued from above
config:
  outputs:
    - socket_writer:
        address: tcp://questdb.questdb.svc:9009
  processors:
    - starlark:
        script: /opt/telegraf/remotewrite.star

We also need to add the volume and mount at the root level of the values.yaml:

YAML
volumes:
  - name: starlark-script
    configMap:
      name: starlark-script

mountPoints:
  - name: starlark-script
    mountPath: /opt/telegraf
    subpath: remotewrite.star

This volume references a ConfigMap that contains the Starlark script from the above example:

YAML
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: starlark-script
data:
  remotewrite.star: |
    def apply(metric):
      ...

Querying Metrics With SQL

QuestDB has some powerful SQL extensions that can simplify writing time series queries. For example, given the standard set of metrics that a typical Prometheus installation collects, we can use QuestDB not only to find the pods with the highest memory usage in a cluster (over a six-month period) but also to find the specific time period when the memory usage spiked. We can even access custom labels to help identify the pods with a human-readable name (instead of the long alphanumeric name assigned to pods by deployments or stateful sets). This is all performed with simple SQL syntax, using JOINs (enhanced by the ASOF keyword) and SAMPLE BY to bucket data into days:

SQL
SELECT
  l.label_app_kubernetes_io_custom_name,
  w.timestamp,
  max(w.value / r.value) as mem_usage
FROM container_memory_working_set_bytes AS w
ASOF JOIN kube_pod_labels AS l ON (w.pod = l.pod)
ASOF JOIN kube_pod_container_resource_limits AS r ON (
  r.pod = w.pod AND
  r.container = w.container
)
WHERE label_app_kubernetes_io_custom_name IS NOT NULL
  AND r.resource = 'memory'
  AND w.timestamp > '2022-06-01'
SAMPLE BY 1d ALIGN TO CALENDAR TIME ZONE 'Europe/Berlin'
ORDER BY mem_usage DESC;

Here's a sample output of that query:

label_app_kubernetes_io_custom_name | timestamp                   | mem_usage
keen austin                         | 2022-07-04T16:18:00.000000Z | 0.999853875401
optimistic banzai                   | 2022-07-12T16:18:00.000000Z | 0.9763028946
compassionate taussig               | 2022-07-11T16:18:00.000000Z | 0.975367909527
cranky leakey                       | 2022-07-11T16:18:00.000000Z | 0.974941994418
quirky morse                        | 2022-07-05T16:18:00.000000Z | 0.95084235665
admiring panini                     | 2022-06-21T16:18:00.000000Z | 0.925567626953

This is only one of many ways that you can use QuestDB to write powerful time-series queries for one-off investigations or to power dashboards.

Metric Retention

Since databases storing infrastructure metrics can grow to extreme sizes over time, it is important to enforce a retention period to free up space by deleting old metrics. Even though QuestDB does not support the traditional DELETE SQL command, you can still implement metric retention by using the DROP PARTITION command.

In QuestDB, data is stored by column on disk and optionally partitioned by a time duration. By default, when ILP ingestion automatically creates a new table, the table is partitioned by DAY. This allows us to drop partitions on a daily basis. If you need a different partitioning scheme, you can create the table with your desired partition period before ingesting any data over ILP, since ALTER TABLE does not support changes to table partitioning. Because ILP automatically adds columns, the table specification can be very simple, with just the table name and a designated timestamp column.
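As a rough sketch of what both steps can look like in QuestDB SQL: the cpu_usage table name and the 30-day window below are hypothetical placeholders, and it is worth double-checking the current QuestDB documentation for the exact CREATE TABLE and DROP PARTITION syntax before relying on this.

SQL
-- Pre-create a metric table with a coarser partitioning scheme before any ILP ingestion.
-- ILP will add the remaining metric columns automatically as data arrives.
CREATE TABLE cpu_usage (timestamp TIMESTAMP, value DOUBLE) TIMESTAMP(timestamp) PARTITION BY MONTH;

-- Later, a scheduled retention job can drop partitions older than, say, 30 days.
ALTER TABLE cpu_usage DROP PARTITION WHERE timestamp < dateadd('d', -30, now());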
Once you've decided on your desired metric retention period, you can create a cron job that removes all partitions older than your oldest retention date. This will help keep your storage usage in check.

Working Example

I have created a working example of this setup in a repo, sklarsa/questdb-metrics-blog-post. The entire example runs in a local Kind cluster. To run the example, execute the following commands:

Shell
git clone https://github.com/sklarsa/questdb-metrics-blog-post.git
cd questdb-metrics-blog-post
./run.sh

After a few minutes, all pods should be ready, and you will see the following prompt:

Plain Text
You can now access QuestDB here: http://localhost:9000
Ctrl-C to exit
Forwarding from 127.0.0.1:9000 -> 9000
Forwarding from [::1]:9000 -> 9000

From here, you can navigate to http://localhost:9000 and explore the metrics that are being ingested into QuestDB. The default Prometheus scrape interval is thirty seconds, so there might not be a ton of data in there yet, but you should see a list of tables, one per metric that we are collecting.

Once you're done, you can clean up the entire experiment by deleting the cluster:

Shell
./cleanup.sh

Conclusion

QuestDB can be a very powerful piece of the cloud engineer's toolkit. It grants you the ability to run complex time-series queries across multiple metrics with unparalleled speed in the world's most ubiquitous query language, SQL. Every second counts when debugging an outage at 2 AM, and reducing the cognitive load of writing queries, as well as their execution time, is a game-changer for me.

By Steve Sklar
The 31 Flavors of Data Lineage and Why Vanilla Doesn’t Cut It

Data lineage, an automated visualization of the relationships for how data flows across tables and other data assets, is a must-have in the data engineering toolbox. Not only is it helpful for data governance and compliance use cases, but it also plays a starring role as one of the five pillars of data observability. Data lineage accelerates a data engineer's ability to understand the root cause of a data anomaly and the potential impact it may have on the business. As a result, data lineage's popularity as a must-have component of modern data tooling has skyrocketed faster than a high schooler with parents traveling out of town for the weekend.

Accordingly, almost all data catalogs have introduced data lineage in the last few years. More recently, some big data cloud providers, such as Databricks and Google (as part of Dataplex), have announced data lineage capabilities. It's great to see that so many leaders in the space, like Databricks and Google, realize the value of lineage for use cases across the data stack, from data governance to discovery. But now that there are multiple solutions offering some flavor of data lineage, the question arises: does it still need to be a required feature within a data quality solution?

The answer is an unequivocal "yes." When it comes to tackling data reliability, vanilla lineage just doesn't cut it. Here's why.

1. Data Lineage Informs Incident Detection and Alerting

Data lineage powers better data quality incident detection and alerting when it's natively integrated within a data observability platform. For example, imagine you have an issue with an upstream table that cascades into multiple other tables across several downstream layers. Do you want your team to get one alert, or do you want to get 15, all for the same incident?

The first option accurately depicts the full context along with a natural point to start your root cause analysis. The second option is akin to receiving 15 pages of a book out of order and hoping your on-call data engineer is able to piece together that they are all part of a single story. As a function of data observability, data lineage pieces together this story automatically, identifying which alert is the climax and which ones are just falling action.

Not to mention, too many superfluous alerts are the quickest route to alert fatigue, scientifically defined as the point where the data engineer rolls their eyes, shakes their head, and moves on to another task. So when your incident management channel in Slack has more than 25 unread messages, all corresponding to the same incident, are you really getting value from your data observability platform?

One way to help combat alert fatigue and improve incident detection is to set alert parameters to only notify you about anomalies with your most important tables. However, without native data lineage, it's difficult and time-consuming to understand which assets truly are important.

One of the keys to operationalizing data observability is to ensure alerts are routed to the right responders: those who best understand the domain and the particular systems in question. Data lineage can help surface and route alerts to the appropriate owners on both the data team and business stakeholder sides of the house.

2. Data Lineage Accelerates Incident Resolution

Data engineers are able to fix broken pipelines and anomalous data faster when data lineage is natively incorporated within the data observability platform.
Without it, you just have a list of incidents and a map of table/field dependencies, neither of which is particularly useful without the other. Without incidents embedded in lineage, those dots aren't connected, and they certainly aren't connected to how data is consumed within your organization.

For example, data lineage is essential to the incident triage process. To butcher a proverb: "If a table experiences an anomaly, but no one consumes data from it, do you care?" Tracing incidents upstream across two different tools is a disjointed process. You don't just want to swim to the rawest upstream table; you want to swim up to the most upstream table where the issue is still present.

Of course, once we arrive at our most upstream table with an anomaly, our root cause analysis process has just begun. Data lineage gives you the where but not always the why. Data teams must now determine if it is:

A systems issue: Did an Airflow job not run? Were there issues with permissions in Snowflake?
A code issue: Did someone modify a SQL query or dbt model that mucked everything up?
A data issue: Did a third party send us garbage data filled with NULLs and other nonsense?

Data lineage is valuable, but it is not a silver bullet for incident resolution. It is at its best when it works within a larger ecosystem of incident resolution tools such as query change detection, high-correlation insights, and anomalous row detection.

3. A Single Pane of Glass

Sometimes vendors say their solution provides "a single pane of glass" with a bit too much robotic reverence and without enough critical thought toward the value provided. Nice to look at, but not very useful. (Image: how I imagine some vendors say "a single pane of glass.")

In the case of data observability, however, a single pane of glass is integral to how efficient and effective your data team can be in its data reliability workflows. I previously mentioned the disjointed nature of cross-referencing your list of incidents against your map of data dependencies. Still, it's important to remember that data pipelines extend beyond a single environment or solution. It's great to know data moved from point A to point B, but your integration points will paint the full story of what happened to it along the way. Not all data lineage is created equal; the integration points and how they are surfaced are among the biggest differentiators.

For example, are you curious how changes in dbt models may have impacted your data quality? Whether a failed Airflow job created a freshness issue? Whether a table feeds a particular dashboard? Well, if you are leveraging lineage from Dataplex or Databricks to resolve incidents across your environment, you'll likely need to spend precious time piecing together that information. Does your team use both Databricks and Snowflake and need to understand how data flows across both platforms? Let's just say I wouldn't hold my breath for that integration anytime soon.

4. The Right Tool for the Right Job

Ultimately, this decision comes down to the advantages of using the right tool for the right job. Sure, your car has a CD player, but it would be pretty inconvenient to sit in your garage every time you'd like to hear some music. Not to mention the sound quality wouldn't be as high, and the integration with your Amazon Music account wouldn't work. The parallel here is the overlap between data observability and data catalog solutions. Yes, both have data lineage features, but they are designed within very different contexts.
For instance, Google developed its lineage features with compliance and governance use cases in mind, and Databricks built its lineage for cataloging and quality across native Databricks environments. So while data lineage may appear similar at first glance (spoiler alert: every platform's graph will have boxes connected by lines), the real magic happens on the double click.

For example, with Databricks, you can start with a high-level overview of the lineage and drill into a workflow. (Note: this covers only internal Databricks workflows, not external orchestrators.) You could then see a failed run time, and another click would take you to the code (not shown). Dataplex data lineage is similar, with a depiction showing the relationships between datasets. The subsequent drill-down allowing you to run an impact analysis is helpful, but for a "reporting and governance" use case.

A data observability solution should take these high-level lineage diagrams a step further, down to the BI level, which, as previously mentioned, is critical for incident impact analysis. On the drill-down, a data observability solution should provide all of the information shown across both tools, plus a full history of queries run on the table, their runtimes, and associated jobs from dbt. Key data insights such as reads/writes, schema changes, users, and the latest row count should be surfaced as well. Additionally, tables can be tagged (perhaps to denote their reliability level) and given descriptions (perhaps to include information on SLAs and other relevant details).

Taking a step beyond comparing lineage UIs for a moment, it's important to also realize that you need a high-level overview of your data health. A data reliability dashboard, fueled by lineage metadata, can help you optimize your data quality investments by revealing your hot spots, uptime/SLA adherence, total incidents, time-to-fixed by domain, and more.

Conclusion: Get the Sundae

As data has become more crucial to business operations, the data space has exploded with many awesome and diverse tools. There are now 31 flavors instead of your typical spread of vanilla, chocolate, and strawberry. This can be as challenging for data engineers as it is exciting. Our best advice is to not get overwhelmed and to let the use case drive the technology rather than vice versa. Ultimately, you will end up with an amazing, if sometimes messy, ice cream sundae with all of your favorite flavors perfectly balanced.

By Lior Gavish
How to Quarterback Data Incident Response

It's Monday morning, and your phone won't stop buzzing. You wake up to messages from your CEO saying, "The numbers in this report don't seem right... again." You and your team drop what you're doing and begin to troubleshoot the issue at hand. However, your team's response to the incident is a mess. Other team members across the organization are repeating efforts, and your CMO is left in the dark while no updates are being sent out to the rest of the organization. As all of this is going on, you get texted by John in Finance about an errant table in his spreadsheet, and by Eleanor in Operations about a query that pulled interesting results. What is a data engineer to do?

If this situation sounds familiar to you, know that you are not alone. All too often, data engineers are saddled with the burden of not just fixing data issues but also prioritizing what to fix, deciding how to fix it, and communicating status as the incident evolves. For many companies, the data team responsibilities underlying this firefighting are often ambiguous, particularly when it comes to answering the question: "Who is managing this incident?"

Sure, data reliability SLAs should be managed by entire teams, but when the rubber hits the road, we need a dedicated persona to help call the shots and make sure these SLAs are met should data break. In software engineering, this role is often defined as an incident commander, and its core responsibilities include:

Flagging incidents to the broader data team and stakeholders early and often
Maintaining a working record of affected data assets or anomalies
Coordinating efforts and assigning responsibilities for a given incident
Circulating runbooks and playbooks as necessary
Assessing the severity and impact of the incident

Data teams should assign rotating incident commanders on a weekly or daily basis, or for specific data sets owned by specific functional teams. Establishing a good, repeatable practice of incident management (one that delegates clear incident commanders) is primarily a cultural process, but investing in automation and maintaining a constant pulse on data health gets you much of the way there. The rest is education. Here are four key steps every incident manager must take when triaging and assessing the severity of a data issue:

1. Route Notifications to the Appropriate Team Members

When responding to data incidents, the way your data organization is structured will impact your incident management workflow and, as a result, the incident commander process. If you sit on an embedded data team, it's much easier to delegate incident response (i.e., the marketing data and analytics team owns all marketing analytics pipelines). If you sit on a centralized data team, fielding and routing these incident alerts to the appropriate owners requires a bit more foresight and planning.

Either way, we suggest you set up dedicated Slack channels for data pipelines owned and maintained by specific members of your data team, inviting relevant stakeholders so they're in the know if critical data they rely on is down. Many teams we work with set up PagerDuty or Opsgenie workflows to ensure that no bases are left uncovered.

2. Assess the Severity of the Incident

Once the pipeline owner is notified that something is wrong with the data, the first step they should take is to assess the severity of the incident.
Because data ecosystems are constantly evolving, there is an abundance of changes that can be introduced into your data pipelines at any given time. While some are harmless (i.e., an expected schema change), some are much more lethal, causing impact to downstream stakeholders (i.e., rows in a critical table dropping from 10,000 to 1,000).

Once your team starts troubleshooting the issue, it is a best practice to tag the issue based on its status: fixed, expected, investigating, no action needed, or false positive. Tagging the issue helps users assess the severity of the incident and also plays a key role in communicating updates to relevant stakeholders in channels that are specific to the data that was affected, so they can take appropriate action.

What if a data asset breaks that isn't important to your company? In fact, what if this data is deprecated? Phantom data haunts even the best data teams, and I can't tell you how many times I have been on the receiving end of an alert for a data issue that, after all of the incident resolution was said and done, just did not matter to the business. So, instead of tackling high-priority problems, I spent hours or even days firefighting broken data only to discover I was wasting my time. We have not used that table since 2019.

So, how do you determine what data matters most to your organization? One increasingly common way teams discover their most critical data sets is by utilizing tools that help them visualize their data lineage. This gives them visibility into how all of their data sets are related when an incident does arise and lets them trace data ownership to alert the right people who might be affected by the issue.

Once your team can figure out the severity of the impact, they will have a better understanding of the priority level of the error. If it is data that directly powers financial insights, or even how well your products are performing, it is likely a very high-priority issue, and your team should stop what they are doing to fix it ASAP. And if it's not, it's time to move on.

3. Communicate Status Updates as Often as Possible

Good communication goes a long way in the heat of responding to a data incident, which is why we have already discussed how and why data teams should create a runbook that walks through, step by step, how to handle a given type of incident. Following a runbook is crucial to maintaining correct lines of responsibility and reducing duplication of effort. Once you have "who does what" down, your team can then start updating a status page where stakeholders can follow along for updates in real time. A central status page also allows team members to see what others are working on and the current status of those incidents.

In talks with customers, I have seen incident command delegation handled in one of two ways:

Assign a team member to be on call to handle any incidents during a given time period: While on call, that person is responsible for handling all types of data incidents. Some teams have someone full time who does this for all incidents their team manages, while others have a schedule in place that rotates team members every week.
Make team members responsible for covering certain tables: This is the most common structure we see. With this structure, team members handle all incidents related to their assigned tables or reports while doing their normal daily activities.
Table assignment is generally aligned with the data or pipelines a given team member works with most closely. One important thing to keep in mind is that there is no right or wrong way here. Ultimately, it is just a matter of making sure that you commit to a process and stick with it.

4. Define and Align on Data SLAs and SLIs to Prevent Future Incidents and Downtime

While the incident commander is not accountable for setting SLAs, they are often held responsible for meeting them. Simply put, service-level agreements (SLAs) are a method many companies use to define and measure the level of service a given vendor, product, or internal team will deliver, as well as potential remedies if they fail to deliver. For example, Slack's customer-facing SLA promises 99.99% uptime every fiscal quarter, and no more than 10 hours of scheduled downtime, for customers on Plus plans and above. If they fall short, affected customers will receive service credits on their accounts for future use.

Your service-level indicators (SLIs), the quantitative measures of your SLAs, will depend on your specific use case, but here are a few metrics used to quantify incident response and data quality:

The number of data incidents for a particular data asset (N): Although this may be beyond your control, given that you likely rely on external data sources, it's still an important driver of data downtime and usually worth measuring.
Time-to-detection (TTD): When an issue arises, this metric quantifies how quickly your team is alerted. If you don't have proper detection and alerting methods in place, this could be measured in weeks or even months. "Silent errors" made by bad data can result in costly decisions, with repercussions for both your company and your customers.
Time-to-resolution (TTR): When your team is alerted to an issue, this measures how quickly you were able to resolve it.

By keeping track of these, data teams can work to reduce TTD and TTR and, in turn, build more reliable data systems.

Why Data Incident Commanders Matter

When it comes to responding to data incidents, time is of the essence, and as the incident commander, time is both your enemy and your best friend. In an ideal world, companies want data issues to be resolved as quickly as possible. However, that is not always the case, and some teams often find themselves investigating data issues more frequently than they would like. In fact, while data teams invest a large amount of their time writing and updating custom data tests, they still experience broken pipelines. An incident commander, armed with the right processes, a pinch of automation, and organizational support, can work wonders for the reliability of your data pipelines. Your CEO will thank you later.

By Glen Willis
Common Mistakes to Avoid When Writing SQL Code

SQL (Structured Query Language) is a powerful and widely used language for managing and manipulating data stored in relational databases. However, it's important to be aware of common mistakes that can lead to bugs, security vulnerabilities, and poor performance in your SQL code. In this article, we'll explore some of the most common mistakes made when writing SQL code and how to avoid them.

1. Not Properly Sanitizing User Input

One common mistake made when writing SQL code is not properly sanitizing user input. This can lead to security vulnerabilities such as SQL injection attacks, where malicious users can inject harmful code into your database. To avoid this mistake, it's important to always sanitize and validate user input before using it in your SQL queries. This can be done using techniques such as prepared statements and parameterized queries, which allow you to pass parameters to your queries in a secure manner. Here is an example of using a prepared statement with MySQL:

PHP
$mysqli = new mysqli("localhost", "username", "password", "database");

// Create a prepared statement
$stmt = $mysqli->prepare("SELECT * FROM users WHERE email = ? AND password = ?");

// Bind the parameters
$stmt->bind_param("ss", $email, $password);

// Execute the statement
$stmt->execute();

// Fetch the results
$result = $stmt->get_result();

By properly sanitizing and validating user input, you can help protect your database from security vulnerabilities and ensure that your SQL code is reliable and robust.

2. Not Using Proper Indexes

Proper indexing is important for optimizing the performance of your SQL queries. Without proper indexes, your queries may take longer to execute, especially if you have a large volume of data. To avoid this mistake, it's important to carefully consider which columns to index and how to index them. You should also consider the data distribution and query patterns of your tables when choosing which columns to index.

For example, if you have a table with a large number of rows and you frequently search for records based on a specific column, it may be beneficial to create an index on that column. On the other hand, if you have a small table with few rows and no specific search patterns, creating an index may not provide much benefit. It's also important to consider the trade-offs of different index types, such as B-tree, hash, and full-text indexes. Each type of index has its own benefits and drawbacks, and it's important to choose the right index based on your needs.

3. Not Using Proper Data Types

Choosing the right data type for your columns is important for ensuring that your data is stored efficiently and accurately. Using the wrong data type can lead to issues such as data loss, incorrect sorting, and poor performance. For example, using a VARCHAR data type for a column that contains only numeric values may result in slower queries and increased storage requirements. On the other hand, using an INT data type for a column that contains large amounts of text data may result in data loss.

To avoid this mistake, it's important to carefully consider the data types of your columns and choose the right data type based on the type and size of the data you are storing. It's also a good idea to review the data types supported by your database system and choose the most appropriate data type for your needs.
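To tie the indexing and data type advice together, here is a minimal sketch in MySQL-flavored SQL. The table and index names are illustrative only; the point is to give each column a type that matches its data and to index the columns you actually filter or join on.

SQL
-- Numeric and temporal values get numeric/temporal types, not VARCHAR
CREATE TABLE orders (
  id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  user_id     BIGINT UNSIGNED NOT NULL,
  total_price DECIMAL(10, 2) NOT NULL,
  created_at  DATETIME NOT NULL
);

-- Index the column you frequently filter or join on
CREATE INDEX idx_orders_user_id ON orders (user_id);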
4. Not Properly Normalizing Your Data

Proper data normalization is important for ensuring that your data is organized efficiently and with reduced redundancy. Without proper normalization, you may end up with data that is duplicated, difficult to update, or prone to inconsistencies. To avoid this mistake, it's important to follow proper normalization principles, such as breaking up large tables into smaller ones and creating relationships between them using foreign keys. You should also consider the needs of your application and the type of data you are storing when deciding how to normalize your data.

For example, if you have a table with a large number of columns and many of the columns are optional or only contain a few distinct values, it may be beneficial to break up the table into smaller ones and create relationships between them using foreign keys.

5. Not Using Proper SQL Syntax

SQL has a specific syntax that must be followed in order for your queries to execute correctly. Failing to use proper syntax can lead to syntax errors and incorrect query results. To avoid this mistake, it's important to carefully review the syntax of your SQL queries and ensure that you are using the correct syntax for the specific database system you are using. It's also a good idea to use a SQL linter or syntax checker to identify any issues with your syntax.

6. Not Properly Organizing and Formatting Your Code

Proper code organization and formatting are important for making your SQL code easier to read and understand. Without proper organization, your code may be difficult to maintain and debug. To avoid this mistake, it's a good idea to follow standard SQL coding practices, such as using proper indentation, using uppercase for SQL keywords, and using descriptive names for your tables and columns. It's also a good idea to use a code formatter to automatically format your code to follow these practices. By following proper code organization and formatting practices, you can make your SQL code easier to read and maintain.

7. Not Using Transactions Properly

Transactions are an important feature of SQL that allow you to group multiple queries together and either commit or roll back the entire group as a single unit. Failing to use transactions properly can lead to inconsistencies in your data and make it more difficult to recover from errors. To avoid this mistake, it's important to understand how transactions work and use them appropriately. This includes understanding the isolation levels of your database system and using the correct level for your needs. It's also a good idea to use savepoints within your transactions to allow for finer-grained control over the rollback of individual queries. Here is an example of using transactions in MySQL:

PHP
$mysqli = new mysqli("localhost", "username", "password", "database");

// Start a transaction
$mysqli->begin_transaction();

// Execute some queries
$mysqli->query("INSERT INTO users (name, email) VALUES ('John', 'john@example.com')");
$mysqli->query("INSERT INTO orders (user_id, product_id) VALUES (LAST_INSERT_ID(), 123)");

// Commit the transaction
$mysqli->commit();

By using transactions properly, you can ensure the consistency and integrity of your data and make it easier to recover from errors.
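Savepoints, mentioned above, give you that finer-grained control. Here is a minimal MySQL sketch of the idea; the order_audit table is hypothetical and only serves to show a partial rollback within a larger transaction.

SQL
START TRANSACTION;

INSERT INTO orders (user_id, product_id) VALUES (42, 123);

-- Mark a point we can roll back to without losing the order insert
SAVEPOINT after_order;

INSERT INTO order_audit (order_id, note) VALUES (LAST_INSERT_ID(), 'order created');

-- If the audit insert turns out to be wrong, undo only that part...
ROLLBACK TO SAVEPOINT after_order;

-- ...and still commit the rest of the transaction
COMMIT;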
8. Not Properly Grouping and Aggregating Data

Grouping and aggregating data is an important feature of SQL that allows you to perform calculations on large sets of data and retrieve the results in a summarized form. However, it's important to use the right grouping and aggregation techniques to ensure that you are getting the results you expect. To avoid this mistake, it's important to understand the different aggregation functions available in SQL and how to use them. Some common aggregation functions include COUNT, SUM, AVG, and MAX. It's also important to use proper grouping techniques, such as the GROUP BY and HAVING clauses, to ensure that you are grouping the data correctly. Here is an example of using aggregation and grouping in MySQL:

MySQL
SELECT COUNT(*) as num_orders, SUM(total_price) as total_revenue
FROM orders
GROUP BY user_id
HAVING num_orders > 5

By properly grouping and aggregating your data, you can perform powerful calculations on large sets of data and retrieve the results in a summarized form.

9. Not Optimizing Performance

Performance is important for ensuring that your SQL queries execute efficiently and do not impact the performance of your application. There are various techniques you can use to optimize the performance of your SQL queries, including proper indexing, query optimization, and caching. To avoid this mistake, it's important to carefully consider the performance of your SQL queries and use techniques such as EXPLAIN to analyze their performance. You should also consider using query optimization tools and techniques, such as covering indexes and query hints, to improve the performance of your queries. Here is an example of using EXPLAIN to analyze the performance of a SELECT query in MySQL:

MySQL
EXPLAIN SELECT * FROM users WHERE name = 'John';

By optimizing the performance of your SQL queries, you can ensure that your database is performing efficiently and your application is providing a good user experience.

Conclusion

In this article, we've explored some of the most common mistakes made when writing SQL code and how to avoid them. By following best practices and being aware of potential pitfalls, you can write more reliable and efficient SQL code and avoid common mistakes.

By Theophilus Kolawole

Top Maintenance Experts


Samir Behara

Senior Cloud Infrastructure Architect,
AWS

Samir Behara builds software solutions using cutting-edge technologies. He is a Microsoft Data Platform MVP with over 15 years of IT experience. Samir is a frequent speaker at technical conferences and the Co-Chapter Lead of the Steel City SQL Server User Group. He writes at www.samirbehara.com.

Shai Almog

OSS Hacker, Developer Advocate and Entrepreneur,
Codename One

Software developer with ~30 years of professional experience in a multitude of platforms/languages. JavaOne rockstar/highly rated speaker, author, blogger and open source hacker. Shai has extensive experience in the full stack of backend, desktop and mobile. This includes going all the way into the internals of VM implementation, debuggers etc. Shai started working with Java in 96 (the first public beta) and later on moved to VM porting/authoring/internals and development tools. Shai is the co-founder of Codename One, an Open Source project allowing Java developers to build native applications for all mobile platforms in Java. He's the coauthor of the open source LWUIT project from Sun Microsystems and has developed/worked on countless other projects both open source and closed source. Shai is also a developer advocate at Lightrun.

JJ Tang

Co-Founder,
Rootly


Sudip Sengupta

Technical Writer,
Javelynn


The Latest Maintenance Topics

Best Practices for Writing Clean and Maintainable Code
This article discusses best practices for writing clean and maintainable code in software development.
March 23, 2023
by hussain sabir
· 878 Views · 1 Like
How Data Scientists Can Follow Quality Assurance Best Practices
Data scientists must follow quality assurance best practices in order to determine accurate findings and influence informed decisions.
March 19, 2023
by Devin Partida
· 2,631 Views · 1 Like
Software Maintenance Models
Explore the different software maintenance models and learn how they can help you manage your software and keep it up-to-date.
March 16, 2023
by Hiren Dhaduk
· 1,103 Views · 1 Like
Tools to Track and Manage Technical Debt
It's hard to make the right decision. We will look at the tools to track small, medium, and large pieces of debt and the process to reduce technical debt.
March 15, 2023
by Alex Omeyer
· 10,067 Views · 3 Likes
All the Cloud’s a Stage and All the WebAssembly Modules Merely Actors
In this post, we’ll take a look at the notion of actors creating more actors and how wasmCloud accomplishes the same goals but without manual supervision tree management.
March 15, 2023
by Kevin Hoffman
· 2,948 Views · 1 Like
Key Elements of Site Reliability Engineering (SRE)
This article discusses the key elements of SRE—the importance of SRE in improving user experience, system efficiency, scalability, reliability, etc.
March 14, 2023
by Srikarthick Vijayakumar
· 2,513 Views · 5 Likes
How to Engineer Your Technical Debt Response
Threat engineering provides a proactive, best practice-based approach to breaking down the organizational silos that naturally form around different types of risks.
March 13, 2023
by Jason Bloomberg
· 2,253 Views · 2 Likes
Developers' Guide: How to Execute Lift and Shift Migration
This article reviews lift and shift migration, how to prepare your application, lift and shift migration strategies, and post migration considerations.
March 11, 2023
by Tejas Kaneriya
· 3,382 Views · 1 Like
Understanding Technical Debt for Software Teams
What is technical debt? How do you fix it? Stay competitive in the market with the best practices and explore ways to remediate technical debt. Learn more.
March 10, 2023
by Rajiv Srivastava
· 2,641 Views · 1 Like
Unlock the Power of Terragrunt’s Hierarchy
In this article, readers will learn about Terragrunt (associated with Terraform) and how to unlock the power of Terragrunt’s hierarchy, including code.
March 9, 2023
by lidor ettinger
· 4,463 Views · 1 Like
Hybrid File Integration on AWS, Technical Debt, and Solution Approach
Some of the key architecture decisions regarding hybrid integration are with reference to FileShare between the cloud and on-prem systems/users.
March 9, 2023
by Sukanta Paul
· 4,751 Views · 1 Like
What’s New in Flutter 3.7?
Discover the latest updates and enhancements in Flutter 3.7. Stay up-to-date with the latest developments in Flutter technology.
March 9, 2023
by Bernard Maina
· 2,859 Views · 2 Likes
Green Software and Carbon Hack
If we manage to write greener code, our software projects will be more robust, reliable, faster, and brand resilient.
March 9, 2023
by Beste Bayhan
· 3,254 Views · 1 Like
SaaS vs PaaS vs IaaS: Which Cloud Service Is Suitable for You
Of the three cloud service models, SaaS, PaaS, and IaaS, which is the most suitable option for your application? Let's figure it out together in this article.
March 8, 2023
by Nanthini .
· 1,563 Views · 1 Like
Maven Troubleshooting, Unstable Builds, and Open-Source Infrastructure
This is a story about unstable builds and troubleshooting. More importantly, this story is written to thank all contributors to basic software infrastructure — the infrastructure we all use and take for granted.
March 7, 2023
by Jaromir Hamala
· 3,129 Views · 1 Like
How To Solve Technical Debt: A Guide for Leaders
This article presents effective strategies and best practices for engineering leaders to manage and tackle technical debt head-on.
March 1, 2023
by Alex Omeyer
· 1,765 Views · 2 Likes
Streaming-First Infrastructure for Real-Time Machine Learning
In this article, we’ll discuss the motives, difficulties, and potential solutions for machine learning’s current state of continual learning.
March 1, 2023
by Shay Bratslavsky
· 2,642 Views · 1 Like
DevOps for Developers — Introduction and Version Control
Improving our DevOps skills can help us become better developers, teammates, and managers. Learn DevOps principles and a different perspective on Git.
March 1, 2023
by Shai Almog CORE
· 3,068 Views · 3 Likes
There’s a Better Way To Deploy Code: Let’s Share It
Gatekeeping best practices within the developer community benefits no one. If there's a better way to deploy code, we should share it.
February 28, 2023
by Andrew Backes
· 2,572 Views · 1 Like
Observability-Driven Development vs Test-Driven Development
This article briefly explains what ODD and TDD mean, the similarities and differences between them, and best practices for implementation.
February 28, 2023
by Hiren Dhaduk
· 3,599 Views · 1 Like
