While debugging in an IDE or using simple command line tools is relatively straightforward, the real challenge lies in production debugging. Modern production environments have enabled sophisticated self-healing deployments, yet they have also made troubleshooting more complex. Kubernetes (aka k8s) is probably the most well-known orchestration production environment. To effectively teach debugging in Kubernetes, it's essential to first introduce its fundamental principles. This part of the debugging series is designed for developers looking to effectively tackle application issues within Kubernetes environments, without delving deeply into the complex DevOps aspects typically associated with its operations. Kubernetes is a big subject: it took me two videos just to explain the basic concepts and background. Introduction to Kubernetes and Distributed Systems Kubernetes, while often discussed in the context of cloud computing and large-scale operations, is not just a tool for managing containers. Its principles apply broadly to all large-scale distributed systems. In this post I want to explore Kubernetes from the ground up, emphasizing its role in solving real-world problems faced by developers in production environments. The Evolution of Deployment Technologies Before Kubernetes, the deployment landscape was markedly different. Understanding this evolution helps us appreciate the challenges Kubernetes aims to solve. The image below represents the road to Kubernetes and the technologies we passed along the way. In the image, we can see that initially, applications were deployed directly onto physical servers. This process was manual, error-prone, and difficult to replicate across multiple environments. For instance, if a company needed to scale its application, it involved procuring new hardware, installing operating systems, and configuring the application from scratch. This could take weeks or even months, leading to significant downtime and operational inefficiencies. Imagine a retail company preparing for the holiday season surge. Each time they needed to handle increased traffic, they would manually set up additional servers. This was not only time-consuming but also prone to human error. Scaling down after the peak period was equally cumbersome, leading to wasted resources. Enter Virtualization Virtualization technology introduced a layer that emulated the hardware, allowing for easier replication and migration of environments but at the cost of performance. However, fast virtualization enabled the cloud revolution. It lets companies like Amazon lease their servers at scale without compromising their own workloads. Virtualization involves running multiple operating systems on a single physical hardware host. Each virtual machine (VM) includes a full copy of an operating system, the application, necessary binaries, and libraries—taking up tens of GBs. VMs are managed via a hypervisor, such as VMware's ESXi or Microsoft's Hyper-V, which sits between the hardware and the operating system and is responsible for distributing hardware resources among the VMs. This layer adds additional overhead and can lead to decreased performance due to the need to emulate hardware. Note that virtualization is often referred to as "virtual machines," but I chose to avoid that terminology due to the focus of this blog on Java and the JVM where a virtual machine is typically a reference to the Java Virtual Machine (JVM). 
Rise of Containers Containers emerged as a lightweight alternative to full virtualization. Tools like Docker standardized container formats, making it easier to create and manage containers without the overhead associated with traditional virtual machines. Containers encapsulate an application’s runtime environment, making them portable and efficient. Unlike virtualization, containerization encapsulates an application in a container with its own operating environment, but it shares the host system’s kernel with other containers. Containers are thus much more lightweight, as they do not require a full OS instance; instead, they include only the application and its dependencies, such as libraries and binaries. This setup reduces the size of each container and improves boot times and performance by removing the hypervisor layer. Containers operate using several key Linux kernel features: Namespaces: Containers use namespaces to provide isolation for global system resources between independent containers. This includes aspects of the system like process IDs, networking interfaces, and file system mounts. Each container has its own isolated namespace, which gives it a private view of the operating system with access only to its resources. Control groups (cgroups): Cgroups further enhance the functionality of containers by limiting and prioritizing the hardware resources a container can use. This includes parameters such as CPU time, system memory, network bandwidth, or combinations of these resources. By controlling resource allocation, cgroups ensure that containers do not interfere with each other’s performance and maintain the efficiency of the underlying server. Union file systems: Containers use union file systems, such as OverlayFS, to layer files and directories in a lightweight and efficient manner. This system allows containers to appear as though they are running on their own operating system and file system, while they are actually sharing the host system’s kernel and base OS image. Rise of Orchestration As containers began to replace virtualization due to their efficiency and speed, developers and organizations rapidly adopted them for a wide range of applications. However, this surge in container usage brought with it a new set of challenges, primarily related to managing large numbers of containers at scale. While containers are incredibly efficient and portable, they introduce complexities when used extensively, particularly in large-scale, dynamic environments: Management overhead: Manually managing hundreds or even thousands of containers quickly becomes unfeasible. This includes deployment, networking, scaling, and ensuring availability and security. Resource allocation: Containers must be efficiently scheduled and managed to optimally use physical resources, avoiding underutilization or overloading of host machines. Service discovery and load balancing: As the number of containers grows, keeping track of which container offers which service and how to balance the load between them becomes critical. Updates and rollbacks: Implementing rolling updates, managing version control, and handling rollbacks in a containerized environment require robust automation tools. To address these challenges, the concept of container orchestration was developed. Orchestration automates the scheduling, deployment, scaling, networking, and lifecycle management of containers, which are often organized into microservices. 
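Before moving on to orchestration tooling, the namespace and cgroup mechanics described above can be observed directly from the Docker CLI. The following is a minimal sketch, not part of the original text; the image and limit values are arbitrary examples:

# Start a container with explicit cgroup limits: the kernel, not the application,
# enforces the CPU and memory caps described above.
docker run --rm --cpus="0.5" --memory="256m" alpine:3 sh -c "cat /proc/self/cgroup"

# Namespaces give the container a private view of the system: inside it, the
# process table starts over, so this command reports only the container's own processes.
docker run --rm alpine:3 ps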
Efficient orchestration tools help ensure that the entire container ecosystem is healthy and that applications are running as expected. Enter Kubernetes Among the orchestration tools, Kubernetes emerged as a frontrunner due to its robust capabilities, flexibility, and strong community support. Kubernetes offers several features that address the core challenges of managing containers: Automated scheduling: Kubernetes intelligently schedules containers on the cluster’s nodes, taking into account the resource requirements and other constraints, optimizing for efficiency and fault tolerance. Self-healing capabilities: It automatically replaces or restarts containers that fail, ensuring high availability of services. Horizontal scaling: Kubernetes can automatically scale applications up and down based on demand, which is essential for handling varying loads efficiently. Service discovery and load balancing: Kubernetes can expose a container using the DNS name or using its own IP address. If traffic to a container is high, Kubernetes is able to load balance and distribute the network traffic so that the deployment is stable. Automated rollouts and rollbacks: Kubernetes allows you to describe the desired state for your deployed containers using declarative configuration, and can change the actual state to the desired state at a controlled rate, such as to roll out a new version of an application. Why Kubernetes Stands Out Kubernetes not only solves practical, operational problems associated with running containers but also integrates with the broader technology ecosystem, supporting continuous integration and continuous deployment (CI/CD) practices. It is backed by the Cloud Native Computing Foundation (CNCF), ensuring it remains cutting-edge and community-focused. There used to be a site called "doyouneedkubernetes.com," and when you visited that site, it said, "No." Most of us don't need Kubernetes and it is often a symptom of Resume Driven Design (RDD). However, even when we don't need its scaling capabilities the advantages of its standardization are tremendous. Kubernetes became the de-facto standard and created a cottage industry of tools around it. Features such as observability and security can be plugged in easily. Cloud migration becomes arguably easier. Kubernetes is now the "lingua franca" of production environments. Kubernetes For Developers Understanding Kubernetes architecture is crucial for debugging and troubleshooting. The following image shows the high-level view of a Kubernetes deployment. There are far more details in most tutorials geared towards DevOps engineers, but for a developer, the point that matters is just "Your Code" - that tiny corner at the edge. In the image above we can see: Master node (represented by the blue Kubernetes logo on the left): The control plane of Kubernetes, responsible for managing the state of the cluster, scheduling applications, and handling replication Worker nodes: These nodes contain the pods that run the containerized applications. Each worker node is managed by the master. Pods: The smallest deployable units created and managed by Kubernetes, usually containing one or more containers that need to work together These components work together to ensure that an application runs smoothly and efficiently across the cluster. Kubernetes Basics In Practice Up until now, this post has been theory-heavy. Let's now review some commands we can use to work with a Kubernetes cluster. 
First, we would want to list the pods we have within the cluster, which we can do using the get pods command as such:

$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
my-first-pod-id-xxxx    1/1     Running   0          13s
my-second-pod-id-xxxx   1/1     Running   0          13s

A command such as kubectl describe pod returns a high-level description of the pod such as its name, parent node, etc. Many problems in production pods can be solved by looking at the system log. This can be accomplished by invoking the logs command:

$ kubectl logs -f <pod>
[2022-11-29 04:12:17,262] INFO log data
...

Most typical large-scale application logs are ingested by tools such as Elastic, Loki, etc. As such, the logs command isn't as useful in production except for debugging edge cases. Final Word This introduction to Kubernetes has set the stage for deeper exploration into specific debugging and troubleshooting techniques, which we will cover in the upcoming posts. The complexity of Kubernetes makes it much harder to debug, but there are facilities in place to work around some of that complexity. While this article (and its follow-ups) focuses on Kubernetes, future posts will delve into observability and related tools, which are crucial for effective debugging in production environments.
Over the years, Docker containers have completely changed how developers create, share, and run applications. With their flexible design, Docker containers ensure a consistent environment across various platforms, simplifying the process of deploying applications reliably. When integrated with .NET, developers can harness Docker's capabilities to streamline the development and deployment phases of .NET applications. This article delves into the advantages of using Docker containers with .NET applications and offers a guide on getting started. Figure courtesy of Docker Why Choose Docker for .NET? 1. Consistent Development Environment Docker containers encapsulate all dependencies and configurations for running an application, guaranteeing consistency across development, testing, and production environments. By leveraging Docker, developers can avoid the typical "it works on my machine" issue, as they can create environments that operate flawlessly across various development teams and devices. 2. Simplified Dependency Management Docker eliminates the need to manually install and manage dependencies on developer machines. By specifying dependencies in a Dockerfile, developers can effortlessly bundle their .NET applications with libraries and dependencies, reducing setup time and minimizing compatibility issues. 3. Scalability and Resource Efficiency Thanks to its lightweight containerization technology, Docker is well suited for horizontally or vertically scaling .NET applications. Developers can easily set up additional instances of their applications using Docker Swarm or Kubernetes, which helps optimize resource usage and enhance application performance. 4. Simplified Deployment Process Docker simplifies the deployment of .NET applications. Developers can wrap their applications into Docker images that can be deployed to any Docker-compatible environment, including local servers, cloud platforms like AWS or Azure, and even IoT devices. This not only streamlines the deployment process but also accelerates the release cycle of .NET applications. Starting With Docker and .NET Step 1: Installing Docker Installing Docker is as easy as downloading Docker Desktop, which is available for Windows, Mac, and Linux. I downloaded and installed it for Windows. Once installed, the Docker (whale) icon is shown in the system tray as shown below. When you click on the icon, it will open the Docker Desktop dashboard as shown below. You can see the list of containers, images, volumes, builds, and extensions. The figure below shows the list of containers I have created on my local machine. Step 2: Creating a .NET Application Create a .NET application using the tool of your choice, like Visual Studio, Visual Studio Code, or the .NET CLI. For example, you can use the following command directly from the command line.

PowerShell
dotnet new web -n MinimalApiDemo

Step 3: Setting Up Your Application With a Dockerfile Create a Dockerfile in the root folder of your .NET project to specify the Docker image for your application. Below is an example of a Dockerfile for the ASP.NET Core application that was created in the previous step.

Dockerfile
# Use the official ASP.NET Core runtime as a base image
FROM mcr.microsoft.com/dotnet/aspnet:8.0 AS base
WORKDIR /app
EXPOSE 8080

# Use the official SDK image to build the application
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY ["MinimalApiDemo.csproj", "./"]
RUN dotnet restore "MinimalApiDemo.csproj"
COPY . .
WORKDIR "/src/"
RUN dotnet build "MinimalApiDemo.csproj" -c Release -o /app/build

# Publish the application
FROM build AS publish
RUN dotnet publish "MinimalApiDemo.csproj" -c Release -o /app/publish

# Final image with only the published application
FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "MinimalApiDemo.dll"]
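A .dockerignore file (recommended later in this article) usually sits alongside this Dockerfile so that build output and local-only files never reach the build context. The entries below are a hypothetical sketch for a typical .NET project, not part of the original walkthrough; adjust them to your own repository:

# .dockerignore - keep build artifacts and local files out of the image build context
bin/
obj/
.git/
*.md
Dockerfile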
WORKDIR "/src/" RUN dotnet build "MinimalApiDemo.csproj" -c Release -o /app/build # Publish the application FROM build AS publish RUN dotnet publish "MinimalApiDemo.csproj" -c Release -o /app/publish # Final image with only the published application FROM base AS final WORKDIR /app COPY --from=publish /app/publish . ENTRYPOINT ["dotnet", "MinimalApiDemo.dll"] Step 4: Creating and Launching Your Docker Image Create a Docker image by executing the command from a terminal window (use lowercase letters). PowerShell docker build -t minimalapidemo . After finishing the construction process you are ready to start up your Docker image by running it inside a container. Run the below docker command to spin up a new container. PowerShell docker run -d -p 8080:8080 --name myminimalapidemo minimalapidemo Your API service is currently running within a Docker container and can be reached at this localhost as shown below. Refer to my previous article to see how I created products controllers using Minimal API's with different HTTP endpoints. Here Are Some Recommended Strategies for Dockerizing .NET Applications 1. Reduce Image Size Enhance the efficiency of your Docker images by utilizing stage builds eliminating dependencies and minimizing layers in your Docker file. 2. Utilize .dockerignore File Generate a .dockerignore file to exclude files and directories from being transferred into the Docker image thereby decreasing image size and enhancing build speed. 3. Ensure Container Security Adhere to security practices during the creation and operation of Docker containers including updating base images conducting vulnerability scans and restricting container privileges. 4. Employ Docker Compose for Multi Container Applications For applications with services or dependencies, leverage Docker Compose to define and manage multi-container applications simplifying both development and deployment processes. 5. Monitor and Troubleshoot Containers Monitor the performance and health of your Docker containers using Docker’s own monitoring tools or third-party solutions. Make use of tools such as Docker logs and debugging utilities to promptly resolve issues and boost the efficiency of your containers. Conclusion Docker containers offer an efficient platform for the development, packaging, and deployment of .NET applications. By containerizing these applications, developers can create development environments, simplify dependency management, and streamline deployment processes. Whether the focus is on microservices, web apps, or APIs, Docker provides a proficient method to operate .NET applications across various environments. By adhering to best practices and maximizing Docker’s capabilities, developers can fully leverage the benefits of containerization, thereby accelerating the process of constructing and deploying .NET applications
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC. Simplicity is a key selling point of cloud technology. Rather than worrying about racking and stacking equipment, configuring networks, and installing operating systems, developers can just click through a friendly web interface and quickly deploy an application. Of course, that friendly web interface hides serious complexity, and deploying an application is just the first and easiest step toward a performant and reliable system. Once an application grows beyond a single deployment, issues begin to creep in. New versions require database schema changes or added components, and multiple team members can change configurations. The application must also be scaled to serve more users, provide redundancy to ensure reliability, and manage backups to protect data. While it might be possible to manage this complexity using that friendly web interface, we need automated cloud orchestration to deliver consistently at speed. There are many choices for cloud orchestration, so which one is best for a particular application? Let's use a case study to consider two key decisions in the trade space: The number of different technologies we must learn and manage Our ability to migrate to a different cloud environment with minimal changes to the automation However, before we look at the case study, let's start by understanding some must-have features of any cloud automation. Cloud Orchestration Must-Haves Our goal with cloud orchestration automation is to manage the complexity of deploying and operating a cloud-native application. We want to be confident that we understand how our application is configured, that we can quickly restore an application after outages, and that we can manage changes over time with confidence in bug fixes and new capabilities while avoiding unscheduled downtime. Repeatability and Idempotence Cloud-native applications use many cloud resources, each with different configuration options. Problems with infrastructure or applications can leave resources in an unknown state. Even worse, our automation might fail due to network or configuration issues. We need to run our automation confidently, even when cloud resources are in an unknown state. This key property is called idempotence, which simplifies our workflow as we can run the automation no matter the current system state and be confident that successful completion places the system in the desired state. Idempotence is typically accomplished by having the automation check the current state of each resource, including its configuration parameters, and applying only necessary changes. This kind of smart resource application demands dedicated orchestration technology rather than simple scripting. Change Tracking and Control Automation needs to change over time as we respond to changes in application design or scaling needs. As needs change, we must manage automation changes as dueling versions will defeat the purpose of idempotence. This means we need Infrastructure as Code (IaC), where cloud orchestration automation is managed identically to other developed software, including change tracking and version management, typically in a Git repository such as this example. Change tracking helps us identify the source of issues sooner by knowing what changes have been made. 
For this reason, we should modify our cloud environments only by automation, never manually, so we can know that the repository matches the system state — and so we can ensure changes are reviewed, understood, and tested prior to deployment. Multiple Environment Support To test automation prior to production deployment, we need our tooling to support multiple environments. Ideally, we can support rapid creation and destruction of dynamic test environments because this increases confidence that there are no lingering required manual configurations and enables us to test our automation by using it. Even better, dynamic environments allow us to easily test changes to the deployed application, creating unique environments for developers, complex changes, or staging purposes prior to production. Cloud automation accomplishes multi-environment support through variables or parameters passed from a configuration file, environment variables, or on the command line. Managed Rollout Together, idempotent orchestration, a Git repository, and rapid deployment of dynamic environments bring the concept of dynamic environments to production, enabling managed rollouts for new application versions. There are multiple managed rollout techniques, including blue-green deployments and canary deployments. What they have in common is that a rollout consists of separately deploying the new version, transitioning users over to the new version either at once or incrementally, then removing the old version. Managed rollouts can eliminate application downtime when moving to new versions, and they enable rapid detection of problems coupled with automated fallback to a known working version. However, a managed rollout is complicated to implement as not all cloud resources support it natively, and changes to application architecture and design are typically required. Case Study: Implementing Cloud Automation Let's explore the key features of cloud automation in the context of a simple application. We'll deploy the same application using both a cloud-agnostic approach and a single-cloud approach to illustrate how both solutions provide the necessary features of cloud automation, but with differences in implementation and various advantages and disadvantages. Our simple application is based on Node, backed by a PostgreSQL database, and provides an interface to create, retrieve, update, and delete a list of to-do items. The full deployment solutions can be seen in this repository. Before we look at differences between the two deployments, it's worth considering what they have in common: Use a Git repository for change control of the IaC configuration Are designed for idempotent execution, so both have a simple "run the automation" workflow Allow for configuration parameters (e.g., cloud region data, unique names) that can be used to adapt the same automation to multiple environments Cloud-Agnostic Solution Our first deployment, as illustrated in Figure 1, uses Terraform (or OpenTofu) to deploy a Kubernetes cluster into a cloud environment. Terraform then deploys a Helm chart, with both the application and PostgreSQL database. Figure 1. Cloud-agnostic deployment automation The primary advantage of this approach, as seen in the figure, is that the same deployment architecture is used to deploy to both Amazon Web Services (AWS) and Microsoft Azure. The container images and Helm chart are identical in both cases, and the Terraform workflow and syntax are also identical. 
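To make that shared workflow concrete, the commands a team member runs are the same regardless of the target cloud; only the input values change. This is a sketch using standard Terraform commands; the variable-file names are illustrative and not taken from the case study repository:

# Initialize providers and modules, then preview and apply the changes idempotently.
terraform init
terraform plan -var-file=aws-dev.tfvars
terraform apply -var-file=aws-dev.tfvars

# Pointing the same automation at another environment changes only the inputs.
terraform plan -var-file=azure-staging.tfvars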
Additionally, we can test container images, Kubernetes deployments, and Helm charts separately from the Terraform configuration that creates the Kubernetes environment, making it easy to reuse much of this automation to test changes to our application. Finally, with Terraform and Kubernetes, we're working at a high level of abstraction, so our automation code is short but can still take advantage of the reliability and scalability capabilities built into Kubernetes. For example, an entire Azure Kubernetes Service (AKS) cluster is created in about 50 lines of Terraform configuration via the azurerm_kubernetes_cluster resource:

Shell
resource "azurerm_kubernetes_cluster" "k8s" {
  location = azurerm_resource_group.rg.location
  name     = random_pet.azurerm_kubernetes_cluster_name.id
  ...
  default_node_pool {
    name       = "agentpool"
    vm_size    = "Standard_D2_v2"
    node_count = var.node_count
  }
  ...
  network_profile {
    network_plugin    = "kubenet"
    load_balancer_sku = "standard"
  }
}

Even better, the Helm chart deployment is just five lines and is identical for AWS and Azure:

Shell
resource "helm_release" "todo" {
  name       = "todo"
  repository = "https://book-of-kubernetes.github.io/helm/"
  chart      = "todo"
}

However, a cloud-agnostic approach brings additional complexity. First, we must create and maintain configuration using multiple tools, requiring us to understand Terraform syntax, Kubernetes manifest YAML files, and Helm templates. Also, while the overall Terraform workflow is the same, the cloud provider configuration is different due to differences in Kubernetes cluster configuration and authentication. This means that adding a third cloud provider would require significant effort. Finally, if we wanted to use additional features such as cloud-native databases, we'd first need to understand the key configuration details of that cloud provider's database, then understand how to apply that configuration using Terraform. This means that we pay an additional price in complexity for each native cloud capability we use. Single Cloud Solution Our second deployment, illustrated in Figure 2, uses AWS CloudFormation to deploy an Elastic Compute Cloud (EC2) virtual machine and a Relational Database Service (RDS) cluster: Figure 2. Single cloud deployment automation The biggest advantage of this approach is that we create a complete application deployment solution entirely in CloudFormation's YAML syntax. By using CloudFormation, we are working directly with AWS cloud resources, so there's a clear correspondence between resources in the AWS web console and our automation. As a result, we can take advantage of the specific cloud resources that are best suited for our application, such as RDS for our PostgreSQL database. This use of the best resources for our application can help us manage our application's scalability and reliability needs while also managing our cloud spend. The tradeoff in exchange for this simplicity and clarity is a more verbose configuration. We're working at the level of specific cloud resources, so we have to specify each resource, including items such as routing tables and subnets that Terraform configures automatically.
The resulting CloudFormation YAML is 275 lines and includes low-level details such as egress routing from our VPC to the internet:

Shell
TodoInternetRoute:
  Type: AWS::EC2::Route
  Properties:
    DestinationCidrBlock: 0.0.0.0/0
    GatewayId: !Ref TodoInternetGateway
    RouteTableId: !Ref TodoRouteTable

Also, of course, the resources and configuration are AWS-specific, so if we wanted to adapt this automation to a different cloud environment, we would need to rewrite it from the ground up. Finally, while we can easily adapt this automation to create multiple deployments on AWS, it is not as flexible for testing changes to the application as we have to deploy a full RDS cluster for each new instance. Conclusion Our case study enabled us to exhibit key features and tradeoffs for cloud orchestration automation. There are many more than just these two options, but whatever solution is chosen should use an IaC repository for change control and a tool for idempotence and support for multiple environments. Within that cloud orchestration space, our deployment architecture and our tool selection will be driven by the importance of portability to new cloud environments compared to the cost in additional complexity. This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC. Read the Free Report
Are you ready to get started with cloud-native observability with telemetry pipelines? This article is part of a series exploring a workshop guiding you through the open source project Fluent Bit, what it is, a basic installation, and setting up the first telemetry pipeline project. Learn how to manage your cloud-native data from source to destination using the telemetry pipeline phases covering collection, aggregation, transformation, and forwarding from any source to any destination. In the previous article in this series, we explored what backpressure was, how it manifests in telemetry pipelines, and took the first steps to mitigate this with Fluent Bit. In this article, we look at how to enable Fluent Bit features that will help with avoiding telemetry data loss as we saw in the previous article. You can find more details in the accompanying workshop lab. Before we get started it's important to review the phases of a telemetry pipeline. In the diagram below we see them laid out again. Each incoming event goes from input to parser to filter to buffer to routing before they are sent to its final output destination(s). For clarity in this article, we'll split up the configuration into files that are imported into a main fluent bit configuration file we'll name workshop-fb.conf. Tackling Data Loss Previously, we explored how input plugins can hit their ingestion limits when our telemetry pipelines scale beyond memory limits when using default in-memory buffering of our events. We also saw that we can limit the size of our input plugin buffers to prevent our pipeline from failing on out-of-memory errors, but that the pausing of the ingestion can also lead to data loss if the clearing of the input buffers takes too long. To rectify this problem, we'll explore another buffering solution that Fluent Bit offers, ensuring data and memory safety at scale by configuring filesystem buffering. To that end, let's explore how the Fluent Bit engine processes data that input plugins emit. When an input plugin emits events, the engine groups them into a Chunk. The chunk size is around 2MB. The default is for the engine to place this Chunk only in memory. We saw that limiting in-memory buffer size did not solve the problem, so we are looking at modifying this default behavior of only placing chunks into memory. This is done by changing the property storage.type from the default Memory to Filesystem. It's important to understand that memory and filesystem buffering mechanisms are not mutually exclusive. By enabling filesystem buffering for our input plugin we automatically get performance and data safety Filesystem Buffering Tips When changing our buffering from memory to filesystem with the property storage.type filesystem, the settings for mem_buf_limit are ignored. Instead, we need to use the property storage.max_chunks_up for controlling the size of our memory buffer. Shockingly, when using the default settings the property storage.pause_on_chunks_overlimit is set to off, causing the input plugins not to pause. Instead, input plugins will switch to buffering only in the filesystem. We can control the amount of disk space used with storage.total_limit_size. If the property storage.pause_on_chunks_overlimit is set to on, then the buffering mechanism to the filesystem behaves just like our mem_buf_limit scenario demonstrated previously. Configuring Stressed Telemetry Pipeline In this example, we are going to use the same stressed Fluent Bit pipeline to simulate a need for enabling filesystem buffering. 
All examples are going to be shown using containers (Podman) and it's assumed you are familiar with container tooling such as Podman or Docker. We begin the configuration of our telemetry pipeline in the INPUT phase with a simple dummy plugin generating a large number of entries to flood our pipeline with as follows in our configuration file inputs.conf (note that the mem_buf_limit fix is commented out):

# This entry generates a large amount of success messages for the workshop.
[INPUT]
    Name dummy
    Tag big.data
    Copies 15000
    Dummy {"message":"true 200 success", "big_data": "blah blah blah blah blah blah blah blah blah"}
    #Mem_Buf_Limit 2MB

Now ensure the output configuration file outputs.conf has the following configuration:

# This entry directs all tags (it matches any we encounter)
# to print to standard output, which is our console.
[OUTPUT]
    Name stdout
    Match *

With our inputs and outputs configured, we can now bring them together in a single main configuration file. Using a file called workshop-fb.conf in our favorite editor, ensure the following configuration is created. For now, just import two files:

# Fluent Bit main configuration file.
#
# Imports section.
@INCLUDE inputs.conf
@INCLUDE outputs.conf

Let's now try testing our configuration by running it using a container image. The first thing that is needed is to ensure a file called Buildfile is created. This is going to be used to build a new container image and insert our configuration files. Note this file needs to be in the same directory as our configuration files, otherwise adjust the file path names:

FROM cr.fluentbit.io/fluent/fluent-bit:3.0.4

COPY ./workshop-fb.conf /fluent-bit/etc/fluent-bit.conf
COPY ./inputs.conf /fluent-bit/etc/inputs.conf
COPY ./outputs.conf /fluent-bit/etc/outputs.conf

Now we'll build a new container image, naming it with a version tag as follows using the Buildfile and assuming you are in the same directory:

$ podman build -t workshop-fb:v8 -f Buildfile

STEP 1/4: FROM cr.fluentbit.io/fluent/fluent-bit:3.0.4
STEP 2/4: COPY ./workshop-fb.conf /fluent-bit/etc/fluent-bit.conf
--> a379e7611210
STEP 3/4: COPY ./inputs.conf /fluent-bit/etc/inputs.conf
--> f39b10d3d6d0
STEP 4/4: COPY ./outputs.conf /fluent-bit/etc/outputs.conf
COMMIT workshop-fb:v6
--> e74b2f228729
Successfully tagged localhost/workshop-fb:v8
e74b2f22872958a79c0e056efce66a811c93f43da641a2efaa30cacceb94a195

If we run our pipeline in a container configured with constricted memory, in our case, we need to give it around a 6.5MB limit, then we'll see the pipeline run for a bit and then fail due to overloading (OOM):

$ podman run --memory 6.5MB --name fbv8 workshop-fb:v8

The console output shows that the pipeline ran for a bit; in our case, below to event number 862 before it hit the OOM limits of our container environment (6.5MB):

...
[860] big.data: [[1716551898.202389716, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[861] big.data: [[1716551898.202389925, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[862] big.data: [[1716551898.202390133, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}]
[863] big.data: [[1

<<<< CONTAINER KILLED WITH OOM HERE

We can validate that the stressed telemetry pipeline actually failed on an OOM status by viewing our container, and inspecting it for an OOM failure to validate that our backpressure worked:

# Use the container name to inspect for the reason it failed
$ podman inspect fbv8 | grep OOM
    "OOMKilled": true,

Already having tried in a previous lab to manage this with mem_buf_limit settings, we've seen that this also is not the real fix. To prevent data loss, we need to enable filesystem buffering so that overloading the memory buffer means that events will be buffered in the filesystem until there is memory free to process them. Using Filesystem Buffering The configuration of our telemetry pipeline in the INPUT phase needs a slight adjustment: we add the property storage.type, set to filesystem, to enable filesystem buffering. Note that mem_buf_limit has been removed:

# This entry generates a large amount of success messages for the workshop.
[INPUT]
    Name dummy
    Tag big.data
    Copies 15000
    Dummy {"message":"true 200 success", "big_data": "blah blah blah blah blah blah blah blah blah"}
    storage.type filesystem

We can now bring it all together in the main configuration file. Using the file workshop-fb.conf in our favorite editor, update it so that a SERVICE section is added with settings for managing the filesystem buffering:

# Fluent Bit main configuration file.
[SERVICE]
    flush 1
    log_Level info
    storage.path /tmp/fluentbit-storage
    storage.sync normal
    storage.checksum off
    storage.max_chunks_up 5

# Imports section
@INCLUDE inputs.conf
@INCLUDE outputs.conf

A few words on the SERVICE section properties might be needed to explain their function:

storage.path - Putting filesystem buffering in the tmp filesystem
storage.sync - Using normal and turning off checksum processing
storage.max_chunks_up - Set to ~10MB, the amount of allowed memory for events

Now it's time for testing our configuration by running it using a container image. The first thing that is needed is to ensure a file called Buildfile is created. This is going to be used to build a new container image and insert our configuration files.
Note this file needs to be in the same directory as our configuration files, otherwise adjust the file path names:

FROM cr.fluentbit.io/fluent/fluent-bit:3.0.4

COPY ./workshop-fb.conf /fluent-bit/etc/fluent-bit.conf
COPY ./inputs.conf /fluent-bit/etc/inputs.conf
COPY ./outputs.conf /fluent-bit/etc/outputs.conf

Now we'll build a new container image, naming it with a version tag, as follows using the Buildfile and assuming you are in the same directory:

$ podman build -t workshop-fb:v9 -f Buildfile

STEP 1/4: FROM cr.fluentbit.io/fluent/fluent-bit:3.0.4
STEP 2/4: COPY ./workshop-fb.conf /fluent-bit/etc/fluent-bit.conf
--> a379e7611210
STEP 3/4: COPY ./inputs.conf /fluent-bit/etc/inputs.conf
--> f39b10d3d6d0
STEP 4/4: COPY ./outputs.conf /fluent-bit/etc/outputs.conf
COMMIT workshop-fb:v6
--> e74b2f228729
Successfully tagged localhost/workshop-fb:v9
e74b2f22872958a79c0e056efce66a811c93f43da641a2efaa30cacceb94a195

If we run our pipeline in a container configured with constricted memory (slightly larger value due to memory needed for mounting the filesystem) - in our case, we need to give it around a 9MB limit - then we'll see the pipeline running without failure:

$ podman run -v ./:/tmp --memory 9MB --name fbv9 workshop-fb:v9

The console output shows that the pipeline runs until we stop it with CTRL-C, with events rolling by as shown below.

...
[14991] big.data: [[1716559655.213181639, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah"}]
[14992] big.data: [[1716559655.213182181, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah"}]
[14993] big.data: [[1716559655.213182681, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah"}]
...

We can now validate the filesystem buffering by looking at the filesystem storage. Check the filesystem from the directory where you started your container. While the pipeline is running with memory restrictions, it will be using the filesystem to store events until the memory is free to process them. If you view the contents of the file before stopping your pipeline, you'll see a messy message format stored inside (cleaned up for you here):

$ ls -l ./fluentbit-storage/dummy.0/1-1716558042.211576161.flb
-rw------- 1 username groupname 1.4M May 24 15:40 1-1716558042.211576161.flb

$ cat fluentbit-storage/dummy.0/1-1716558042.211576161.flb
??wbig.data???fP?? ?????message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP?? ?p???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP?? ߲???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP?? ?F???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP?? ?d???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP??
...

Last Thoughts on Filesystem Buffering This solution is the way to deal with backpressure and other issues that might flood your telemetry pipeline and cause it to crash. It's worth noting that using a filesystem to buffer the events also introduces the limits of the filesystem being used. It's important to understand that just as memory can run out, so too can the filesystem storage reach its limits. It's best to have a plan to address any possible filesystem challenges when using this solution, but this is outside the scope of this article. This completes our use cases for this article. Be sure to explore this hands-on experience with the accompanying workshop lab. What's Next?
This article walked us through how Fluent Bit filesystem buffering provides a data- and memory-safe solution to the problems of backpressure and data loss. Stay tuned for more hands-on material to help you with your cloud-native observability journey.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC. When it comes to software engineering and application development, cloud native has become commonplace in many teams' vernacular. When people survey the world of cloud native, they often come away with the perspective that the entire process of cloud native is only for large enterprise applications. A few years ago, that may have been the case, but with the advancement of tooling and services surrounding systems such as Kubernetes, the barrier to entry has been substantially lowered. Even so, does adopting cloud-native practices for applications consisting of a few microservices make a difference? Just as cloud native has become commonplace, the shift-left movement has made inroads into many organizations' processes. Shifting left is a focus on application delivery from the outset of a project, where software engineers are just as focused on the delivery process as they are on writing application code. Shifting left implies that software engineers understand deployment patterns and technologies as well as implement them earlier in the SDLC. Shifting left using cloud native with microservices development may sound like a definition containing a string of contemporary buzzwords, but there's real benefit to be gained in combining these closely related topics. Fostering a Deployment-First Culture Process is necessary within any organization. Processes are broken down into manageable tasks across multiple teams with the objective being an efficient path by which an organization sets out to reach a goal. Unfortunately, organizations can get lost in their processes. Teams and individuals focus on doing their tasks as best as possible, and at times, so much so that the goal for which the process is defined gets lost. Software development lifecycle (SDLC) processes are not immune to this problem. Teams and individuals focus on doing their tasks as best as possible. However, in any given organization, if individuals on application development teams are asked how they perceive their objectives, responses can include: "Completing stories," "Staying up to date on recent tech stack updates," "Ensuring their components meet security standards," and "Writing thorough tests." Most of the answers provided would demonstrate a commitment to the process, which is good. However, what is the goal? The goal of the SDLC is to build software and deploy it. Whether it be an internal or SaaS application, deploying software helps an organization meet an objective. When presented with the statement that the goal of the SDLC is to deliver and deploy software, just about anyone who participates in the process would say, "Well, of course it is." Teams often lose sight of this "obvious" directive because they're far removed from the actual deployment process. A strategic investment in the process can close that gap. Cloud-native abstractions bring a common domain and dialogue across disciplines within the SDLC. Kubernetes is a good basis upon which cloud-native abstractions can be leveraged. Not only does Kubernetes' usefulness span applications of many shapes and sizes, but when it comes to the SDLC, Kubernetes can also be the environment used on systems ranging from local engineering workstations, through the entire delivery cycle, and on to production.
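One low-friction way to give every engineer that workstation-level cluster is a lightweight Kubernetes distribution such as kind, which runs cluster nodes as containers. This is a hedged sketch rather than a recommendation from the article itself; minikube, k3d, and similar tools fill the same role:

# Create a small, disposable local cluster and point kubectl at it.
kind create cluster --name dev-workstation
kubectl cluster-info --context kind-dev-workstation

# Tear it down when finished; recreating it is cheap and repeatable.
kind delete cluster --name dev-workstation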
Bringing the deployment platform all the way "left" to an engineer's workstation has everyone in the process speaking the same language, and deployment becomes a focus from the beginning of the process. Various teams in the SDLC may look at "Kubernetes Everywhere" with skepticism. Work done on Kubernetes in reducing its footprint for systems such as edge devices has made running Kubernetes on a workstation very manageable. Introducing teams to Kubernetes through automation allows them to iteratively absorb the platform. The most important thing is building a deployment-first culture. Plan for Your Deployment Artifacts With all teams and individuals focused on the goal of getting their applications to production as efficiently and effectively as possible, how does the evolution of application development shift? The shift is subtle. With a shift-left mindset, there aren't necessarily a lot of new tasks, so the shift is where the tasks take place within the overall process. When a detailed discussion of application deployment begins with the first line of code, existing processes may need to be updated. Build Process If software engineers are to deploy to their personal Kubernetes clusters, are they able to build and deploy enough of an application that they're not reliant on code running on a system beyond their workstation? And there is more to consider than just application code. Is a database required? Does the application use a caching system? It can be challenging to review an existing build process and refactor it for workstation use. The CI/CD build process may need to be re-examined to consider how it can be invoked on a workstation. For most applications, refactoring the build process can be accomplished in such a way that the goal of local build and deployment is met while also using the refactored process in the existing CI/CD pipeline. For new projects, begin by designing the build process for the workstation. The build process can then be added to a CI/CD pipeline. The local build and CI/CD build processes should strive to share as much code as possible. This will keep the entire team up to date on how the application is built and deployed. Build Artifacts The primary deliverables for a build process are the build artifacts. For cloud-native applications, this includes container images (e.g., Docker images) and deployment packages (e.g., Helm charts). When an engineer is executing the build process on their workstation, the artifacts will likely need to be published to a repository, such as a container registry or chart repository. The build process must be aware of context. Existing processes may already be aware of their context with various settings for environments ranging from test and staging to production. Workstation builds become an additional context. Given the awareness of context, build processes can publish artifacts to workstation-specific registries and repositories. For cloud-native development, and in keeping with the local workstation paradigm, container registries and chart repositories are deployed as part of the workstation Kubernetes cluster. As the process moves from build to deploy, maintaining build context includes accessing resources within the current context. Parameterization Central to this entire process is that key components of the build and deployment process definition cannot be duplicated based on a runtime environment. 
For example, suppose a container image is built and published one way on the local workstation and another way in the CI/CD pipeline. How long will it be before they diverge? Most likely, they diverge sooner than expected. Divergence in a build process will create a divergence across environments, which leads to divergence in teams and results in the eroding of the deployment-first culture. That may sound a bit dramatic, but as soon as any code forks, without a deliberate plan to merge the forks, the code eventually becomes, for all intents and purposes, unmergeable. Parameterizing the build and deployment process is required to maintain a single set of build and deployment components. Parameters define build context such as the registries and repositories to use. Parameters define deployment context as well, such as the number of pod replicas to deploy or resource constraints. As the process is created, lean toward over-parameterization. It's easier to maintain a parameter as a constant rather than extract a parameter from an existing process. Figure 1. Local development cluster Cloud-Native Microservices Development in Action In addition to the deployment-first culture, cloud-native microservices development requires tooling support that doesn't impede the day-to-day tasks performed by an engineer. If engineers can be shown a new pattern for development that allows them to be more productive with only a minimum-to-moderate level of understanding of new concepts, while still using their favorite tools, the engineers will embrace the paradigm. While engineers may push back or be skeptical about a new process, once the impact on their productivity is tangible, they will be energized to adopt the new pattern. Easing Development Teams Into the Process Changing culture is about getting teams on board with adopting a new way of doing something. The next step is execution. Shifting left requires that software engineers move from designing and writing application code to becoming an integral part of the design and implementation of the entire build and deployment process. This means learning new tools and exploring areas in which they may not have a great deal of experience. Human nature tends to resist change. Software engineers may look at this entire process and think, "How can I absorb this new process and these new tools while trying to maintain a schedule?" It's a valid question. However, software engineers are typically fine with incorporating a new development tool or process that helps them and the team without drastically disrupting their daily routine. Whether beginning a new project or refactoring an existing one, adoption of a shift-left engineering process requires introducing new tools in a way that allows software engineers to remain productive while iteratively learning the new tooling. This starts with automating and documenting the build-out of their new development environment: their local Kubernetes cluster. It also requires listening to the team's concerns and suggestions, as this will be their daily environment. Dev(elopment) Containers The Development Containers specification is a relatively new advancement based on an existing concept in supporting development environments. Many engineering teams have leveraged virtual desktop infrastructure (VDI) systems, where a developer's workstation is hosted on a virtualized infrastructure.
Companies that implement VDI environments like the centralized control of environments, and software engineers like the idea of a pre-packaged environment that contains all the components required to develop, debug, and build an application. What software engineers do not like about VDI environments is network issues where their IDEs become sluggish and frustrating to use. Development containers leverage the same concept as VDI environments but bring it to a local workstation, allowing engineers to use their locally installed IDE while being remotely connected to a running container. This way, the engineer has the experience of local development while connected to a running container. Development containers do require an IDE that supports the pattern. What makes the use of development containers so attractive is that engineers can attach to a container running within a Kubernetes cluster and access services as configured for an actual deployment. In addition, development containers support a first-class development experience, including all the tools a developer would expect to be available in a development environment. From a broader perspective, development containers aren't limited to local deployments. When configured for access, cloud environments can provide the same first-class development experience. Here, the deployment abstraction provided by containerized orchestration layers really shines. Figure 2. Microservice development container configured with dev containers The Synergistic Evolution of Cloud-Native Development Continues There's a synergy across shift-left, cloud-native, and microservices development. They present a pattern for application development that can be adopted by teams of any size. Tooling continues to evolve, making practical use of the technologies involved in cloud-native environments accessible to all involved in the application delivery process. It is a culture change that entails a change in mindset while learning new processes and technologies. It's important that teams aren't burdened with a collection of manual processes where they feel their productivity is being lost. Automation helps ease teams into the adoption of the pattern and technologies. As with any other organizational change, upfront planning and preparation is important. Just as important is involving the teams in the plan. When individuals have a say in change, ownership and adoption become a natural outcome. This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.Read the Free Report
Debugging application issues in a Kubernetes cluster can often feel like navigating a labyrinth. Containers are ephemeral by design and intended to be immutable once deployed. This presents a unique challenge when something goes wrong and we need to dig into the issue. Before diving into the debugging tools and techniques, it's essential to grasp the core problem: why modifying container instances directly is a bad idea. This blog post will walk you through the intricacies of Kubernetes debugging, offering insights and practical tips to effectively troubleshoot your Kubernetes environment. The Problem With Kubernetes Video The Immutable Nature of Containers One of the fundamental principles of Kubernetes is the immutability of container instances. This means that once a container is running, it shouldn't be altered. Modifying containers on the fly can lead to inconsistencies and unpredictable behavior, especially as Kubernetes orchestrates the lifecycle of these containers, replacing them as needed. Imagine trying to diagnose an issue only to realize that the container you’re investigating has been modified, making it difficult to reproduce the problem consistently. The idea behind this immutability is to ensure that every instance of a container is identical to any other instance. This consistency is crucial for achieving reliable, scalable applications. If you start modifying containers, you undermine this consistency, leading to a situation where one container behaves differently from another, even though they are supposed to be identical. The Limitations of kubectl exec We often start our journey in Kubernetes with commands such as:

$ kubectl exec -ti <pod-name> -- /bin/bash

This logs into a container and feels like accessing a traditional server with SSH. However, this approach has significant limitations. Containers often lack basic diagnostic tools: no vim, no traceroute, sometimes not even a shell. This can be a rude awakening for those accustomed to a full-featured Linux environment. Additionally, if a container crashes, kubectl exec becomes useless as there's no running instance to connect to. This tool is insufficient for thorough debugging, especially in production environments. Consider the frustration of logging into a container only to find out that you can't even open a simple text editor to check configuration files. This lack of basic tools means that you are often left with very few options for diagnosing problems. Moreover, the minimalistic nature of many container images, designed to reduce their attack surface and footprint, exacerbates this issue. Avoiding Direct Modifications While it might be tempting to install missing tools on the fly using commands like apt-get install vim, this practice violates the principle of container immutability. In production, installing packages dynamically can introduce new dependencies, potentially causing application failures. The risks are high, and it's crucial to maintain the integrity of your deployment manifests, ensuring that all configurations are predefined and reproducible. Imagine a scenario where a quick fix in production involves installing a missing package. This might solve the immediate problem but could lead to unforeseen consequences. Dependencies introduced by the new package might conflict with existing ones, leading to application instability. Moreover, this approach makes it challenging to reproduce the exact environment, which is vital for debugging and scaling your application.
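To make the risk concrete, this is the kind of ad hoc change the immutability principle warns against, shown here only as an anti-pattern (the pod name is a placeholder, and a Debian-based image is assumed):

# Tempting in the moment, but the change lives only in this one container instance.
kubectl exec -ti <pod-name> -- apt-get update
kubectl exec -ti <pod-name> -- apt-get install -y vim

# As soon as Kubernetes replaces the pod, the tool is gone, and until then the running
# container no longer matches the image referenced in the deployment manifests.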
Enter Ephemeral Containers The solution to the aforementioned problems lies in ephemeral containers. Kubernetes allows the creation of these temporary containers within the same pod as the application container you need to debug. These ephemeral containers are isolated from the main application, ensuring that any modifications or tools installed do not impact the running application. Ephemeral containers provide a way to bypass the limitations of kubectl exec without violating the principles of immutability and consistency. By launching a separate container within the same pod, you can inspect and diagnose the application container without altering its state. This approach preserves the integrity of the production environment while giving you the tools you need to debug effectively. Using kubectl debug The kubectl debug command is a powerful tool that simplifies the creation of ephemeral containers. Unlike kubectl exec, which logs into the existing container, kubectl debug creates a new container alongside the application container, sharing its namespaces. This container can run a different OS, mount the application container's filesystem, and provide all necessary debugging tools without altering the application's state. This method ensures you can inspect and diagnose issues even if the original container is not operational. For example, let's consider a scenario where we're debugging a pod using an ephemeral Ubuntu container: kubectl debug -it <pod-name> --image=ubuntu --share-processes --copy-to=<pod-name>-debug This command creates a debuggable copy of the pod with an additional Ubuntu-based container and a shared process namespace, providing a full-fledged environment to diagnose the application container. Even if the original container lacks a shell or crashes, the debug container remains operational, allowing you to perform necessary checks and install tools as needed. It relies on the fact that we can have multiple containers in the same pod; that way we can inspect the filesystem of the debugged container without physically entering that container. Practical Application of Ephemeral Containers To illustrate, let's delve deeper into how ephemeral containers can be used in real-world scenarios. Suppose you have a container that consistently crashes due to a mysterious issue. By deploying an ephemeral container with a comprehensive set of debugging tools, you can monitor the logs, inspect the filesystem, and trace processes without worrying about the constraints of the original container environment. For instance, you might encounter a situation where an application container crashes due to an unhandled exception. By using kubectl debug, you can create an ephemeral container that shares the same network namespace as the original container. This allows you to capture network traffic and analyze it to understand whether there are any issues related to connectivity or data corruption. Security Considerations While ephemeral containers reduce the risk of impacting the production environment, they still pose security risks. It's critical to restrict access to debugging tools and ensure that only authorized personnel can deploy ephemeral containers. Treat access to these systems with the same caution as handing over the keys to your infrastructure. Ephemeral containers, by their nature, can access sensitive information within the pod. Therefore, it is essential to enforce strict access controls and audit logs to track who is deploying these containers and what actions are being taken.
This ensures that the debugging process does not introduce new vulnerabilities or expose sensitive data. Interlude: The Role of Observability While tools like kubectl exec and kubectl debug are invaluable for troubleshooting, they are not replacements for comprehensive observability solutions. Observability allows you to monitor, trace, and log the behavior of your applications in real time, providing deeper insights into issues without the need for intrusive debugging sessions. These tools aren't meant for everyday debugging: that role should be occupied by various observability tools. I will discuss observability in more detail in an upcoming post. Command Line Debugging While tools like kubectl exec and kubectl debug are invaluable, there are times when you need to dive deep into the application code itself. This is where we can use command line debuggers. Command line debuggers allow you to inspect the state of your application at a very granular level, stepping through code, setting breakpoints, and examining variable states. Personally, I don't use them much. For instance, Java developers can use jdb, the Java Debugger, which is analogous to gdb for C/C++ programs. Here’s a basic rundown of how you might use jdb in a Kubernetes environment: 1. Set Up Debugging First, you need to start your Java application with debugging enabled. This typically involves adding a debug flag to your Java command. However, as discussed in my post here, there's an even more powerful way that doesn't require a restart: java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005 -jar myapp.jar 2. Port Forwarding Since the debugger needs to connect to the application, you’ll set up port forwarding to expose the debug port of your pod to your local machine. This is important as JDWP is dangerous: kubectl port-forward <pod-name> 5005:5005 3. Connecting the Debugger With port forwarding in place, you can now connect jdb to the remote application: jdb -attach localhost:5005 From here, you can use jdb commands to set breakpoints, step through code, and inspect variables. This process allows you to debug issues within the code itself, which can be invaluable for diagnosing complex problems that aren’t immediately apparent through logs or superficial inspection. Connecting a Standard IDE for Remote Debugging I prefer IDE debugging by far. I never used JDB for anything other than a demo. Modern IDEs support remote debugging, and by leveraging Kubernetes port forwarding, you can connect your IDE directly to a running application inside a pod. To set up remote debugging we start with the same steps as the command line debugging. Configuring the application and setting up the port forwarding. 1. Configure the IDE In your IDE (e.g., IntelliJ IDEA, Eclipse), set up a remote debugging configuration. Specify the host as localhost and the port as 5005. 2. Start Debugging Launch the remote debugging session in your IDE. You can now set breakpoints, step through code, and inspect variables directly within the IDE, just as if you were debugging a local application. Conclusion Debugging Kubernetes environments requires a blend of traditional techniques and modern tools designed for container orchestration. Understanding the limitations of kubectl exec and the benefits of ephemeral containers can significantly enhance your troubleshooting process. 
However, the ultimate goal should be to build robust observability into your applications, reducing the need for ad-hoc debugging and enabling proactive issue detection and resolution. By following these guidelines and leveraging the right tools, you can navigate the complexities of Kubernetes debugging with confidence and precision. In the next installment of this series, we’ll delve into common configuration issues in Kubernetes and how to address them effectively.
In this article, I want to discuss test containers and Golang, how to integrate them into a project, and why it is necessary. Testcontainers Review Testcontainers is a tool that enables developers to utilize Docker containers during testing, providing isolation and maintaining an environment that closely resembles production. Why do we need to use it? Some points: Importance of Writing Tests Ensures code quality by identifying and preventing errors. Facilitates safer code refactoring. Acts as documentation for code functionality. Introduction to Testcontainers Library for managing Docker containers within tests. Particularly useful when applications interact with external services. Simplifies the creation of isolated testing environments. Support Testcontainers-go in Golang Port of the Testcontainers library for Golang. Enables the creation and management of Docker containers directly from tests. Streamlines integration testing by providing isolated and reproducible environments. Ensures test isolation, preventing external factors from influencing results. Simplifies setup and teardown of containers for testing. Supports various container types, including databases, caches, and message brokers. Integration Testing Offers isolated environments for integration testing. Convenient methods for starting, stopping, and obtaining container information. Facilitates seamless integration of Docker containers into the Golang testing process. So, the key point to highlight is that we don't preconfigure the environment outside of the code; instead, we create an isolated environment from the code. This allows us to achieve isolation for both individual and all tests simultaneously. For example, we can set up a single MongoDB for all tests and work with it within integration tests. However, if we need to add Redis for a specific test, we can do so through the code. Let’s explore its application through an example of a portfolio management service developed in Go. Service Description The service is a REST API designed for portfolio management. It utilizes MongoDB for data storage and Redis for caching queries. This ensures fast data access and reduces the load on the primary storage. Technologies Go: The programming language used to develop the service. MongoDB: Document-oriented database employed for storing portfolio data. Docker and Docker Compose: Used for containerization and local deployment of the service and its dependencies. Testcontainers-go: Library for integration testing using Docker containers in Go tests. Testing Using Testcontainers Test containers allow integration testing of the service under conditions closely resembling a real environment, using Docker containers for dependencies. 
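As mentioned above, an extra dependency such as Redis can be brought up for a single test directly from code. A minimal sketch of what that might look like with testcontainers-go (the image tag and port are illustrative, and the test body is left as a stub):

```go
package main_test

import (
	"context"
	"testing"

	"github.com/testcontainers/testcontainers-go"
	"github.com/testcontainers/testcontainers-go/wait"
)

// Starts a throwaway Redis container for a single test and exposes its address.
func TestWithRedis(t *testing.T) {
	ctx := context.Background()

	redisC, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
		ContainerRequest: testcontainers.ContainerRequest{
			Image:        "redis:7",
			ExposedPorts: []string{"6379/tcp"},
			WaitingFor:   wait.ForListeningPort("6379/tcp"),
		},
		Started: true,
	})
	if err != nil {
		t.Fatalf("failed to start redis container: %s", err)
	}
	defer redisC.Terminate(ctx)

	host, err := redisC.Host(ctx)
	if err != nil {
		t.Fatal(err)
	}
	port, err := redisC.MappedPort(ctx, "6379")
	if err != nil {
		t.Fatal(err)
	}

	// host and port can now be passed to whatever cache client this test exercises.
	_ = host
	_ = port
}
```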
Let’s provide an example of a function to launch a MongoDB container in tests: Go func RunMongo(ctx context.Context, t *testing.T, cfg config.Config) testcontainers.Container { mongodbContainer, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{ ContainerRequest: testcontainers.ContainerRequest{ Image: mongoImage, ExposedPorts: []string{listener}, WaitingFor: wait.ForListeningPort(mongoPort), Env: map[string]string{ "MONGO_INITDB_ROOT_USERNAME": cfg.Database.Username, "MONGO_INITDB_ROOT_PASSWORD": cfg.Database.Password, }, }, Started: true, }) if err != nil { t.Fatalf("failed to start container: %s", err) } return mongodbContainer } And a part of the example: Go package main_test import ( "context" "testing" "github.com/testcontainers/testcontainers-go" "github.com/testcontainers/testcontainers-go/wait" ) func TestMongoIntegration(t *testing.T) { ctx := context.Background() // Replace cfg with your actual configuration cfg := config.Config{ Database: struct { Username string Password string Collection string }{ Username: "root", Password: "example", Collection: "test_collection", }, } // Launching the MongoDB container mongoContainer := RunMongo(ctx, t, cfg) defer mongoContainer.Terminate(ctx) // Here you can add code for initializing MongoDB, for example, creating a client to interact with the database // Here you can run tests using the started MongoDB container // ... // Example test that checks if MongoDB is available if err := checkMongoAvailability(mongoContainer, t); err != nil { t.Fatalf("MongoDB is not available: %s", err) } // Here you can add other tests in your scenario // ... } // Function to check the availability of MongoDB func checkMongoAvailability(container testcontainers.Container, t *testing.T) error { host, err := container.Host(ctx) if err != nil { return err } port, err := container.MappedPort(ctx, "27017") if err != nil { return err } // Here you can use host and port to create a client and check the availability of MongoDB // For example, attempt to connect to MongoDB and execute a simple query return nil } How to run tests: go test ./… -v This test will use Testcontainers to launch a MongoDB container and then conduct integration tests using the started container. Replace `checkMongoAvailability` with the tests you need. Please ensure that you have the necessary dependencies installed before using this example, including the `testcontainers-go` library and other libraries used in your code. Now, it is necessary to relocate the operation of the MongoDB Testcontainer into the primary test method. This adjustment allows for the execution of the Testcontainer a single time. 
Go var mongoAddress string func TestMain(m *testing.M) { ctx := context.Background() cfg := CreateCfg(database, collectionName) mongodbContainer, err := RunMongo(ctx, cfg) if err != nil { log.Fatal(err) } defer func() { if err := mongodbContainer.Terminate(ctx); err != nil { log.Fatalf("failed to terminate container: %s", err) } }() mappedPort, err := mongodbContainer.MappedPort(ctx, "27017") mongoAddress = "mongodb://localhost:" + mappedPort.Port() os.Exit(m.Run()) } And now, our test should be: Go func TestFindByID(t *testing.T) { ctx := context.Background() cfg := CreateCfg(database, collectionName) cfg.Database.Address = mongoAddress client := GetClient(ctx, t, cfg) defer client.Disconnect(ctx) collection := client.Database(database).Collection(collectionName) testPortfolio := pm.Portfolio{ Name: "John Doe", Details: "Software Developer", } insertResult, err := collection.InsertOne(ctx, testPortfolio) if err != nil { t.Fatal(err) } savedObjectID, ok := insertResult.InsertedID.(primitive.ObjectID) if !ok { log.Fatal("InsertedID is not an ObjectID") } service, err := NewMongoPortfolioService(cfg) if err != nil { t.Fatal(err) } foundPortfolio, err := service.FindByID(ctx, savedObjectID.Hex()) if err != nil { t.Fatal(err) } assert.Equal(t, testPortfolio.Name, foundPortfolio.Name) assert.Equal(t, testPortfolio.Details, foundPortfolio.Details) } Ok, but Do We Already Have Everything Inside the Makefile? Let's figure it out—what advantages do test containers offer now? Long before, we used to write tests and describe the environment in a makefile, where scripts were used to set up the environment. Essentially, it was the same Docker compose and the same environment setup, but we did it in one place and for everyone at once. Does it make sense for us to migrate to test containers? Let's conduct a brief comparison between these two approaches. Isolation and Autonomy Testcontainers ensure the isolation of the testing environment during tests. Each test launches its container, guaranteeing that changes made by one test won’t affect others. Ease of Configuration and Management Testcontainers simplifies configuring and managing containers. You don’t need to write complex Makefile scripts for deploying databases; instead, you can use the straightforward Testcontainers API within your tests. Automation and Integration With Test Suites Utilizing Testcontainers enables the automation of container startup and shutdown within the testing process. This easily integrates into test scenarios and frameworks. Quick Test Environment Setup Launching containers through Testcontainers is swift, expediting the test environment preparation process. There’s no need to wait for containers to be ready, as is the case when using a Makefile. Enhanced Test Reliability Starting a container in a test brings the testing environment closer to reality. This reduces the likelihood of false positives and increases test reliability. In conclusion, incorporating Testcontainers into tests streamlines the testing process, making it more reliable and manageable. It also facilitates using a broader spectrum of technologies and data stores. Conclusion In conclusion, it's worth mentioning that delaying transitions from old approaches to newer and simpler ones is not advisable. Often, this leads to the accumulation of significant complexity and requires ongoing maintenance. Most of the time, our scripts set up an entire test environment right on our computers, but why? 
In the test environment, we have everything: Kafka, Redis, and Istio with Prometheus. Do we need all of this just to run a couple of integration tests against the database? Obviously not. The main idea of such tests is complete isolation from external factors, while keeping them as close as possible to the subject domain and its integrations. In practice, these tests fit well into CI/CD under a profile or stage named e2e, allowing them to run in isolation wherever Docker is available! Ultimately, if you have a less powerful laptop or prefer running everything in CI runners or on your company's resources, this approach is for you! Thank you for your time, and I wish you the best of luck. I hope the article proves helpful! Code DrSequence/testcontainer-contest Read More Testcontainers MongoDB Module MongoDB
Twenty years ago, software was eating the world. Then around a decade ago, containers started eating software, heralded by the arrival of open source OCI standards. Suddenly, developers were able to package an application artifact in a container — sometimes all by themselves. And each container image could technically run anywhere — especially in cloud infrastructure. No more needing to buy VM licenses, look for Rackspace and spare servers, and no more contacting the IT Ops department to request provisioning. Unfortunately, the continuing journey of deploying containers throughout all enterprise IT estates hasn’t been all smooth sailing. Dev teams are confronted with an ever-increasing array of options for building and configuring multiple container images to support unique application requirements and different underlying flavors of commercial and open-source platforms. Even if a developer becomes an expert in docker build, and the team has enough daily time to keep track of changes across all components and dependencies, they are likely to see functional and security gaps appearing within their expanding container fleet. Fortunately, we are seeing a bright spot in the evolution of Cloud Native Buildpacks, an open-source implementation project pioneered at Heroku and adopted early at Pivotal, which is now under the wing of the CNCF. Paketo Buildpacks is an open-source implementation of Cloud Native Buildpacks currently owned by the Cloud Foundry Foundation. Paketo automatically compiles and encapsulates developer application code into containers. Here’s how this latest iteration of buildpacks supports several important developer preferences and development team initiatives. Open Source Interoperability Modern developers appreciate the ability to build on open-source technology whenever they can, but it’s not always that simple to decide between open-source solutions when vendors and end-user companies have already made architectural decisions and set standards. Even in an open-source-first shop, many aspects of the environment will be vendor-supported and offer opinionated stacks for specific delivery platforms. Developers love to utilize buildpacks because they allow them to focus on coding business logic, rather than the infinite combinations of deployment details. Dealing with both source and deployment variability is where Paketo differentiates itself from previous containerization approaches. So, it doesn’t matter whether the developer codes in Java, Go, nodeJS, or Python, Paketo can compile ready-to-run containers. And, it doesn’t matter which cloud IaaS resource or on-prem server it runs on. “I think we're seeing a lot more developers who have a custom platform with custom stacks, but they keep coming back to Paketo Buildpacks because they can actually plug them into a modular system,” said Forest Eckhardt, contributor and maintainer to the Paketo project. “I think that adoption is going well, a lot of the adopters that we see are DevOps or Operations leaders who are trying to deliver applications for their clients and external teams.” Platform Engineering With Policy Platform engineering practices give developers shared, self-service resources and environments for development work, reducing setup costs and time, and encouraging code, component, and configuration reuse. 
These common platform engineering environments can be offered within a self-service internal portal or an external partner development portal, sometimes accompanied by support from a platform team that curates and reviews all elements of the platform. If the shared team space has too many random uploads, developers will not be able to distinguish the relative utility or safety of various unvalidated container definitions and packages. Proper governance means giving developers the ability to build to spec — without having to slog through huge policy checklists. Buildpacks take much of the effort and risk out of the ‘last mile’ of platform engineering. Developers can simply bring their code, and Paketo Buildpacks detects the language, gathers dependencies, and builds a valid container image that fits within the chosen methodology and policies of the organization. DevOps-Speed Automation In addition to empowering developers with self-service resources, automating everything as much as possible is another core tenet of the DevOps movement. DevOps is usually represented as a continuous infinity loop, where each change the team promotes in the design/development/build/deploy lifecycle should be executed by automated processes, including production monitoring and feedback to drive the next software delivery cycle. Any manual intervention in the lifecycle should be looked at as the next potential constraint to be addressed. If developers are spending time setting up Dockerfiles and validating containers, that’s less time spent creating new functionality or debugging critical issues. Software Supply Chain Assurance Developers want to move fast, so they turn to existing code and infrastructure examples that are working for peers. Heaps of downloadable packages and source code snippets are ready to go on npm StackOverflow and DockerHub – many with millions of downloads and lots of upvotes and review stars. The advent of such public development resources and git-style repositories offers immense value for the software industry as a whole, but by nature, it also provides an ideal entry point for software supply chain (or SSC) attacks. Bad actors can insert malware and irresponsible ones can leave behind vulnerabilities. Scanning an application once exploits are baked in can be difficult. It’s about time the software industry started taking a page from other discrete industries like high-tech manufacturing and pharmaceuticals that rely on tight governance of their supply chains to maximize customer value with reduced risk. For instance, an automotive brand would want to know the provenance of every part that goes into a car they manufacture, a complete bill-of-materials (or BOM) including both its supplier history and its source material composition. Paketo Buildpacks automatically generates an SBOM (software bill-of-materials) during each build process, attached to the image, so there’s no need to rely on external scanning tools. The SBOM documents information about every component in the packaged application, for instance, that it was written with Go version 1.22.3, even though that original code was compiled. The Intellyx Take Various forms of system encapsulation routines have been around for years, well before Docker appeared. Hey, containers even existed on mainframes. But there’s something distinct about this current wave of containerization for a cloud-native world. 
Paketo Buildpacks provides application delivery teams with total flexibility in selecting their platforms and open-source components of choice, with automation and reproducibility. Developers can successfully build the same app, in the same way, thousands of times in a row, even if underlying components are updated. That’s why so many major development shops are moving toward modern buildpacks, and removing the black box around containerization — no matter what deployment platform and methodology they espouse. ©2024 Intellyx B.V. Intellyx is editorially responsible for this document. At the time of writing, Cloud Foundry Foundation is an Intellyx customer. No AI bots were used to write this content. Image source: Adobe Express AI.
We have a somewhat bare-bones chat service in our series so far. Our service exposes endpoints for managing topics and letting users post messages in topics. For a demo, we have been using a makeshift in-memory store that shamelessly provides no durability guarantees. A basic and essential building block in any (web) service is a data store (for storing, organizing, and retrieving data securely and efficiently). In this tutorial, we will improve the durability, organization, and persistence of data by introducing a database. There are several choices of databases: in-memory (a very basic form of which we have used earlier), object-oriented databases, key-value stores, relational databases, and more. We will not repeat an in-depth comparison of these here and instead defer to others. Furthermore, in this article, we will use a relational (SQL) database as our underlying data store. We will use the popular GORM library (an ORM framework) to simplify access to our database. There are several relational databases available, both free as well as commercial. We will use Postgres (a very popular, free, lightweight, and easy-to-manage database) for our service. Postgres is also an ideal choice for a primary source-of-truth data store because of the strong durability and consistency guarantees it provides. Setting Up the Database A typical pattern when using a database in a service is: |---------------| |-----------| |------------| |------| | Request Proto | <-> | Service | <-> | ORM/SQL | <-> | DB | |---------------| |-----------| |------------| |------| A gRPC request is received by the service (we have not shown the REST Gateway here). The service converts the model proto (e.g., Topic) contained in the request (e.g., CreateTopicRequest) into the ORM library. The ORM library generates the necessary SQL and executes it on the DB (and returns any results). Setting Up Postgres We could go the traditional way of installing Postgres (by downloading and installing its binaries for the specific platforms). However, this is complicated and brittle. Instead, we will start using Docker (and Docker Compose) going forward for a compact developer-friendly setup. Set Up Docker Set up Docker Desktop for your platform following the instructions. Add Postgres to Docker Compose Now that Docker is set up, we can add different containers to this so we can build out the various components and services OneHub requires. docker-compose.yml: version: '3.9' services: pgadmin: image: dpage/pgadmin4 ports: - ${PGADMIN_LISTEN_PORT}:${PGADMIN_LISTEN_PORT} environment: PGADMIN_LISTEN_PORT: ${PGADMIN_LISTEN_PORT} PGADMIN_DEFAULT_EMAIL: ${PGADMIN_DEFAULT_EMAIL} PGADMIN_DEFAULT_PASSWORD: ${PGADMIN_DEFAULT_PASSWORD} volumes: - ./.pgadmin:/var/lib/pgadmin postgres: image: postgres:15.3 environment: POSTGRES_DB: ${POSTGRES_DB} POSTGRES_USER: ${POSTGRES_USER} POSTGRES_PASSWORD: ${POSTGRES_PASSWORD} volumes: - ./.pgdata:/var/lib/postgresql/data ports: - 5432:5432 healthcheck: test: ["CMD-SHELL", "pg_isready -U postgres"] interval: 5s timeout: 5s retries: 5 That's it. A few key things to note are: The Docker Compose file is an easy way to get started with containers - especially on a single host without needing complicated orchestration engines (hint: Kubernetes). The main part of Docker Compose files are the service sections that describe the containers for each of the services that Docker Compose will be executing as a "single unit in a private network." 
This is a great way to package multiple related services needed for an application and bring them all up and down in one step instead of having to manage them one by one individually. The latter is not just cumbersome, but also error-prone (manual dependency management, logging, port checking, etc). For now, we have added one service - postgres - running on port 5432. Since the services are running in an isolated context, environment variables can be set to initialize/control the behavior of the services. These environment variables are read from a specific .env file (below). This file can also be passed as a CLI flag or as a parameter, but for now, we are using the default .env file. Some configuration parameters here are the Postgres username, password, and database name. .env: POSTGRES_DB=onehubdb POSTGRES_USER=postgres POSTGRES_PASSWORD=docker ONEHUB_DB_ENDOINT=postgres://postgres:docker@postgres:5432/onehubdb PGADMIN_LISTEN_PORT=5480 PGADMIN_DEFAULT_EMAIL=admin@onehub.com PGADMIN_DEFAULT_PASSWORD=password All data in a container is transient and is lost when the container is shut down. In order to make our database durable, we will store the data outside the container and map it as a volume. This way from within the container, Postgres will read/write to its local directory (/var/lib/postgresql/data) even though all reads/writes are sent to the host's file system (./.pgdata) Another great benefit of using Docker is that all the ports used by the different services are "internal" to the network that Docker creates. This means the same postgres service (which runs on port 5432) can be run on multiple Docker environments without having their ports changed or checked for conflicts. This works because, by default, ports used inside a Docker environment are not exposed outside the Docker environment. Here we have chosen to expose port 5432 explicitly in the ports section of docker-compose.yml. That's it. Go ahead and bring it up: docker compose up If all goes well, you should see a new Postgres database created and initialized with our username, password, and DB parameters from the .env file. The database is now ready: onehub-postgres-1 | 2023-07-28 22:52:32.199 UTC [1] LOG: starting PostgreSQL 15.3 (Debian 15.3-1.pgdg120+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit onehub-postgres-1 | 2023-07-28 22:52:32.204 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432 onehub-postgres-1 | 2023-07-28 22:52:32.204 UTC [1] LOG: listening on IPv6 address "::", port 5432 onehub-postgres-1 | 2023-07-28 22:52:32.209 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" onehub-postgres-1 | 2023-07-28 22:52:32.235 UTC [78] LOG: database system was shut down at 2023-07-28 22:52:32 UTC onehub-postgres-1 | 2023-07-28 22:52:32.253 UTC [1] LOG: database system is ready to accept connections The OneHub Docker application should now show up on the Docker desktop and should look something like this: (Optional) Setup a DB Admin Interface If you would like to query or interact with the database (outside code), pgAdmin and adminer are great tools. They can be downloaded as native application binaries, installed locally, and played. This is a great option if you would like to manage multiple databases (e.g., across multiple Docker environments). ... Alternatively ... If it is for this single project and downloading yet another (native app) binary is undesirable, why not just include it as a service within Docker itself!? 
With that added, our docker-compose.yml now looks like this: docker-compose.yml: version: '3.9' services: pgadmin: image: dpage/pgadmin4 ports: - ${PGADMIN_LISTEN_PORT}:${PGADMIN_LISTEN_PORT} environment: PGADMIN_LISTEN_PORT: ${PGADMIN_LISTEN_PORT} PGADMIN_DEFAULT_EMAIL: ${PGADMIN_DEFAULT_EMAIL} PGADMIN_DEFAULT_PASSWORD: ${PGADMIN_DEFAULT_PASSWORD} volumes: - ./.pgadmin:/var/lib/pgadmin postgres: image: postgres:15.3 environment: POSTGRES_DB: ${POSTGRES_DB} POSTGRES_USER: ${POSTGRES_USER} POSTGRES_PASSWORD: ${POSTGRES_PASSWORD} volumes: - ./.pgdata:/var/lib/postgresql/data ports: - 5432:5432 healthcheck: test: ["CMD-SHELL", "pg_isready -U postgres"] interval: 5s timeout: 5s retries: 5 The accompanying environment variables are in our .env file: .env: POSTGRES_DB=onehubdb POSTGRES_USER=postgres POSTGRES_PASSWORD=docker ONEHUB_DB_ENDOINT=postgres://postgres:docker@postgres:5432/onehubdb PGADMIN_LISTEN_PORT=5480 PGADMIN_DEFAULT_EMAIL=admin@onehub.com PGADMIN_DEFAULT_PASSWORD=password Now you can simply visit the pgAdmin web console on your browser. Use the email and password specified in the .env file and off you go! To connect to the Postgres instance running in the Docker environment, simply create a connection to postgres (NOTE: container local DNS names within the Docker environment are the service names themselves). On the left-side Object Explorer panel, (right) click on Servers >> Register >> Server... and give a name to your server ("postgres"). In the Connection tab, use the hostname "postgres" and set the names of the database, username, and password as set in the .env file for the POSTGRES_DB, POSTGRES_USER, and POSTGRES_PASSWORD variables respectively. Click Save, and off you go! Introducing Object Relational Mappers (ORMs) Before we start updating our service code to access the database, you may be wondering why the gRPC service itself is not packaged in our docker-compose.yml file. Without this, we would still have to start our service from the command line (or a debugger). This will be detailed in a future post. In a typical database, initialization (after the user and DB setup) would entail creating and running SQL scripts to create tables, checking for new versions, and so on. One example of a table creation statement (that can be executed via psql or pgadmin) is: CREATE TABLE topics ( id STRING NOT NULL PRIMARY KEY, created_at DATETIME DEFAULT CURRENT_TIMESTAMP, updated_at DATETIME DEFAULT CURRENT_TIMESTAMP, name STRING NOT NULL, users TEXT[], ); Similarly, an insertion would also have been manual construction of SQL statements, e.g.: INSERT INTO topics ( id, name ) VALUES ( "1", "Taylor Swift" ); ... followed by a verification of the saved results: select * from topics ; This can get pretty tedious (and error-prone with vulnerability to SQL injection attacks). SQL expertise is highly valuable but seldom feasible - especially being fluent with the different standards, different vendors, etc. Even though Postgres does a great job in being as standards-compliant as possible - for developers - some ease of use with databases is highly desirable. Here ORM libraries are indispensable, especially for developers not dealing with SQL on a regular basis (e.g., yours truly). ORM (Object Relational Mappers) provide an object-like interface to a relational database. This simplifies access to data in our tables (i.e., rows) as application-level classes (Data Access Objects). Table creations and migrations can also be managed by ORM libraries. 
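As a rough sketch of that last point (assuming GORM models like the ones defined in the next section, and a connection string derived from the .env file above, with localhost used when running outside the Compose network), table creation can be driven from code instead of hand-written SQL:

```go
package main

import (
	"log"

	"gorm.io/driver/postgres"
	"gorm.io/gorm"

	ds "github.com/panyam/onehub/datastore" // datastore package used elsewhere in this series
)

func main() {
	// Credentials and database name match the Postgres service in docker-compose.yml.
	dsn := "postgres://postgres:docker@localhost:5432/onehubdb"
	db, err := gorm.Open(postgres.Open(dsn), &gorm.Config{})
	if err != nil {
		log.Fatal(err)
	}

	// AutoMigrate creates (or alters) the topics, users, and messages tables
	// from the struct definitions and their gorm tags.
	if err := db.AutoMigrate(&ds.Topic{}, &ds.User{}, &ds.Message{}); err != nil {
		log.Fatal(err)
	}
}
```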
Behind the scenes, ORM libraries are generating and executing SQL queries on the underlying databases they accessing. There are downsides to using an ORM: ORMs still incur a learning cost for developers during adoption. Interface design choices can play a role in impacting developer productivity. ORMs can be thought of as a schema compiler. The underlying SQL generated by them may not be straightforward or efficient. This results in ORM access to a database being slower than raw SQL, especially for complex queries. However, for complex queries or complex data pattern accesses, other scalability techniques may need to be applied (e.g., sharding, denormalization, etc.). The queries generated by ORMs may not be clear or straightforward, resulting in increased debugging times on slow or complex queries. Despite these downsides, ORMs can be put to good use when not overly relied upon. We shall use a popular ORM library, GORM. GORM comes with a great set of examples and documentation and the quick start is a great starting point. Create DB Models GORM models are our DB models. GORM models are simple Golang structs with struct tags on each member to identify the member's database type. Our User, Topic and Message models are simply this: Topic, Message, User Models package datastore import ( "time" "github.com/lib/pq" ) type BaseModel struct { CreatedAt time.Time UpdatedAt time.Time Id string `gorm:"primaryKey"` Version int // used for optimistic locking } type User struct { BaseModel Name string Avatar string ProfileData map[string]interface{} `gorm:"type:json"` } type Topic struct { BaseModel CreatorId string Name string `gorm:"index:SortedByName"` Users pq.StringArray `gorm:"type:text[]"` } type Message struct { BaseModel ParentId string TopicId string `gorm:"index:SortedByTopicAndCreation,priority:1"` CreatedAt time.Time `gorm:"index:SortedByTopicAndCreation,priority:2"` SourceId string UserId string ContentType string ContentText string ContentData map[string]interface{} `gorm:"type:json"` } Why are these models needed when we have already defined models in our .proto files? Recall that the models we use need to reflect the domain they are operating in. For example, our gRPC structs (in .proto files) reflect the models and programming models from the application's perspective. If/When we build a UI, view-models would reflect the UI/view perspectives (e.g., a FrontPage view model could be a merge of multiple data models). Similarly, when storing data in a database, the models need to convey intent and type information that can be understood and processed by the database. This is why GORM expects data models to have annotations on its (struct) member variables to convey database-specific information like column types, index definitions, index column orderings, etc. A good example of this in our data model is the SortByTopicAndCreation index (which, as the name suggests, helps us list topics sorted by their creation timestamp). Database indexes are one or more (re)organizations of data in a database that speed up retrievals of certain queries (at the cost of increased write times and storage space). We won't go into indexes deeply. There are fantastic resources that offer a deep dive into the various internals of database systems in great detail (and would be highly recommended). The increased writes and storage space must be considered when creating more indexes in a database. 
We have (in our service) been mindful about creating more indexes and kept these to the bare minimum (to suit certain types of queries). As we scale our services (in future posts) we will revisit how to address these costs by exploring asynchronous and distributed index-building techniques. Data Access Layer Conventions We now have DB models. We could at this point directly call the GORM APIs from our service implementation to read and write data from our (Postgres) database; but first, a brief detail on the conventions we have decided to choose. Motivations Database use can be thought of as being in two extreme spectrums: On the one hand, a "database" can be treated as a better filesystem with objects written by some key to prevent data loss. Any structure, consistency guarantees, optimization, or indexes are fully the responsibility of the application layer. This gets very complicated, error-prone, and hard very fast. On the other extreme, use the database engine as the undisputed brain (the kitchen sink) of your application. Every data access for every view in your application is offered (only) by one or very few (possibly complex) queries. This view, while localizing data access in a single place, also makes the database a bottleneck and its scalability daunting. In reality, vertical scaling (provisioning beefier machines) is the easiest, but most expensive solution - which most vendors will happily recommend in such cases. Horizontal scaling (getting more machines) is hard as increased data coupling and probabilities of node failures (network partitions) mean more complicated and careful tradeoffs between consistency and availability. Our sweet spot is somewhere in between. While ORMs (like GORM) provide an almost 1:1 interface compatibility between SQL and the application needs, being judicious with SQL remains advantageous and should be based on the (data and operational) needs of the application. For our chat application, some desirable (data) traits are: Messages from users must not be lost (durability). Ordering of messages is important (within a topic). Few standard query types: CRUD on Users, Topics, and Messages Message ordering by timestamp but limited to either within a topic or by a user (for last N messages) Given our data "shapes" are simple and given the read usage of our system is much higher especially given the read/write application (i.e .,1 message posted is read by many participants on a Topic), we are choosing to optimize for write consistency, simplicity and read availability, within a reasonable latency). Now we are ready to look at the query patterns/conventions. Unified Database Object First, we will add a simple data access layer that will encapsulate all the calls to the database for each particular model (topic, messages, users). Let us create an overarching "DB" object that represents our Postgres DB (in db/db.go): type OneHubDB struct { storage *gorm.DB } This tells GORM that we have a database object (possibly with a connection) to the underlying DB. The Topic Store, User Store, and Message Store modules all operate on this single DB instance (via GORM) to read/write data from their respective tables (topics, users, messages). Note that this is just one possible convention. We could have instead used three different DB (gorm.DB) instances, one for each entity type: e.g., TopicDB, UserDB, and MessageDB. 
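The snippet above omits the constructor and the MaxPageSize field that the List methods below rely on; a rough sketch of what those might look like (the actual repository may define these slightly differently):

```go
package datastore

import "gorm.io/gorm"

// Illustrative default; the real value is a tuning choice.
const DefaultMaxPageSize = 1000

type OneHubDB struct {
	storage     *gorm.DB
	MaxPageSize int
}

// NewOneHubDB wraps a single gorm.DB handle that the topic, user, and
// message stores all share.
func NewOneHubDB(db *gorm.DB) *OneHubDB {
	return &OneHubDB{storage: db, MaxPageSize: DefaultMaxPageSize}
}
```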
Use Custom IDs Instead of Auto-Generated Ones We are choosing to generate our own primary keys (IDs) for topics, users, and messages instead of depending on the auto-increment (or auto-ID) generation by the database engine. This is for the following reasons: An auto-generated key is localized to the database instance that generates it. This means that if/when we add more partitions to our databases (for horizontal scaling) these keys will need to be synchronized, and migrating existing keys to avoid duplications at a global level is much harder. Auto-increment keys offer reduced randomness, making it easy for attackers to "iterate" through all entities. Sometimes we may simply want string keys that are custom assignable if they are available (for SEO purposes). Lack of attribution to keys (e.g., a central/global key server can also allow attribution/annotation to keys for analytics purposes). For these purposes, we have added a GenId table that keeps track of all used IDs so we can perform collision detection, etc.: type GenId struct { Class string `gorm:"primaryKey"` Id string `gorm:"primaryKey"` CreatedAt time.Time } Naturally, this is not a scalable solution when the data volume is large, but it suffices for our demo, and when needed, we can move this table to a different DB and still preserve the keys/IDs. Note that GenId itself is also managed by GORM and uses a combination of Class + Id as its primary key. An example of this is Class=Topic and Id=123. Random IDs are assigned by the application in a simple manner: func randid(maxlen int) string { max_id := int64(math.Pow(36, maxlen)) randval := rand.Int63() % max_id return strconv.FormatInt(randval, 36) } func (tdb *OneHubDB) NextId(cls string) string { for { gid := GenId{Id: randid(4), Class: cls, CreatedAt: time.Now()} err := tdb.storage.Create(&gid).Error if err == nil { return gid.Id } log.Println("ID Create Error: ", err) } } The method randid generates a string of up to maxlen random base-36 characters: it takes a random 63-bit integer modulo max_id (where max_id = 36 ^ maxlen) and formats the result in base 36. The NextId method is used by the different entity create methods (below) to repeatedly generate random IDs until there is no collision (here with a fixed length of 4, matching the IDs seen later in this post). In case you are worried about excessive collisions or are interested in understanding their probabilities, you can learn about them here. Judicious Use of Indexes Indexes are very beneficial to speed up certain data retrieval operations at the expense of increased writes and storage. We have limited our use of indexes to a small handful of cases where strong consistency was needed (and could be scaled easily): Topics sorted by name (for an alphabetical sorting of topics) Messages sorted by topic and creation timestamp (for the message list's natural ordering) What is the impact of this on our application? Let us find out. Topic Creations and Indexes When a topic is created (or updated) an index write is required. Topic creations/updates are relatively low-frequency operations (compared to message postings), so a slightly increased write latency is acceptable. In a more realistic chat application, a topic creation is a bit more heavyweight due to the need to check permissions, apply compliance rules, etc., so this latency hit is acceptable. Furthermore, this index is only needed when "searching" for topics, so even an asynchronous index update would have sufficed. Message Related Indexes To consider the usefulness of indexes related to messages, let us look at some usage numbers.
This is a very simple application, so these scalability issues most likely won't be a concern (feel free to skip this section). If your goals are a bit more lofty, looking at Slack's usage numbers we can estimate/project some usage numbers for our own demo to make it interesting: Number of daily active topics: 100 Number of active users per topic: 10 Message sent by an active user in a topic: Every 5 minutes (assume time to type, read other messages, research, think, etc.) Thus, the number of messages created each day is: = 100 * 10 * (1440 minutes in a day / 5 minutes) = 288k messages per day ~ 3.3 messages per second In the context of these numbers, at roughly 3 messages per second, even with an extra index (or three), we can handle this load comfortably in a typical database that can handle 10k IOPS, which is rather modest. It is easy to wonder whether this scales as the number of topics, the number of active users per topic, or the creation frenzy increases. Let us consider a more intense setup (in a larger or busier organization). Instead of the numbers above, if we had 10k topics and 100 active users per topic with a message every minute (instead of every 5 minutes), our write QPS would be: WriteQPS: = 10000 * 100 * 1440 / 1 = 1.44B messages per day ~ 16.7k messages per second That is quite a considerable blow-up. We can solve this in a couple of ways: Accept a higher latency on writes - For example, instead of requiring a write to happen in a few milliseconds, accept an SLO of, say, 500ms. Update indexes asynchronously - This doesn't get us that much further, as the number of writes in the system has not changed - only the when has changed. Shard our data Let us look at sharding! Our write QPS is in aggregate. On a per-topic level, it is quite low (16.7k / 10000 ~ 1.7 qps). However, user behavior for our application is such that activity on a topic is fairly isolated. We only want our messages to be consistent and ordered within a topic - not globally. We now have the opportunity to dynamically scale our databases (or the Messages tables) to be partitioned by topic IDs. In fact, we could build a layer (a control plane) that dynamically spins up database shards and moves topics around in reaction to load as and when needed. We will not go that extreme here, but this series is tending toward just that, especially in the context of SaaS applications. The annoyed reader might be wondering if this deep dive was needed right now! Perhaps not - but by understanding our data and user-experience needs, we can make careful tradeoffs. Going forward, such mini-dives will benefit us immensely in quickly evaluating tradeoffs (e.g., when building/adding new features). Store Specific Implementations Now that we have our basic DB and common methods, we can go to each of the entity methods' implementations. For each of our entities, we will create the basic CRUD methods: Create Update Get Delete List/Search The Create and Update methods are combined into a single "Save" method that does the following: If an ID is not provided, treat it as a create (using the NextId method to assign one if necessary). If an ID is provided, treat it as an update-or-insert (upsert) operation. Since we have a base model, Create and Update set the CreatedAt and UpdatedAt fields respectively. The delete method is straightforward. The only key thing here is that instead of leveraging GORM's cascading delete capabilities, we also delete the related entities in a separate call.
We will not worry about consistency issues resulting from this (e.g., errors in subsequent delete methods). For the Get method, we will fetch using a standard GORM get-query-pattern based on a common id column we use for all models. If an entity does not exist, then we return a nil. Users DB Our user entity methods are pretty straightforward using the above conventions. The Delete method additionally also deletes all Messages for/by the user first before deleting the user itself. This ordering is to ensure that if the deletion of topics fails, then the user deletion won't proceed giving the caller to retry. package datastore import ( "errors" "log" "strings" "time" "gorm.io/gorm" ) func (tdb *OneHubDB) SaveUser(topic *User) (err error) { db := tdb.storage topic.UpdatedAt = time.Now() if strings.Trim(topic.Id, " ") == "" { return InvalidIDError // create a new one } result := db.Save(topic) err = result.Error if err == nil && result.RowsAffected == 0 { topic.CreatedAt = time.Now() err = tdb.storage.Create(topic).Error } return } func (tdb *OneHubDB) DeleteUser(topicId string) (err error) { err = tdb.storage.Where("topic_id = ?", topicId).Delete(&Message{}).Error if err == nil { err = tdb.storage.Where("id = ?", topicId).Delete(&User{}).Error } return } func (tdb *OneHubDB) GetUser(id string) (*User, error) { var out User err := tdb.storage.First(&out, "id = ?", id).Error if err != nil { log.Println("GetUser Error: ", id, err) if errors.Is(err, gorm.ErrRecordNotFound) { return nil, nil } else { return nil, err } } return &out, err } func (tdb *OneHubDB) ListUsers(pageKey string, pageSize int) (out []*User, err error) { query := tdb.storage.Model(&User{}).Order("name asc") if pageKey != "" { count := 0 query = query.Offset(count) } if pageSize <= 0 || pageSize > tdb.MaxPageSize { pageSize = tdb.MaxPageSize } query = query.Limit(pageSize) err = query.Find(&out).Error return out, err } Topics DB Our topic entity methods are also pretty straightforward using the above conventions. The Delete method additionally also deletes all messages in the topic first before deleting the user itself. This ordering is to ensure that if the deletion of topics fails then the user deletion won't proceed giving the caller a chance to retry. 
Topic entity methods: package datastore import ( "errors" "log" "strings" "time" "gorm.io/gorm" ) /////////////////////// Topic DB func (tdb *OneHubDB) SaveTopic(topic *Topic) (err error) { db := tdb.storage topic.UpdatedAt = time.Now() if strings.Trim(topic.Id, " ") == "" { return InvalidIDError // create a new one } result := db.Save(topic) err = result.Error if err == nil && result.RowsAffected == 0 { topic.CreatedAt = time.Now() err = tdb.storage.Create(topic).Error } return } func (tdb *OneHubDB) DeleteTopic(topicId string) (err error) { err = tdb.storage.Where("topic_id = ?", topicId).Delete(&Message{}).Error if err == nil { err = tdb.storage.Where("id = ?", topicId).Delete(&Topic{}).Error } return } func (tdb *OneHubDB) GetTopic(id string) (*Topic, error) { var out Topic err := tdb.storage.First(&out, "id = ?", id).Error if err != nil { log.Println("GetTopic Error: ", id, err) if errors.Is(err, gorm.ErrRecordNotFound) { return nil, nil } else { return nil, err } } return &out, err } func (tdb *OneHubDB) ListTopics(pageKey string, pageSize int) (out []*Topic, err error) { query := tdb.storage.Model(&Topic{}).Order("name asc") if pageKey != "" { count := 0 query = query.Offset(count) } if pageSize <= 0 || pageSize > tdb.MaxPageSize { pageSize = tdb.MaxPageSize } query = query.Limit(pageSize) err = query.Find(&out).Error return out, err } Messages DB Message entity methods: package datastore import ( "errors" "strings" "time" "gorm.io/gorm" ) func (tdb *OneHubDB) GetMessages(topic_id string, user_id string, pageKey string, pageSize int) (out []*Message, err error) { user_id = strings.Trim(user_id, " ") topic_id = strings.Trim(topic_id, " ") if user_id == "" && topic_id == "" { return nil, errors.New("Either topic_id or user_id or both must be provided") } query := tdb.storage if topic_id != "" { query = query.Where("topic_id = ?", topic_id) } if user_id != "" { query = query.Where("user_id = ?", user_id) } if pageKey != "" { offset := 0 query = query.Offset(offset) } if pageSize <= 0 || pageSize > 10000 { pageSize = 10000 } query = query.Limit(pageSize) err = query.Find(&out).Error return out, err } // Get messages in a topic paginated and ordered by creation time stamp func (tdb *OneHubDB) ListMessagesInTopic(topic_id string, pageKey string, pageSize int) (out []*Topic, err error) { err = tdb.storage.Where("topic_id= ?", topic_id).Find(&out).Error return } func (tdb *OneHubDB) GetMessage(msgid string) (*Message, error) { var out Message err := tdb.storage.First(&out, "id = ?", msgid).Error if err != nil { if errors.Is(err, gorm.ErrRecordNotFound) { return nil, nil } else { return nil, err } } return &out, err } func (tdb *OneHubDB) ListMessages(topic_id string, pageKey string, pageSize int) (out []*Message, err error) { query := tdb.storage.Where("topic_id = ?").Order("created_at asc") if pageKey != "" { count := 0 query = query.Offset(count) } if pageSize <= 0 || pageSize > tdb.MaxPageSize { pageSize = tdb.MaxPageSize } query = query.Limit(pageSize) err = query.Find(&out).Error return out, err } func (tdb *OneHubDB) CreateMessage(msg *Message) (err error) { msg.CreatedAt = time.Now() msg.UpdatedAt = time.Now() result := tdb.storage.Model(&Message{}).Create(msg) err = result.Error return } func (tdb *OneHubDB) DeleteMessage(msgId string) (err error) { err = tdb.storage.Where("id = ?", msgId).Delete(&Message{}).Error return } func (tdb *OneHubDB) SaveMessage(msg *Message) (err error) { db := tdb.storage q := db.Model(msg).Where("id = ? 
and version = ?", msg.Id, msg.Version) msg.UpdatedAt = time.Now() result := q.UpdateColumns(map[string]interface{}{ "updated_at": msg.UpdatedAt, "content_type": msg.ContentType, "content_text": msg.ContentText, "content_data": msg.ContentData, "user_id": msg.SourceId, "source_id": msg.SourceId, "parent_id": msg.ParentId, "version": msg.Version + 1, }) err = result.Error if err == nil && result.RowsAffected == 0 { // Must have failed due to versioning err = MessageUpdateFailed } return } The Messages entity methods are slightly more involved. Unlike the other two, Messages entity methods also include Searching by Topic and Searching by User (for ease). This is done in the GetMessages method that provides paginated (and ordered) retrieval of messages for a topic or by a user. Write Converters To/From Service/DB Models We are almost there. Our database is ready to read/write data. It just needs to be invoked by the service. Going back to our original plan: |---------------| |-----------| |--------| |------| | Request Proto | <-> | Service | <-> | GORM | <-> | DB | |---------------| |-----------| |--------| |------| We have our service models (generated by protobuf tools) and we have our DB models that GORM understands. We will now add converters to convert between the two. Converters for entity X will follow these conventions: A method XToProto of type func(input *datastore.X) (out *protos.X) A method XFromProto of type func(input *protos.X) (out *datastore.X) With that one of our converters (for Topics) is quite simply (and boringly): package services import ( "log" "github.com/lib/pq" ds "github.com/panyam/onehub/datastore" protos "github.com/panyam/onehub/gen/go/onehub/v1" "google.golang.org/protobuf/types/known/structpb" tspb "google.golang.org/protobuf/types/known/timestamppb" ) func TopicToProto(input *ds.Topic) (out *protos.Topic) { var userIds map[string]bool = make(map[string]bool) for _, userId := range input.Users { userIds[userId] = true } out = &protos.Topic{ CreatedAt: tspb.New(input.BaseModel.CreatedAt), UpdatedAt: tspb.New(input.BaseModel.UpdatedAt), Name: input.Name, Id: input.BaseModel.Id, CreatorId: input.CreatorId, Users: userIds, } return } func TopicFromProto(input *protos.Topic) (out *ds.Topic) { out = &ds.Topic{ BaseModel: ds.BaseModel{ CreatedAt: input.CreatedAt.AsTime(), UpdatedAt: input.UpdatedAt.AsTime(), Id: input.Id, }, Name: input.Name, CreatorId: input.CreatorId, } if input.Users != nil { var userIds []string for userId := range input.Users { userIds = append(userIds, userId) } out.Users = pq.StringArray(userIds) } return } The full set of converters can be found here - Service/DB Models Converters. Hook Up the Converters in the Service Definitions Our last step is to invoke the converters above in the service implementation. The methods are pretty straightforward. For example, for the TopicService we have: CreateTopic During creation we allow custom IDs to be passed in. If an entity with the ID exists the request is rejected. If an ID is not passed in, a random one is assigned. Creator and Name parameters are required fields. The topic is converted to a "DBTopic" model and saved by calling the SaveTopic method. UpdateTopic All our Update<Entity> methods follow a similar pattern: Fetch the existing entity (by ID) from the DB. Update the entity fields based on fields marked in the update_mask (so patches are allowed). Update with any extra entity-specific operations (e.g., AddUsers, RemoveUsers, etc.) 
Hook Up the Converters in the Service Definitions

Our last step is to invoke the converters above in the service implementation. The methods are pretty straightforward. For example, for the TopicService we have:

CreateTopic

During creation we allow custom IDs to be passed in. If an entity with that ID already exists, the request is rejected. If an ID is not passed in, a random one is assigned. The Creator and Name parameters are required fields. The topic is converted to a "DB Topic" model and saved by calling the SaveTopic method.

UpdateTopic

All our Update<Entity> methods follow a similar pattern:

1. Fetch the existing entity (by ID) from the DB.
2. Update the entity fields based on the fields marked in the update_mask (so patches are allowed).
3. Apply any extra entity-specific operations (e.g., AddUsers, RemoveUsers, etc.) - these are just for convenience, so the caller does not have to provide an entire "final" users list each time.
4. Convert the updated proto to a "DB Model."
5. Call SaveTopic on the DB.

SaveTopic uses the "version" field in our DB to perform an optimistically concurrent write. This ensures that if another request/thread wrote the record between our read and our write, our stale update is rejected instead of silently overwriting the newer version.

The Delete, List, and Get methods are fairly straightforward. The UserService and MessageService are implemented in a very similar way, with minor differences to suit their specific requirements.

Testing It All Out

We have a database up and running (go ahead and start it with docker compose up). We have converters to/from service and database models. We have implemented our service code to access the database. We just need to connect to this (running) database and pass a connection object to our services in our runner binary (cmd/server.go).

Add an extra flag to accept a path to the DB. This can be used to change the DB path if needed:

var (
    addr    = flag.String("addr", ":9000", "Address to start the onehub grpc server on.")
    gw_addr = flag.String("gw_addr", ":8080", "Address to start the grpc gateway server on.")
    db_endpoint = flag.String("db_endpoint", "",
        fmt.Sprintf("Endpoint of DB where all topics/messages state are persisted. Default value: ONEHUB_DB_ENDPOINT environment variable or %s", DEFAULT_DB_ENDPOINT))
)

Create a *gorm.DB instance from the db_endpoint value. We have already created a little utility method for opening a GORM-compatible SQL DB given an address (cmd/utils/db.go):

package utils

import (
    "log"
    "strings"

    "github.com/panyam/goutils/utils"
    "gorm.io/driver/postgres"
    "gorm.io/driver/sqlite"
    "gorm.io/gorm"
)

func OpenDB(db_endpoint string) (db *gorm.DB, err error) {
    log.Println("Connecting to DB: ", db_endpoint)
    if strings.HasPrefix(db_endpoint, "sqlite://") {
        dbpath := utils.ExpandUserPath(db_endpoint[len("sqlite://"):])
        db, err = gorm.Open(sqlite.Open(dbpath), &gorm.Config{})
    } else if strings.HasPrefix(db_endpoint, "postgres://") {
        db, err = gorm.Open(postgres.Open(db_endpoint), &gorm.Config{})
    }
    if err != nil {
        log.Println("Cannot connect DB: ", db_endpoint, err)
    } else {
        log.Println("Successfully connected DB: ", db_endpoint)
    }
    return
}

Now let us create the method OpenOHDB, a simple wrapper that falls back to the ONEHUB_DB_ENDPOINT environment variable when the flag is not provided and then opens the gorm.DB instance needed by OneHubDB:

func OpenOHDB() *ds.OneHubDB {
    if *db_endpoint == "" {
        *db_endpoint = cmdutils.GetEnvOrDefault("ONEHUB_DB_ENDPOINT", DEFAULT_DB_ENDPOINT)
    }
    db, err := cmdutils.OpenDB(*db_endpoint)
    if err != nil {
        log.Fatal(err)
        panic(err)
    }
    return ds.NewOneHubDB(db)
}

With the above two, we need a simple change to our main method:

func main() {
    flag.Parse()
    ohdb := OpenOHDB()
    go startGRPCServer(*addr, ohdb)
    startGatewayServer(*gw_addr, *addr)
}

Now we also pass the ohdb instance to the gRPC service creation methods.
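For reference, the wiring inside startGRPCServer amounts to handing ohdb to each service when it is registered. The sketch below is illustrative only: the constructor names (NewTopicService, NewMessageService, NewUserService) and the services import path are assumptions for illustration, while the Register*ServiceServer functions follow the standard names generated by protoc-gen-go-grpc. Check the repository for the actual signatures.

// Sketch only: hypothetical constructors, shown to illustrate the wiring.
package main

import (
    "log"
    "net"

    ds "github.com/panyam/onehub/datastore"
    protos "github.com/panyam/onehub/gen/go/onehub/v1"
    "github.com/panyam/onehub/services"
    "google.golang.org/grpc"
)

func startGRPCServer(addr string, ohdb *ds.OneHubDB) {
    server := grpc.NewServer()

    // Each service receives the shared OneHubDB instance so its handlers can
    // call the datastore methods we wrote above.
    protos.RegisterTopicServiceServer(server, services.NewTopicService(ohdb))
    protos.RegisterMessageServiceServer(server, services.NewMessageService(ohdb))
    protos.RegisterUserServiceServer(server, services.NewUserService(ohdb))

    lis, err := net.Listen("tcp", addr)
    if err != nil {
        log.Fatal(err)
    }
    if err := server.Serve(lis); err != nil {
        log.Fatal(err)
    }
}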
And we are ready to test our durability! Remember that we set up auth in a previous part, so we need to pass login credentials, albeit fake ones (where password = login + "123").

Create a Topic

curl localhost:8080/v1/topics -u auser:auser123 | json_pp

{
   "nextPageKey" : "",
   "topics" : []
}

That's right. We do not have any topics yet, so let us create some:

curl -X POST localhost:8080/v1/topics \
    -u auser:auser123 \
    -H 'Content-Type: application/json' \
    -d '{"topic": {"name": "First Topic"}}' | json_pp

Yielding:

{
   "topic" : {
      "createdAt" : "1970-01-01T00:00:00Z",
      "creatorId" : "auser",
      "id" : "q43u",
      "name" : "First Topic",
      "updatedAt" : "2023-08-04T08:14:56.413050Z",
      "users" : {}
   }
}

Let us create a couple more:

curl -X POST localhost:8080/v1/topics \
    -u auser:auser123 \
    -H 'Content-Type: application/json' \
    -d '{"topic": {"name": "First Topic", "id": "1"}}' | json_pp

curl -X POST localhost:8080/v1/topics \
    -u auser:auser123 \
    -H 'Content-Type: application/json' \
    -d '{"topic": {"name": "Second Topic", "id": "2"}}' | json_pp

curl -X POST localhost:8080/v1/topics \
    -u auser:auser123 \
    -H 'Content-Type: application/json' \
    -d '{"topic": {"name": "Third Topic", "id": "3"}}' | json_pp

With a list query returning:

{
   "nextPageKey" : "",
   "topics" : [
      {
         "createdAt" : "1970-01-01T00:00:00Z",
         "creatorId" : "auser",
         "id" : "q43u",
         "name" : "First Topic",
         "updatedAt" : "2023-08-04T08:14:56.413050Z",
         "users" : {}
      },
      {
         "createdAt" : "1970-01-01T00:00:00Z",
         "creatorId" : "auser",
         "id" : "dejc",
         "name" : "Second Topic",
         "updatedAt" : "2023-08-05T06:52:33.923076Z",
         "users" : {}
      },
      {
         "createdAt" : "1970-01-01T00:00:00Z",
         "creatorId" : "auser",
         "id" : "zuoz",
         "name" : "Third Topic",
         "updatedAt" : "2023-08-05T06:52:35.100552Z",
         "users" : {}
      }
   ]
}

Get Topic by ID

We can do a listing as in the previous section. We can also fetch individual topics:

curl localhost:8080/v1/topics/q43u -u auser:auser123 | json_pp

{
   "topic" : {
      "createdAt" : "1970-01-01T00:00:00Z",
      "creatorId" : "auser",
      "id" : "q43u",
      "name" : "First Topic",
      "updatedAt" : "2023-08-04T08:14:56.413050Z",
      "users" : {}
   }
}

Send and List Messages on a Topic

Let us send a few messages on the "First Topic" (id = "q43u"):

curl -X POST localhost:8080/v1/topics/q43u/messages \
    -u 'auser:auser123' \
    -H 'Content-Type: application/json' \
    -d '{"message": {"content_text": "Message 1"}}'

curl -X POST localhost:8080/v1/topics/q43u/messages \
    -u 'auser:auser123' \
    -H 'Content-Type: application/json' \
    -d '{"message": {"content_text": "Message 2"}}'

curl -X POST localhost:8080/v1/topics/q43u/messages \
    -u 'auser:auser123' \
    -H 'Content-Type: application/json' \
    -d '{"message": {"content_text": "Message 3"}}'

Now to list them:

curl localhost:8080/v1/topics/q43u/messages -u 'auser:auser123' | json_pp

{
   "messages" : [
      {
         "contentData" : null,
         "contentText" : "Message 1",
         "contentType" : "",
         "createdAt" : "0001-01-01T00:00:00Z",
         "id" : "hlso",
         "topicId" : "q43u",
         "updatedAt" : "2023-08-07T05:00:36.547072Z",
         "userId" : "auser"
      },
      {
         "contentData" : null,
         "contentText" : "Message 2",
         "contentType" : "",
         "createdAt" : "0001-01-01T00:00:00Z",
         "id" : "t3lr",
         "topicId" : "q43u",
         "updatedAt" : "2023-08-07T05:00:39.504294Z",
         "userId" : "auser"
      },
      {
         "contentData" : null,
         "contentText" : "Message 3",
         "contentType" : "",
         "createdAt" : "0001-01-01T00:00:00Z",
         "id" : "8ohi",
         "topicId" : "q43u",
         "updatedAt" : "2023-08-07T05:00:42.598521Z",
         "userId" : "auser"
      }
   ],
   "nextPageKey" : ""
}
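Before wrapping up, recall the versioned writes from earlier: SaveMessage only updates a row when the stored version matches, and reports MessageUpdateFailed otherwise. A caller can use that to retry safely. The sketch below is illustrative only; it assumes MessageUpdateFailed is a plain sentinel error value (so errors.Is works) and uses only the datastore methods shown above.

package services

import (
    "errors"
    "fmt"

    ds "github.com/panyam/onehub/datastore"
)

// updateMessageText is a hypothetical helper showing one way to retry an
// optimistically concurrent update when another writer wins the race.
func updateMessageText(ohdb *ds.OneHubDB, msgId string, newText string) error {
    for attempt := 0; attempt < 3; attempt++ {
        // Reload the latest version of the message on every attempt.
        msg, err := ohdb.GetMessage(msgId)
        if err != nil {
            return err
        }
        if msg == nil {
            return fmt.Errorf("message %s not found", msgId)
        }
        msg.ContentText = newText
        err = ohdb.SaveMessage(msg)
        if err == nil {
            return nil // the versioned update went through
        }
        if !errors.Is(err, ds.MessageUpdateFailed) {
            return err // a real failure, not a version conflict
        }
        // Version conflict: someone else updated the message first; retry.
    }
    return fmt.Errorf("giving up after repeated version conflicts on %s", msgId)
}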
Conclusion

Who would have thought setting up and using a database would be such a meaty topic? We covered a lot of ground here that gives us both a good "functioning" service and a foundation for implementing new ideas in the future:

- We chose a relational database - Postgres - for its strong modeling capabilities, consistency guarantees, performance, and versatility.
- We also chose an ORM library (GORM) to improve our velocity, and our portability if we ever need to switch to another relational data store.
- We wrote data models that GORM can use to read from and write to the database.
- We eased the setup by hosting both Postgres and its admin UI (pgAdmin) in a Docker Compose file.
- We decided to use GORM carefully and judiciously, balancing velocity against reliance on complex queries.
- We discussed some conventions that will help us along in our application design and extensions.
- We outlined a way to assess, analyze, and address scalability challenges as they arise, and to use that analysis to guide our tradeoff decisions (e.g., the type and number of indexes).
- We wrote converter methods to convert between service and data models.
- We finally used those converters in our service to offer a "real" persistent implementation of a chat service where messages can be posted and read.

Now that we have a "minimum usable app," there are plenty of useful features to add to make the service more realistic (and hopefully production-ready). Take a breather, and see you soon as the adventure continues! In the next post, we will look at including our main binary (with the gRPC service and REST gateway) in the Docker Compose environment without sacrificing hot reloading and debugging.
In the rapidly evolving landscape of cloud computing, deploying Docker images across multiple Amazon Web Services (AWS) accounts presents a unique set of challenges and opportunities for organizations aiming for scalability and security. According to the State of DevOps Report 2022, 50% of DevOps adopters are recognized as elite or high-performing organizations. This guide offers a blueprint for leveraging AWS services such as ECS, CodePipeline, and CodeDeploy, combined with a Blue/Green deployment strategy, to facilitate seamless Docker deployments. It also emphasizes security best practices within a framework designed to streamline and secure deployments across AWS accounts. By integrating CloudFormation with a cross-account deployment strategy, organizations gain a high degree of control and efficiency, keeping their infrastructure both robust and flexible.

Proposed Architecture

The architecture diagram showcases an AWS deployment model that bridges the gap between development and production environments through a series of orchestrated services. It outlines how application code transitions from the development stage, facilitated by AWS CodeCommit, through a testing phase, and ultimately to production. The system uses AWS CodePipeline for continuous integration and delivery, leverages Amazon ECR for container image storage, and employs ECS with Fargate for container orchestration. It provides a clear, high-level view of the path an application takes from code commit to user delivery.

Prerequisites

To successfully implement the described infrastructure for deploying Docker images on Amazon ECS with a multi-account CodePipeline and a Blue/Green deployment strategy, the key prerequisites are:

- Create three separate AWS accounts: Development, Test, and Production.
- Install and configure the AWS Command Line Interface (CLI) and the relevant AWS SDKs for scripting and automation.
- Fork the aws-cicd-cross-account-deployment GitHub repo and add all the files to your CodeCommit repository.

Environment Setup

This guide leverages a suite of AWS services and tools, orchestrated to facilitate the seamless deployment of Docker images on Amazon Elastic Container Service (ECS) across multiple AWS accounts. Before you start setting up the environment, use this code repo for the relevant files mentioned in the steps below.

1. IAM Roles and Permissions

IAM roles: Create the IAM roles required for the deployment process. Use the cross-account.yaml template in CloudFormation to create cross-account IAM roles in the Test and Production accounts, granting the permissions needed for cross-account interactions.
YAML AWSTemplateFormatVersion: "2010-09-09" Parameters: CodeDeployRoleInThisAccount: Type: CommaDelimitedList Description: Names of existing Roles you want to add to the newly created Managed Policy DevelopmentAccCodePipelinKMSKeyARN: Type: String Description: ARN of the KMS key from the Development/Global Resource Account DevelopmentAccCodePipelineS3BucketARN: Type: String Description: ARN of the S3 Bucket used by CodePipeline in the Development/Global Resource Account DevelopmentAccNumber: Type: String Description: Account Number of the Development Resources Account Resources: CrossAccountAccessRole: Type: 'AWS::IAM::Role' Properties: AssumeRolePolicyDocument: Version: "2012-10-17" Statement: - Effect: Allow Principal: AWS: - !Join [ ":", [ "arn","aws","iam:",!Ref DevelopmentAccNumber,"root" ] ] Service: - codedeploy.amazonaws.com - codebuild.amazonaws.com Action: - 'sts:AssumeRole' Policies: - PolicyName: CrossAccountServiceAccess PolicyDocument: Version: "2012-10-17" Statement: - Effect: Allow Action: - 's3:List*' - 's3:Get*' - 's3:Describe*' Resource: '*' - Effect: Allow Action: - 's3:*' Resource: !Ref DevelopmentAccCodePipelineS3BucketARN - Effect: Allow Action: - 'codedeploy:*' - 'codebuild:*' - 'sns:*' - 'cloudwatch:*' - 'codestar-notifications:*' - 'chatbot:*' - 'ecs:*' - 'ecr:*' - 'codedeploy:Batch*' - 'codedeploy:Get*' - 'codedeploy:List*' Resource: '*' - Effect: Allow Action: - 'codedeploy:Batch*' - 'codedeploy:Get*' - 'codedeploy:List*' - 'kms:*' - 'codedeploy:CreateDeployment' - 'codedeploy:GetDeployment' - 'codedeploy:GetDeploymentConfig' - 'codedeploy:GetApplicationRevision' - 'codedeploy:RegisterApplicationRevision' Resource: '*' - Effect: Allow Action: - 'iam:PassRole' Resource: '*' Condition: StringLike: 'iam:PassedToService': ecs-tasks.amazonaws.com KMSAccessPolicy: Type: 'AWS::IAM::ManagedPolicy' Properties: PolicyDocument: Version: '2012-10-17' Statement: - Sid: AllowThisRoleToAccessKMSKeyFromOtherAccount Effect: Allow Action: - 'kms:DescribeKey' - 'kms:GenerateDataKey*' - 'kms:Encrypt' - 'kms:ReEncrypt*' - 'kms:Decrypt' Resource: !Ref DevelopmentAccCodePipelinKMSKeyARN Roles: !Ref CodeDeployRoleInThisAccount S3BucketAccessPolicy: Type: 'AWS::IAM::ManagedPolicy' Properties: PolicyDocument: Version: '2012-10-17' Statement: - Sid: AllowThisRoleToAccessS3inOtherAccount Effect: Allow Action: - 's3:Get*' Resource: !Ref DevelopmentAccCodePipelineS3BucketARN Effect: Allow Action: - 's3:ListBucket' Resource: !Ref DevelopmentAccCodePipelineS3BucketARN Roles: !Ref CodeDeployRoleInThisAccount 2. CodePipeline Configuration Stages and actions: Configure CodePipeline actions for source, build, and deploy stages by running the pipeline.yaml in CloudFormation. Source repository: Use CodeCommit as the source repository for all the files. Add all the files from the demo-app GitHub folder to the repository. 3. Networking Setup VPC Configuration: Utilize the vpc.yaml CloudFormation template to set up the VPC. Define subnets for different purposes, such as public and private. YAML Description: This template deploys a VPC, with a pair of public and private subnets spread across two Availability Zones. It deploys an internet gateway, with a default route on the public subnets. It deploys a pair of NAT gateways (one in each AZ), and default routes for them in the private subnets. 
Parameters: EnvVar: Description: An environment name that is prefixed to resource names Type: String VpcCIDR: #Description: Please enter the IP range (CIDR notation) for this VPC Type: String PublicSubnet1CIDR: Description: Please enter the IP range (CIDR notation) for the public subnet in the first Availability Zone Type: String PublicSubnet2CIDR: Description: Please enter the IP range (CIDR notation) for the public subnet in the second Availability Zone Type: String PrivateSubnet1CIDR: Description: Please enter the IP range (CIDR notation) for the private subnet in the first Availability Zone Type: String PrivateSubnet2CIDR: Description: Please enter the IP range (CIDR notation) for the private subnet in the second Availability Zone Type: String DBSubnet1CIDR: Description: Please enter the IP range (CIDR notation) for the private subnet in the first Availability Zone Type: String DBSubnet2CIDR: Description: Please enter the IP range (CIDR notation) for the private subnet in the second Availability Zone Type: String vpcname: #Description: Please enter the IP range (CIDR notation) for the private subnet in the second Availability Zone Type: String Resources: VPC: Type: AWS::EC2::VPC Properties: CidrBlock: !Ref VpcCIDR EnableDnsSupport: true EnableDnsHostnames: true Tags: - Key: Name Value: !Ref vpcname InternetGateway: Type: AWS::EC2::InternetGateway Properties: Tags: - Key: Name Value: !Ref EnvVar InternetGatewayAttachment: Type: AWS::EC2::VPCGatewayAttachment Properties: InternetGatewayId: !Ref InternetGateway VpcId: !Ref VPC PublicSubnet1: Type: AWS::EC2::Subnet Properties: VpcId: !Ref VPC AvailabilityZone: !Select [ 0, !GetAZs '' ] CidrBlock: !Ref PublicSubnet1CIDR MapPublicIpOnLaunch: true Tags: - Key: Name Value: !Sub ${EnvVar} Public Subnet (AZ1) PublicSubnet2: Type: AWS::EC2::Subnet Properties: VpcId: !Ref VPC AvailabilityZone: !Select [ 1, !GetAZs '' ] CidrBlock: !Ref PublicSubnet2CIDR MapPublicIpOnLaunch: true Tags: - Key: Name Value: !Sub ${EnvVar} Public Subnet (AZ2) PrivateSubnet1: Type: AWS::EC2::Subnet Properties: VpcId: !Ref VPC AvailabilityZone: !Select [ 0, !GetAZs '' ] CidrBlock: !Ref PrivateSubnet1CIDR MapPublicIpOnLaunch: false Tags: - Key: Name Value: !Sub ${EnvVar} Private Subnet (AZ1) PrivateSubnet2: Type: AWS::EC2::Subnet Properties: VpcId: !Ref VPC AvailabilityZone: !Select [ 1, !GetAZs '' ] CidrBlock: !Ref PrivateSubnet2CIDR MapPublicIpOnLaunch: false Tags: - Key: Name Value: !Sub ${EnvVar} Private Subnet (AZ2) DBSubnet1: Type: AWS::EC2::Subnet Properties: VpcId: !Ref VPC AvailabilityZone: !Select [ 0, !GetAZs '' ] CidrBlock: !Ref DBSubnet1CIDR MapPublicIpOnLaunch: false Tags: - Key: Name Value: !Sub ${EnvVar} DB Subnet (AZ1) DBSubnet2: Type: AWS::EC2::Subnet Properties: VpcId: !Ref VPC AvailabilityZone: !Select [ 1, !GetAZs '' ] CidrBlock: !Ref DBSubnet2CIDR MapPublicIpOnLaunch: false Tags: - Key: Name Value: !Sub ${EnvVar} DB Subnet (AZ2) NatGateway1EIP: Type: AWS::EC2::EIP DependsOn: InternetGatewayAttachment Properties: Domain: vpc NatGateway2EIP: Type: AWS::EC2::EIP DependsOn: InternetGatewayAttachment Properties: Domain: vpc NatGateway1: Type: AWS::EC2::NatGateway Properties: AllocationId: !GetAtt NatGateway1EIP.AllocationId SubnetId: !Ref PublicSubnet1 NatGateway2: Type: AWS::EC2::NatGateway Properties: AllocationId: !GetAtt NatGateway2EIP.AllocationId SubnetId: !Ref PublicSubnet2 PublicRouteTable: Type: AWS::EC2::RouteTable Properties: VpcId: !Ref VPC Tags: - Key: Name Value: !Sub ${EnvVar} Public Routes DefaultPublicRoute: Type: AWS::EC2::Route 
DependsOn: InternetGatewayAttachment Properties: RouteTableId: !Ref PublicRouteTable DestinationCidrBlock: 0.0.0.0/0 GatewayId: !Ref InternetGateway PublicSubnet1RouteTableAssociation: Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: !Ref PublicRouteTable SubnetId: !Ref PublicSubnet1 PublicSubnet2RouteTableAssociation: Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: !Ref PublicRouteTable SubnetId: !Ref PublicSubnet2 PrivateRouteTable1: Type: AWS::EC2::RouteTable Properties: VpcId: !Ref VPC Tags: - Key: Name Value: !Sub ${EnvVar} Private Routes (AZ1) DefaultPrivateRoute1: Type: AWS::EC2::Route Properties: RouteTableId: !Ref PrivateRouteTable1 DestinationCidrBlock: 0.0.0.0/0 NatGatewayId: !Ref NatGateway1 PrivateSubnet1RouteTableAssociation: Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: !Ref PrivateRouteTable1 SubnetId: !Ref PrivateSubnet1 PrivateRouteTable2: Type: AWS::EC2::RouteTable Properties: VpcId: !Ref VPC Tags: - Key: Name Value: !Sub ${EnvVar} Private Routes (AZ2) DefaultPrivateRoute2: Type: AWS::EC2::Route Properties: RouteTableId: !Ref PrivateRouteTable2 DestinationCidrBlock: 0.0.0.0/0 NatGatewayId: !Ref NatGateway2 PrivateSubnet2RouteTableAssociation: Type: AWS::EC2::SubnetRouteTableAssociation Properties: RouteTableId: !Ref PrivateRouteTable2 SubnetId: !Ref PrivateSubnet2 NoIngressSecurityGroup: Type: AWS::EC2::SecurityGroup Properties: GroupName: "no-ingress-sg" GroupDescription: "Security group with no ingress rule" VpcId: !Ref VPC Outputs: VPC: Description: A reference to the created VPC Value: !Ref VPC PublicSubnets: Description: A list of the public subnets Value: !Join [ ",", [ !Ref PublicSubnet1, !Ref PublicSubnet2 ]] PrivateSubnets: Description: A list of the private subnets Value: !Join [ ",", [ !Ref PrivateSubnet1, !Ref PrivateSubnet2 ]] PublicSubnet1: Description: A reference to the public subnet in the 1st Availability Zone Value: !Ref PublicSubnet1 PublicSubnet2: Description: A reference to the public subnet in the 2nd Availability Zone Value: !Ref PublicSubnet2 PrivateSubnet1: Description: A reference to the private subnet in the 1st Availability Zone Value: !Ref PrivateSubnet1 PrivateSubnet2: Description: A reference to the private subnet in the 2nd Availability Zone Value: !Ref PrivateSubnet2 NoIngressSecurityGroup: Description: Security group with no ingress rule Value: !Ref NoIngressSecurityGroup 4. ECS Cluster and Service Configuration ECS clusters: Create two ECS clusters: one in the Test account and one in the Production account. Service and task definitions: Create ECS services and task definitions in the Test Account using new-ecs-test-infra.yaml CloudFormation templates. YAML Parameters: privatesubnet1: Type: String privatesubnet2: Type: String Resources: ECSService: Type: AWS::ECS::Service # DependsOn: HTTPListener # DependsOn: HTTPSListener Properties: LaunchType: FARGATE Cluster: new-cluster DesiredCount: 0 TaskDefinition: new-taskdef-anycompany DeploymentController: Type: CODE_DEPLOY HealthCheckGracePeriodSeconds: 300 SchedulingStrategy: REPLICA NetworkConfiguration: AwsvpcConfiguration: AssignPublicIp: DISABLED Subnets: [!Ref privatesubnet1 , !Ref privatesubnet2] LoadBalancers: - TargetGroupArn: arn:aws:elasticloadbalancing:us-east-1:487269258483:targetgroup/TargetGroup1/6b75e9eb3289df56 ContainerPort: 80 ContainerName: anycompany-test Create ECS services and task definitions in the Test account using new-ecs-prod-infra.yaml CloudFormation templates. 
YAML

Parameters:
  privatesubnet1:
    Type: String
  privatesubnet2:
    Type: String
Resources:
  ECSService:
    Type: AWS::ECS::Service
    # DependsOn: HTTPListener
    # DependsOn: HTTPSListener
    Properties:
      LaunchType: FARGATE
      Cluster: new-cluster
      DesiredCount: 0
      TaskDefinition: new-anycompany-prod
      DeploymentController:
        Type: CODE_DEPLOY
      HealthCheckGracePeriodSeconds: 300
      SchedulingStrategy: REPLICA
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          Subnets: [!Ref privatesubnet1, !Ref privatesubnet2]
      LoadBalancers:
        - TargetGroupArn: arn:aws:elasticloadbalancing:us-east-1:608377680862:targetgroup/TargetGroup1/d18c87e013000697
          ContainerPort: 80
          ContainerName: anycompany-test

5. CodeDeploy Blue/Green Deployment

- CodeDeploy configuration: Configure CodeDeploy for Blue/Green deployments.
- Deployment groups: Create specific deployment groups for each environment.
- Deployment configurations: Configure deployment configurations based on your requirements.

6. Notification Setup (SNS)

- SNS configuration: Manually create an SNS topic for notifications during the deployment process.
- Notification content: Configure SNS to send notifications for manual approval steps in the deployment pipeline.

Pipeline and Deployment

1. Source Stage: CodePipeline starts with the source stage, pulling the application source and deployment files from the CodeCommit repository.
2. Build Stage: The build stage builds and packages the Docker images and prepares them for deployment.
3. Deployment to Development: Upon approval, the pipeline deploys the Docker images to the ECS cluster in the Development account using a Blue/Green deployment strategy.
4. Testing in Development: The deployed application in the Development environment undergoes testing and validation.
5. Deployment to Test: If testing in the Development environment is successful, the pipeline triggers the deployment to the ECS cluster in the Test account using the same Blue/Green strategy.
6. Testing in Test: The application undergoes further testing in the Test environment.
7. Manual Approval: After successful testing in the Test environment, the pipeline triggers an SNS notification and requires manual approval to proceed.
8. Deployment to Production: After approval, the pipeline deploys to the ECS cluster in the Production account using the Blue/Green strategy.
9. Final Testing in Production: The application undergoes final testing in the Production environment.
10. Completion: The pipeline completes, and the new version of the application is running in the Production environment.

Conclusion

In this guide, we've explored a strategic approach to deploying Docker images across multiple AWS accounts using a combination of ECS, CodePipeline, CodeDeploy, and the reliability of Blue/Green deployment strategies, all through the power of AWS CloudFormation. This methodology not only enhances security and operational efficiency but also provides a scalable infrastructure capable of supporting growth. By following the steps outlined, organizations can fortify their deployment processes, embrace the agility of Infrastructure as Code, and maintain a robust and adaptable cloud environment. Implementing this guide's recommendations allows businesses to optimize costs by utilizing AWS services such as Fargate and embracing DevOps practices. The Blue/Green deployment strategy minimizes downtime, ensuring resources are utilized efficiently during transitions. With a focus on DevOps practices and the use of automation tools like AWS CodePipeline, operational overhead is minimized.
CloudFormation templates automate resource provisioning, reducing manual intervention and ensuring consistent and repeatable deployments.