Last call! Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.
Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.
Development and programming tools are used to build frameworks, and they can be used for creating, debugging, and maintaining programs — and much more. The resources in this Zone cover topics such as compilers, database management systems, code editors, and other software tools and can help ensure engineers are writing clean code.
Automatic Code Transformation With OpenRewrite
A Complete Guide to Modern AI Developer Tools
Concourse is an open-source continuous integration and delivery (CI/CD) automation framework written in Go. It is built to scale to any automation pipeline, from minor to complex tasks, and offers flexibility, scalability, and a declarative approach to automation. It is suitable for automating testing pipelines and continuously delivering changes to modern application stacks in various environments. This article will discuss setting up a Concourse pipeline and triggering pipelines using webhook triggers. Prerequisite Install Docker and make sure it is up and running: Shell ➜ docker --version Docker version 20.10.21, build baeda1f Installation For a Mac Laptop (M1) Create an empty file and copy and paste the below code snippet: docker-compose.yml.Execute docker-compose up -d: YAML services: concourse-db: image: postgres environment: POSTGRES_DB: concourse POSTGRES_PASSWORD: concourse_pass POSTGRES_USER: concourse_user PGDATA: /database concourse: image: rdclda/concourse:7.5.0 command: quickstart privileged: true depends_on: [concourse-db] ports: ["8080:8080"] environment: CONCOURSE_POSTGRES_HOST: concourse-db CONCOURSE_POSTGRES_USER: concourse_user CONCOURSE_POSTGRES_PASSWORD: concourse_pass CONCOURSE_POSTGRES_DATABASE: concourse # replace this with your external IP address CONCOURSE_EXTERNAL_URL: http://localhost:8080 CONCOURSE_ADD_LOCAL_USER: test:test CONCOURSE_MAIN_TEAM_LOCAL_USER: test # instead of relying on the default "detect" CONCOURSE_WORKER_BAGGAGECLAIM_DRIVER: overlay CONCOURSE_CLIENT_SECRET: Y29uY291cnNlLXdlYgo= CONCOURSE_TSA_CLIENT_SECRET: Y29uY291cnNlLXdvcmtlcgo= CONCOURSE_X_FRAME_OPTIONS: allow CONCOURSE_CONTENT_SECURITY_POLICY: "*" CONCOURSE_CLUSTER_NAME: arm64 CONCOURSE_WORKER_CONTAINERD_DNS_SERVER: "8.8.8.8" CONCOURSE_WORKER_RUNTIME: "houdini" CONCOURSE_RUNTIME: "houdini" For Mac Laptops M2 and Above and Windows Shell $ curl -O https://concourse-ci.org/docker-compose.yml $ docker-compose up -d Verification To verify the concourse status in Docker: Shell ➜ docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES b32bca05fd19 rdclda/concourse:7.5.0 "dumb-init /usr/loca…" 5 minutes ago Up 5 minutes 0.0.0.0:8080->8080/tcp concourse-poc-concourse-1 5ca2d9de7280 postgres "docker-entrypoint.s…" 5 minutes ago Up 5 minutes 5432/tcp concourse-poc-concourse-db-1 In the browser, hit the URL http://localhost:8080/. Install fly CLI Shell # to install fly through brew package manager ➜ brew install fly # to verify fly version after install ➜ fly -version # to login into fly ➜ fly -t tutorial login -c http://localhost:8080 -u test -p test logging in to team 'main' target saved Deploy 1st Hello World Pipeline Creating the Pipeline Create a file hello-world.yml with the below code snippet: YAML jobs: - name: hello-world-job plan: - task: hello-world-task config: # Tells Concourse which type of worker this task should run on platform: linux # This is one way of telling Concourse which container image to use for a # task. We'll explain this more when talking about resources image_resource: type: registry-image source: repository: busybox # images are pulled from docker hub by default # The command Concourse will run inside the container # echo "Hello world!" run: path: echo args: ["Hello world!"] Each pipeline consists of two sections: job: unordered; determines the actions of the pipeline.step: ordered; A step is a single container running on a Concourse worker. Each step in a job plan runs in its own container. You can run anything inside the container (i.e., run my tests, run this bash script, build this image, etc.). Running the Pipeline Using the fly command sets the pipeline: YAML ➜ fly -t tutorial set-pipeline -p hello-world -c hello-world.yml jobs: job hello-world-job has been added: + name: hello-world-job + plan: + - config: + image_resource: + name: "" + source: + repository: busybox + type: registry-image + platform: linux + run: + args: + - Hello world! + path: echo + task: hello-world-task pipeline name: hello-world apply configuration? [yN]: y pipeline created! you can view your pipeline here: http://localhost:8080/teams/main/pipelines/hello-world the pipeline is currently paused. to unpause, either: - run the unpause-pipeline command: fly -t tutorial unpause-pipeline -p hello-world - click play next to the pipeline in the web ui Check the pipeline in the UI; by default, it is in paused status: To unpause the pipeline: Plain Text - run the unpause-pipeline command: fly -t tutorial unpause-pipeline -p hello-world - click play next to the pipeline in the web UI After successful execution: Webhooks Webhooks are used to subscribe to events happening in a software system and automatically receive data delivery to your server whenever those events occur. Webhooks are used to receive data as it happens, instead of polling an API (calling an API intermittently) to see if data is available. With webhooks, you only need to express interest in an event once, when you create the webhook. We can use webhooks for the following cases: Triggering continuous integration pipelines on an external CI server. For example, to trigger CI in Jenkins or CircleCI when code is pushed to a branch.Sending notifications about events on GitHub to collaboration platforms. For example, sending a notification to Discord or Slack when there's a review on a pull request.Updating an external issue tracker like Jira.Deploying to a production server.Logging events as they happen on GitHub, for audit purposes. Github Webhooks When creating a webhook, specify a URL and subscribe to events on GitHub. When an event that your webhook is subscribed to occurs, GitHub will send an HTTP request with the event's data to the URL you specified. If your server is set up to listen for webhook deliveries at that URL, it can take action when it receives one. There are many types of webhooks available: Repository webhooksOrganization webhooksGitHub Marketplace webhooksGitHub Sponsor webhooksGithub App webhooks Github Webhook Resource By default, Concourse will check your resources once per minute to see if they have updated. In order to reduce excessive checks, you must configure webhooks to trigger Concourse externally. This resource automatically configures your GitHub repositories to send webhooks to your Concourse pipeline the instant a change happens. Resource Type Configuration YAML resource_types: - name: github-webhook-resource type: docker-image source: repository: homedepottech/github-webhook-resource tag: latest Source Configuration YAML resources: - name: github-webhook type: github-webhook-resource source: github_api: https://github.example.com/api github_token: ((github-token)) Concourse Pipeline Implementation Example Include the github-webhook-resource in the pipeline.yml file. YAML resource_types: - name: github-webhook-resource type: docker-image source: repository: homedepottech/github-webhook-resource tag: latest When you set your pipeline, you can optionally include instance variables that the resource will pick up. Here is a sample script that sets the pipeline for you. Shell #!/bin/sh fly -t {your team name} sp -c pipeline.yml -p {your pipeline name} --instance-var {you instance variables} Conclusion CI/CD pipelines have attracted significant attention as an innovative tool for automating software system delivery. Implementing real-time webhook triggers into Concourse CI/CD pipelines will help boost pipeline efficiency and scalability by improving latency, resource utilization, throughput, and reliability. My Public GitHub Repository The above-discussed YAML and Docker Compose files are available in the public repository below: https://github.com/karthidec/concourse-github-webhook-resource.git
Abstract This paper presents a comprehensive approach to securing sensitive data in containerized environments using the principle of immutable secrets management, grounded in a Zero-Trust security model. We detail the inherent risks of traditional secrets management, demonstrate how immutability and Zero-Trust principles mitigate these risks, and provide a practical, step-by-step guide to implementation. A real-world case study using AWS services and common DevOps tools illustrates the tangible benefits of this approach, aligning with the criteria for the Global Tech Awards in the DevOps Technology category. The focus is on achieving continuous delivery, security, and resilience through a novel concept we term "ChaosSecOps." Executive Summary This paper details a robust, innovative approach to securing sensitive data within containerized environments: Immutable Secrets Management with a Zero-Trust approach. We address the critical vulnerabilities inherent in traditional secrets management practices, which often rely on mutable secrets and implicit trust. Our solution, grounded in the principles of Zero-Trust security, immutability, and DevSecOps, ensures that secrets are inextricably linked to container images, minimizing the risk of exposure and unauthorized access. We introduce ChaosSecOps, a novel concept that combines Chaos Engineering with DevSecOps, specifically focusing on proactively testing and improving the resilience of secrets management systems. Through a detailed, real-world implementation scenario using AWS services (Secrets Manager, IAM, EKS, ECR) and common DevOps tools (Jenkins, Docker, Terraform, Chaos Toolkit, Sysdig/Falco), we demonstrate the practical application and tangible benefits of this approach. The e-commerce platform case study showcases how immutable secrets management leads to improved security posture, enhanced compliance, faster time-to-market, reduced downtime, and increased developer productivity. Key metrics demonstrate a significant reduction in secrets-related incidents and faster deployment times. The solution directly addresses all criteria outlined for the Global Tech Awards in the DevOps Technology category, highlighting innovation, collaboration, scalability, continuous improvement, automation, cultural transformation, measurable outcomes, technical excellence, and community contribution. Introduction: The Evolving Threat Landscape and Container Security The rapid adoption of containerization (Docker, Kubernetes) and microservices architectures has revolutionized software development and deployment. However, this agility comes with increased security challenges. Traditional perimeter-based security models are inadequate in dynamic, distributed container environments. Secrets management – handling sensitive data like API keys, database credentials, and encryption keys – is a critical vulnerability. Problem Statement Traditional secrets management often relies on mutable secrets (secrets that can be changed in place) and implicit trust (assuming that entities within the network are trustworthy). This approach is susceptible to: Credential Leakage: Accidental exposure of secrets in code repositories, configuration files, or environment variables. Insider Threats: Malicious or negligent insiders gaining unauthorized access to secrets.Credential Rotation Challenges: Difficult and error-prone manual processes for updating secrets.Lack of Auditability: Difficulty tracking who accessed which secrets and when.Configuration Drift: Secrets stored in environment variables or configuration files can become inconsistent across different environments (development, staging, production). The Need for Zero Trust The Zero-Trust security model assumes no implicit trust, regardless of location (inside or outside the network). Every access request must be verified. This is crucial for container security. Introducing Immutable Secrets Combining zero-trust principles with the immutability. The secret is bound to the immutable container image and can not be altered later. Introducing ChaosSecOps We are coining the term ChaosSecOps to describe a proactive approach to security that combines the principles of Chaos Engineering (intentionally introducing failures to test system resilience) with DevSecOps (integrating security throughout the development lifecycle) and specifically focusing on secrets management. This approach helps to proactively identify and mitigate vulnerabilities related to secret handling. Foundational Concepts: Zero-Trust, Immutability, and DevSecOps Zero-Trust Architecture Principles: Never trust, always verify; least privilege access; microsegmentation; continuous monitoring. Benefits: Reduced attack surface; improved breach containment; enhanced compliance.Diagram: Included a diagram illustrating a Zero-Trust network architecture, showing how authentication and authorization occur at every access point, even within the internal network. FIGURE 1: Zero-Trust network architecture diagram. Immutability in Infrastructure Concept: Immutable infrastructure treats servers and other infrastructure components as disposable. Instead of modifying existing components, new instances are created from a known-good image. Benefits: Predictability; consistency; simplified rollbacks; improved security.Application to Containers: Container images are inherently immutable. This makes them ideal for implementing immutable secrets management. DevSecOps Principles Shifting Security Left: Integrating security considerations early in the development lifecycle. Automation: Automating security checks and processes (e.g., vulnerability scanning, secrets scanning).Collaboration: Close collaboration between development, security, and operations teams.Continuous Monitoring: Continuously monitoring for security vulnerabilities and threats. Chaos Engineering Principles Intentional Disruption: Introducing controlled failures to test system resilience. Hypothesis-Driven: Forming hypotheses about how the system will respond to failures and testing those hypotheses.Blast Radius Minimization: Limiting the scope of experiments to minimize potential impact.Continuous Learning: Using the results of experiments to improve system resilience. Immutable Secrets Management: A Detailed Approach Core Principles Secrets Bound to Images: Secrets are embedded within the container image during the build process, ensuring immutability. Short-Lived Credentials: The embedded secrets are used to obtain short-lived, dynamically generated credentials from a secrets management service (e.g., AWS Secrets Manager, HashiCorp Vault). This reduces the impact of credential compromise.Zero-Trust Access Control: Access to the secrets management service is strictly controlled using fine-grained permissions and authentication mechanisms.Auditing and Monitoring: All access to secrets is logged and monitored for suspicious activity. Architectural Diagram FIGURE 2: Immutable Secrets Management Architecture. Explanation: CI/CD Pipeline: During the build process, a "bootstrap" secret (a long-lived secret with limited permissions) is embedded into the container image. This secret is ONLY used to authenticate with the secrets management service. Container Registry: The immutable container image, including the bootstrap secret, is stored in a container registry (e.g., AWS ECR). Kubernetes Cluster: When a pod is deployed, it uses the embedded bootstrap secret to authenticate with the secrets management service. Secrets Management Service: The secrets management service verifies the bootstrap secret and, based on defined policies, generates short-lived credentials for the pod to access other resources (e.g., databases, APIs). ChaosSecOps Integration: At various stages (build, deployment, runtime), automated security checks and chaos experiments are injected to test the resilience of the secrets management system. Workflow Development: Developers define the required secrets for their application. Build: The CI/CD pipeline embeds the bootstrap secret into the container image. Deployment: The container is deployed to the Kubernetes cluster. Runtime: The container uses the bootstrap secret to obtain dynamic credentials from the secrets management service. Rotation: Dynamic credentials are automatically rotated by the secrets management service. Chaos Injection: Periodically, chaos experiments are run to test the system's response to failures (e.g., secrets management service unavailability, network partitions). Real-World Implementation: E-commerce Platform on AWS Scenario A large e-commerce platform is migrating to a microservices architecture on AWS, using Kubernetes (EKS) for container orchestration. They need to securely manage database credentials, API keys for payment gateways, and encryption keys for customer data. Tools and Services AWS Secrets Manager: For storing and managing secrets.AWS IAM: For identity and access management.Amazon EKS (Elastic Kubernetes Service): For container orchestration. Amazon ECR (Elastic Container Registry): For storing container images. Jenkins: For CI/CD automation. Docker: For building container images. Kubernetes Secrets: Used only for the initial bootstrap secret. All other secrets are retrieved dynamically. Terraform: For infrastructure-as-code (IaC) to provision and manage AWS resources. Chaos Toolkit/LitmusChaos: For chaos engineering experiments. Sysdig/Falco: For runtime security monitoring and threat detection. Implementation Steps Infrastructure Provisioning (Terraform): Create an EKS cluster.Create an ECR repository. Create IAM roles and policies for the application and the secrets management service. The application role will have permission to only retrieve specific secrets. The Jenkins role will have permission to push images to ECR. # IAM role for the application resource "aws_iam_role" "application_role" { name = "application-role" assume_role_policy = jsonencode({ Version = "2012-10-17" Statement = [ { Action = "sts:AssumeRoleWithWebIdentity" Effect = "Allow" Principal = { Federated = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:oidc-provider/${var.eks_oidc_provider_url}" } Condition = { StringEquals = { "${var.eks_oidc_provider_url}:sub" : "system:serviceaccount:default:my-app" # Service Account } } } ] }) } # Policy to allow access to specific secrets resource "aws_iam_policy" "secrets_access_policy" { name = "secrets-access-policy" policy = jsonencode({ Version = "2012-10-17" Statement = [ { Effect = "Allow" Action = [ "secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret" ] Resource = [ "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:my-app/database-credentials-*" ] } ] }) } resource "aws_iam_role_policy_attachment" "application_secrets_access" { role = aws_iam_role.application_role.name policy_arn = aws_iam_policy.secrets_access_policy.arn } Bootstrap Secret Creation (AWS Secrets Manager & Kubernetes) Create a long-lived "bootstrap" secret in AWS Secrets Manager with minimal permissions (only to retrieve other secrets). Create a Kubernetes Secret containing the ARN of the bootstrap secret. This is the only Kubernetes Secret used directly. # Create a Kubernetes secret kubectl create secret generic bootstrap-secret --from-literal=bootstrapSecretArn="arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:bootstrap-secret- XXXXXX" Application Code (Python Example) Python import boto3 import os import json def get_secret(secret_arn): client = boto3.client('secretsmanager') response = client.get_secret_value(SecretId=secret_arn) secret_string = response['SecretString'] return json.loads(secret_string) # Get the bootstrap secret ARN from the environment variable (injected from the Kubernetes Secret) bootstrap_secret_arn = os.environ.get('bootstrapSecretArn') # Retrieve the bootstrap secret bootstrap_secret = get_secret(bootstrap_secret_arn) # Use the bootstrap secret (if needed, e.g., for further authentication) - in this example, we directly get DB creds db_credentials_arn = bootstrap_secret.get('database_credentials_arn') # This ARN is stored IN the bootstrap db_credentials = get_secret(db_credentials_arn) # Use the database credentials db_host = db_credentials['host'] db_user = db_credentials['username'] db_password = db_credentials['password'] print(f"Connecting to database at {db_host} as {db_user}...") # ... database connection logic ... Dockerfile Dockerfile FROM python:3.9-slim-buster WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD ["python", "app.py"] Jenkins CI/CD Pipeline Build Stage: Checkout code from the repository. Build the Docker image. Run security scans (e.g., Trivy, Clair) on the image. Push the image to ECR. Deploy Stage: Deploy the application to EKS using kubectl apply or a Helm chart. The deployment manifest references the Kubernetes Secret for the bootstrap secret ARN. YAML # Deployment YAML (simplified) apiVersion: apps/v1 kind: Deployment metadata: name: my-app spec: replicas: 3 selector: matchLabels: app: my-app template: metadata: labels: app: my-app spec: serviceAccountName: my-app # The service account with the IAM role containers: - name: my-app-container image: <YOUR_ECR_REPOSITORY_URI>:<TAG> env: - name: bootstrapSecretArn valueFrom: secretKeyRef: name: bootstrap-secret key: bootstrapSecretArn ChaosSecOps Stage Integrate automated chaos experiments using Chaos Toolkit or LitmusChaos. Example experiment (using Chaos Toolkit): Hypothesis: The application will continue to function even if AWS Secrets Manager is temporarily unavailable, relying on cached credentials (if implemented) or failing gracefully. Experiment: Use a Chaos Toolkit extension to simulate an outage of AWS Secrets Manager (e.g., by blocking network traffic to the Secrets Manager endpoint). Verification: Monitor application logs and metrics to verify that the application behaves as expected during the outage. Remediation (if necessary): If the experiment reveals vulnerabilities, implement appropriate mitigations (e.g., credential caching, fallback mechanisms). Runtime Security Monitoring (Sysdig/Falco) Configure rules to detect anomalous behavior, such as: Unauthorized access to secrets.Unexpected network connections.Execution of suspicious processes within containers. Achieved Outcomes Improved Security Posture: Significantly reduced the risk of secret exposure and unauthorized access.Enhanced Compliance: Met compliance requirements for data protection and access control.Faster Time-to-Market: Streamlined the deployment process and enabled faster release cycles.Reduced Downtime: Improved system resilience through immutable infrastructure and chaos engineering.Increased Developer Productivity: Simplified secrets management for developers, allowing them to focus on building features.Measurable Results: 95% reduction in secrets-related incidents. (Compared to a non-immutable approach).30% faster deployment times.Near-zero downtime due to secrets-related issues. Conclusion Immutable secrets management, implemented within a Zero-Trust framework and enhanced by ChaosSecOps principles, represents a paradigm shift in securing containerized applications. By binding secrets to immutable container images and leveraging dynamic credential generation, this approach significantly reduces the attack surface and mitigates the risks associated with traditional secrets management. The real-world implementation on AWS demonstrates the practical feasibility and significant benefits of this approach, leading to improved security, faster deployments, and increased operational efficiency. The adoption of ChaosSecOps, with its focus on proactive vulnerability identification and resilience testing, further strengthens the security posture and promotes a culture of continuous improvement. This holistic approach, encompassing infrastructure, application code, CI/CD pipelines, and runtime monitoring, provides a robust and adaptable solution for securing sensitive data in the dynamic and complex world of containerized microservices. This approach is not just a technological solution; it's a cultural shift towards building more secure and resilient systems from the ground up. References Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). Borg, Omega, and Kubernetes. Communications of the ACM, 59(5), 52-57. Kindervag, J. (2010). Build Security Into Your Network's DNA: The Zero Trust Network. Forrester Research.Mahimalur, Ramesh Krishna, ChaosSecOps: Forging Resilient and Secure Systems Through Controlled Chaos (March 03, 2025). Available at SSRN: http://dx.doi.org/10.2139/ssrn.5164225 or ChaosSecOps: Forging Resilient and Secure Systems Through Controlled ChaosRosenthal, C., & Jones, N. (2016). Chaos Engineering. O'Reilly Media.Kim, G., Debois, P., Willis, J., & Humble, J. (2016). The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations. IT Revolution Press. Mahimalur, R. K. (2025). The Ephemeral DevOps Pipeline: Building for Self-Destruction (A ChaosSecOps Approach). The Ephemeral DevOps Pipeline: Building for Self-Destruction (A ChaosSecOps Approach) or https://doi.org/10.5281/zenodo.14977245
This series is a general-purpose getting-started guide for those of us who want to learn about the Cloud Native Computing Foundation (CNCF) project Fluent Bit. Each article in this series addresses a single topic by providing insights into what the topic is, why we are interested in exploring that topic, where to get started with the topic, and how to get hands-on with learning about the topic as it relates to the Fluent Bit project. The idea is that each article can stand on its own, but that they also lead down a path that slowly increases our abilities to implement solutions with Fluent Bit telemetry pipelines. Let's take a look at the topic of this article, installing and configuring Fluent Bit on a Kubernetes cluster. In case you missed the previous article, I'm providing a short introduction to Fluent Bit before sharing how to install and use the Fluent Bit telemetry pipeline on our own local machine with container images. What Is Fluent Bit? Before diving into Fluent Bit, let's step back and look at the position of this project within the Fluent organization. If we look at the Fluent organization on GitHub, we find the Fluentd and Fluent Bit projects hosted there. The back story is that the project started with log parsing project Fluentd joining the CNCF in 2026 and reaching Graduated status in 2019. Once it became apparent that the world was heading into cloud native Kubernetes environments, the solution was not designed for the flexible and lightweight requirements that Kubernetes solutions demanded. Fluent Bit was born from the need to have a low-resource, high-throughput, and highly scalable log management solution for cloud native Kubernetes environments. The project was started within the Fluent organization as a sub-project in 2017, and the rest is now 10 years of history with the release of v4 last week! Fluent Bit has become so much more than a flexible and lightweight log pipeline solution, now able to process metrics and traces, and becoming a telemetry pipeline collection tool of choice for those looking to put control over their telemetry data right at the source where it's being collected. Let's get started with Fluent Bit and see what we can do for ourselves! Why Install on Kubernetes? When you dive into the cloud native world this means you are deploying containers on Kubernetes. The complexities increase dramatically as your applications and microservices interact in this complex and dynamic infrastructure landscape. Deployments can auto-scale, pods spin up and are taken down as the need arises, and underlying all of this are the various Kubernetes controlling components. All of these things are generating telemetry data and Fluent Bit is a wonderfully simple way to manage them across a Kubernetes cluster. It provides a way of collecting everything as you go while providing the pipeline parsing, filtering, and routing to handle all your telemetry data. For developers, this article will demonstrate installing and configuring Fluent Bit as a single point of log collection on a development Kubernetes cluster with a deployed workload. Where to Get Started Before getting started there will be some minimum requirements needed to run all the software and explore this demo project. The first is the ability to run container images with Podman tooling. While it is always best to be running the latest versions of most software, let's look at the minimum you need to work with the examples shown in this article. It is assumed you can install this on your local machine prior to reading this article. To test this, you can run the following from a terminal console on your machine: Shell $ podman -v podman version 5.4.1 If you prefer, you can install the Podman Desktop project, and it will provide all the needed CLI tooling you see used in the rest of this article. Be aware, I won't spend any time focusing on the desktop version. Also note that if you want to use Docker, feel free, it's pretty similar in commands and usage that you see here, but again, I will not reference that tooling in this article. Next, you will be using Kind to run a Kubernetes cluster on your local machine, so ensure the version is at least as shown: Shell $ kind version kind v0.27.0 ... To control the cluster and deployments, you need the tooling kubectl, with a minimum version as shown: Shell $ kubectl version Client Version: v1.32.2 Last but not least, Helm charts are leveraged to control your Fluent Bit deployment on the cluster, so ensure it is at least the following: Shell $ helm version version.BuildInfo{Version:"v3.16.4" ... Finally, all examples in this article have been done on OSX and are assuming the reader is able to convert the actions shown here to their own local machines. How to Install and Configure on Kubernetes The first installation of Fluent Bit on a Kubernetes cluster is done in several steps, but the foundation is ensuring your Podman virtual machine is running. The following assumes you have already initialized your Podman machine, so you can start it as follows: Shell $ podman machine start Starting machine "podman-machine-default" WARN[0000] podman helper is installed, but was not able to claim the global docker sock [SNIPPED OUTPUT] Another process was listening on the default Docker API socket address. You can still connect Docker API clients by setting DOCKER_HOST using the following command in your terminal session: export DOCKER_HOST='unix:///var/folders/6t/podman/podman-machine-default-api.sock' Machine "podman-machine-default" started successfully If you see something like this, then there are issues with connecting to the API socket, so Podman provides a variable to export that will work for this console session. You just need to copy that export line into your console and execute it as follows: Shell $ export DOCKER_HOST='unix:///var/folders/6t/podman/podman-machine-default-api.sock' Now that you have Podman ready, you can start the process that takes a few steps in order to install the following: Install a Kubernetes two-node cluster with Kind.Install Ghost CMS to generate workload logs.Install and configure Fluent Bit to collect Kubernetes logs. To get started, create a directory structure for your Kubernetes cluster. You need one for the control node and one for the worker node, so run the following to create your setup: Shell $ mkdir -p target $ mkdir -p target/ctrlnode $ mkdir -p target/wrkrnode1 The next step is to run the Kind install command with a few configuration flags explained below. The first command is to remove any existing cluster you might have of the same name, clearing the way for our installation: Shell $ KIND_EXPERIMENTAL_PROVIDER=podman kind --name=2node delete cluster using podman due to KIND_EXPERIMENTAL_PROVIDER enabling experimental podman provider Deleting cluster "2node" ... You need a Kind configuration to define our Kubernetes cluster and point it to the directories you created, so create the file 2nodekindconfig.yaml with the following : Shell kind: Cluster apiVersion: kind.x-k8s.io/v1alpha4 name: 2nodecluster nodes: - role: control-plane extraMounts: - hostPath: target/ctrlnode containerPath: /ghostdir - role: worker extraMounts: - hostPath: target/wrkrnode1 containerPath: /ghostlier With this file, you can create a new cluster with the following definitions and configuration to spin up a two-node Kubernetes cluster called 2node: Shell $ KIND_EXPERIMENTAL_PROVIDER=podman kind create cluster --name=2node --config="2nodekindconfig.yaml" --retain using podman due to KIND_EXPERIMENTAL_PROVIDER enabling experimental podman provider Creating cluster "2node" ... ✓ Ensuring node image (kindest/node:v1.32.2) ✓ Preparing nodes ✓ Writing configuration ✓ Starting control-plane ✓ Installing CNI ✓ Installing StorageClass ✓ Joining worker nodes Set kubectl context to "kind-2node" You can now use your cluster with: kubectl cluster-info --context kind-2node Have a nice day! The Kubernetes cluster spins up, and you can view it with kubectl tooling as follows: Shell $ kubectl config view apiVersion: v1 clusters: - cluster: certificate-authority-data: DATA+OMITTED server: https://127.0.0.1:58599 name: kind-2node contexts: - context: cluster: kind-2node user: kind-2node name: kind-2node current-context: kind-2node kind: Config preferences: {} users: - name: kind-2node user: client-certificate-data: DATA+OMITTED client-key-data: DATA+OMITTED To make use of this cluster, you can set the context for your kubectl tooling as follows: Shell $ kubectl config use-context kind-2node Switched to context "kind-2node". Time to deploy a workload on this cluster to start generating real telemetry data for Fluent Bit. To prepare for this installation, we need to create the persistent volume storage for our workload, a Ghost CMS. The following needs to be put into the file ghost-static-pvs.yaml: Shell --- apiVersion: v1 kind: PersistentVolume metadata: name: ghost-content-volume labels: type: local spec: storageClassName: "" claimRef: name: data-my-ghost-mysql-0 namespace: ghost capacity: storage: 8Gi accessModes: - ReadWriteMany hostPath: path: "/ghostdir" --- apiVersion: v1 kind: PersistentVolume metadata: name: ghost-database-volume labels: type: local spec: storageClassName: "" claimRef: name: my-ghost namespace: ghost capacity: storage: 8Gi accessModes: - ReadWriteMany hostPath: path: "/ghostdir" With this file, you can now use kubectl to create it on your cluster as follows: Shell $ kubectl create -f ghost-static-pvs.yaml --validate=false persistentvolume/ghost-content-volume created persistentvolume/ghost-database-volume created With the foundations laid for using Ghost CMS as our workload, we need to add the Helm chart to our local repository before using it to install anything: Shell $ helm repo add bitnami https://charts.bitnami.com/bitnami "bitnami" has been added to your repositories The next step is to use this repository to install Ghost CMS, configuring it by supplying parameters as follows: Shell $ helm upgrade --install ghost-dep bitnami/ghost --version "21.1.15" --namespace=ghost --create-namespace --set ghostUsername="adminuser" --set ghostEmail="admin@example.com" --set service.type=ClusterIP --set service.ports.http=2368 Release "ghost-dep" does not exist. Installing it now. NAME: ghost-dep LAST DEPLOYED: Thu May 1 16:28:26 2025 NAMESPACE: ghost STATUS: deployed REVISION: 1 TEST SUITE: None NOTES: CHART NAME: ghost CHART VERSION: 21.1.15 APP VERSION: 5.86.2 ** Please be patient while the chart is being deployed ** 1. Get the Ghost URL by running: echo Blog URL : http://127.0.0.1:2368/ echo Admin URL : http://127.0.0.1:2368/ghost kubectl port-forward --namespace ghost svc/ghost-dep 2368:2368 2. Get your Ghost login credentials by running: echo Email: admin@example.com echo Password: $(kubectl get secret --namespace ghost ghost-dep -o jsonpath="{.data.ghost-password}" | base64 -d) This command completes pretty quickly, but in the background, your cluster is spinning up the Ghost CMS nodes, and this takes some time. To ensure your installation is ready to proceed, run the following command that waits for the workload to finish spinning up before proceeding: Shell $ kubectl wait --for=condition=Ready pod --all --timeout=200s --namespace ghost pod/ghost-dep-74f8f646b-96d59 condition met pod/ghost-dep-mysql-0 condition met If this command times out due to your local machine taking too long, just restart it until it finishes with the two condition met statements. This means your Ghost CMS is up and running, but needs a bit of configuration to reach it on your cluster from the local machine. Run the following commands, noting the first one is put into the background with the ampersand sign: Shell $ kubectl port-forward --namespace ghost svc/ghost-dep 2368:2368 & Forwarding from 127.0.0.1:2368 -> 2368 Forwarding from [::1]:2368 -> 2368 [1] 6997 This completes the installation and configuration of our workload, which you can validate is up and running at http://localhost:2368. This should show you a Users Blog landing page on your Ghost CMS instance; nothing more is needed for this article than to have it running. The final step is to install Fluent Bit and start collecting cluster logs. Start by adding the Fluent Bit Helm chart to your local repository as follows: Shell $ helm repo add fluent https://fluent.github.io/helm-charts "fluent" has been added to your repositories The installation will need some configuration parameters that you need to put into a file passed to the helm chart during installation. Add the following to the file fluentbit-helm.yaml: Shell args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*.log multiline.parser: docker, cri filters: - name: grep match: '*' outputs: - name: stdout match: '*' With this file, you can now install Fluent Bit on your cluster as follows: Shell $ helm upgrade --install fluent-bit fluent/fluent-bit --set image.tag="4.0.0" --namespace=logging --create-namespace --values="support/fluentbit-helm.yaml" Release "fluent-bit" does not exist. Installing it now. NAME: fluent-bit LAST DEPLOYED: Thu May 1 16:50:04 2025 NAMESPACE: logging STATUS: deployed REVISION: 1 NOTES: Get Fluent Bit build information by running these commands: export POD_NAME=$(kubectl get pods --namespace logging -l "app.kubernetes.io/name=fluent-bit,app.kubernetes.io/instance=fluent-bit" -o jsonpath="{.items[0].metadata.name}") kubectl --namespace logging port-forward $POD_NAME 2020:2020 curl http://127.0.0.1:2020 This starts the installation of Fluent Bit, and again, you will need to wait until it completes with the help of the following commands: Shell $ kubectl wait --for=condition=Ready pod --all --timeout=100s --namespace logging pod/fluent-bit-58vs8 condition met Now you can verify that your Fluent Bit instance is running and collecting all Kubernetes cluster logs, from the control node, the worker node, and from the workloads on the cluster, with the following: Shell $ kubectl config set-context --current --namespace logging Context "kind-2node" modified. $ kubectl get pods NAME READY STATUS RESTARTS AGE fluent-bit-58vs8 1/1 Running 0 6m56s $ kubectl logs fluent-bit-58vs8 [DUMPS-ALL-CLUSTER-LOGS-TO-CONSOLE] Now you have a fully running Kubernetes cluster, with two nodes, a workload in the form of a Ghost CMS, and finally, you've installed Fluent Bit as your telemetry pipeline, configured to collect all cluster logs. If you want to do this without each step done manually, I've provided a Logs Control Easy Install project repository that you can download, unzip, and run with one command to automate the above setup on your local machine. More in the Series In this article, you learned how to install and configure Fluent Bit on a Kubernetes cluster to collect telemetry from the cluster. This article is based on this online free workshop. There will be more in this series as you continue to learn how to configure, run, manage, and master the use of Fluent Bit in the wild. Next up, controlling your logs with Fluent Bit on a Kubernetes cluster.
Introduction to AWS Network Load Balancer AWS has several critical services that drive the internet. If you have ever built any application on top of AWS and need a high throughput or volume of traffic, the chances are that you’ve leaned on an AWS Network Load Balancer at some point in the discussion. AWS NLB is nothing but a Layer 4 load balancer, and consistency helps with low-latency forwarding of massive amounts of TCP, UDP, and even TLS traffic. NLBs, being operational at Layer 4 of the OSI model, support a host of features. You get features like static IPs, support for long-lived connections out of the box, and can be configured to our requirements. In my projects, I’ve used NLBs for use cases ranging from being the front end for low-latency database requests to hosting an entire backend of an application. NLB helps in all these use cases by giving us a consistent latency, and it holds up its end every time. There are alternatives for NLBs like the AWS Application Load Balancers, but they operate at a higher level of the OSI model and are not always the choice for developers looking for a high-throughput, no-nonsense load balancer. Introduction to Zero Trust Architecture (ZTA) Zero trust is a concept that has been around for a while, and the original term was coined back in 2010. Given the growth in conversations about moving applications to the cloud, zero trust has been thrust into the forefront of conversations. In a traditional sense, there is an assumption that when a device or user is inside your network, it is considered safe. But, in a cloud-based network, this doesn’t really hold up anymore. Zero trust comes with a concept that you cannot trust anything, and you need to verify everything. Whether it’s a person, device, or another service, it has to prove that it belongs. It extends the concept of least privilege, stressing the need to validate identity at every step and never let your guard down. Why Zero Trust Architecture Matters for AWS Network Load Balancers In an AWS architecture, NLBs are usually the first line of interaction with the user, sitting right at the edge. It can take traffic from the internet, other services, or both. This is the reason NLBs are a prime spot for security enforcement. Let’s consider why NLBs and zero trust make sense: They’re the gatekeepers: Taking an analogy from your personal home, NLBs act as a front door. If your door is unlocked, you have the possibility of letting strangers wander into your home.Layer 4 simplicity: Unlike an ALB, NLB doesn’t have visibility into HTTP headers or cookies. This simplicity is why NLBs are fast, but also need extra effort to lock in security. Zero trust helps you lock things down with TLS, identity-aware proxies, and traffic filtering.They serve the heavy hitters: NLBs are often used by organizations and applications to front latency-sensitive apps like financial APIs, gaming, or streaming services. This calls for the need to have security at NLB without sacrificing performance. Zero trust gives you a blueprint for that.The perimeter is blurry: More often than not, NLBs aren’t directly public-facing; rather, they are the backend for stuff like PrivateLink and multi-account setups. Traffic in these cases could be coming from anywhere, and we cannot classify these applications as internal just because they are coming in from internal AWS services. ZTA asks you to treat every such connection with suspicion. Core Zero Trust Principles Applied to NLBs Now that we know what zero trust is and what the AWS NLB use case is, let’s actually put zero trust into action with the NLBs. This is what it looks like in practice: Never Trust, Always Verify NLBs don’t have the capabilities to look deep into packets, but rather just the headers. However, NLBs can still enforce TLS with a valid certificate. If you need further security, we can even insert a service like OAuth that can help authenticate users and services. More often than not, we will need every TCP connection to prove its identity before anything moves forward. Least Privilege Access A common issue we all have during building up network is opening things up to a big CIDR block as it’s very convenient. Although convenient, it goes against the concept of zero trust. It’s better to have control over this and lock it down better. Some of the ways we can achieve this are by using tightly scoped security groups, IAM policies, and target access controls. This way, we can only let traffic that truly needs access get through. Micro-Segmentation A big monolithic NLB is always a problem in terms of security. It’s good to split services across different NLBs and VPCs. This helps mitigate compromise in a single entity. A single NLB being compromised doesn’t necessarily mean your entire network path is compromised. This way, we can ensure the blast radius is small. Continuous Monitoring An important use case for NLB is monitoring. AWS supports VPC flow logs extensively, and it also incorporates traffic involving NLBs. In addition to VPC flow logs, NLB access logs are an important auditing tool as well. AWS CloudWatch is by far one of the best log visualization services out there, and we can use that to monitor some of the publicly available monitors that NLB vends out and add monitoring to those accordingly. Public vs. Private NLBs: Same Principles, Different Playbooks Whether your NLB is public-facing or internal-only, zero trust applies to both. It’s just that you will implement them a bit differently. Public NLBs: These provide public endpoints, and anyone can access them. A common way to lock public NLBs is to use TLS. We can also add CloudFront or a third-party edge provider, all while keeping IP filtering and aggressive throttling to avoid DDOS attacks.Private NLBs: These don’t have a public-facing endpoint and are often used along with other AWS networking. For this kind of NLB, it’s preferable to use PrivateLink in the network infrastructure. We need to make sure the IAM permissions are restrictive and use CloudWatch and logs to monitor everything. We have to treat even internal traffic like it might be hostile, because sometimes, it is. Implementation Steps for Zero Trust With AWS NLB Here’s a playbook to bring zero trust to life around your NLBs: Start with private subnets: Make sure the NLBs are moving into private subnets where possible. Use security groups for further restrictions on who can even see them.TLS termination: A secure communication line is vital in a zero-trust environment. Consider using TLS for an NLB and terminating it at the NLB using AWS Certificate Manager.Layer in auth: In many use cases, the traffic will be from another AWS Service. For such service-to-service calls, always use IAM. For user-facing use cases, put something like Cognito, OAuth, or an API Gateway in front.Monitor everything: AWS prides itself on its monitoring capability. Use NLB access logs, VPC flow logs, CloudWatch metrics, and make sure these logs and metrics land in a place where owners can validate. Whether it’s AWS Security Hub or 3rd party services from Splunk, Datadog, the key is to have centralized visibility.Use PrivateLink: From a security standpoint, if the communications are between AWS services or VPCs, PrivateLink helps keep the traffic off the public internet, and this will let you enforce strict access controls. Advanced NLB Security Configurations If you want more advanced protection on your NLBs, there are other considerations to look at: Client IP preservation: NLBs can keep the original source IP if needed. When it comes to monitoring, it is an added benefit as you can get more details from the client IP, including geolocation, and enforce IP-based access control.DDoS protection: AWS Shield Standard is available for you, but NLBs are handling critical workloads; look into Shield Advanced. If your use case needs application-layer protections, add CloudFront + WAF in front of the NLB.Cross-zone consistency: AWS allows you to have cross-zone NLBs, and if you are using cross-zone enabled NLBs, make sure your security settings, including groups, logs, and IAM roles, are consistent in all the available zones.PrivateLink endpoint controls: When exposing services through PrivateLink, unless the use case doesn’t let you do it, have manual connection approvals.Cryptographic hygiene: Enforce newer TLS ciphers and use ECDSA certs where you can. It’s faster and more secure. Final Thoughts Here’s the final parting thought: Zero trust isn’t just a feature you toggle on, but rather it’s a way of thinking or a mindset. When you apply this mindset to an AWS NLB setup, you can go from routing packets to actually securing it in a real-world use case and meaningful ways. AWS gives you the building blocks for zero trust, like static IPs, TLS, PrivateLink, IAM, and logs. It’s up to you to stitch them together. By ensuring zero trust practices are followed for an NLB, you make it not just fast, scalable, and reliable, but also smart and secure. And in today’s threat landscape, that’s what matters most.
Kubernetes, also known as K8S, is an open-source container orchestration system that is used for automating the deployment, scaling, and management of containerized workloads. Containers are at the heart of the Kubernetes ecosystem and are the building blocks of the services built and managed by K8S. Understanding how containers are run is key to optimizing your Kubernetes environment. What Are Containers and Container Runtimes? Containers bundle applications and their dependencies in a lightweight, efficient way. Many people think Docker runs containers directly, but that's not quite accurate. Docker is actually a collection of tools sitting on top of a high-level container runtime, which in turn uses a low-level runtime implementation. Runtime systems dictate how the image being deployed by a container is managed. Container runtimes are the actual operators that run the containers in the most efficient way, and they affect how resources such as network, disk, performance, I/O, etc., are managed. So while Kubernetes orchestrates the containers, such as where to run containers, it is the runtime that executes those decisions. Picking a container runtime thus influences the application performance. Container runtimes themselves come in two flavors: high-level container runtimes that handle image management and container lifecycle, and low-level OCI-compliant runtimes that actually create and run the containers. Low-level runtimes are basically libraries that a developer of high-level runtimes can make use of while developing high-level runtimes to make use of the low-level features. A high-level runtime receives instructions, manages the necessary image, and then calls a low-level runtime to create and run the actual container process. What High-Level Container Runtime Should You Choose? There are various studies that compare the low-level runtimes, but it is also important that high-level container runtimes are chosen carefully. Docker: This is a container runtime that includes container creation, packing, sharing, and execution. Docker was created as a monolithic daemon, dockerd, and the docker client program, and features a client/server design. The daemon handled the majority of the logic for creating containers, managing images, and operating containers, as well as providing an API.ContainerD: This was created to be used by Docker and Kubernetes, as well as any other container technology that desires to abstract out syscalls and OS-specific functionality in order to operate containers on Linux, Windows, SUSE, as well as other operating systems.CRI-O: This was created specifically as a lightweight runtime only for Kubernetes and can handle only those kinds of operations. The runtimes mentioned are popular and are being offered by every major cloud provider. While Docker, as the high-level container runtime, is on its way out, the other two are here to stay. Parameters to Consider Performance: ContainerD or CRI-O is generally known to have better performance since the overhead of operations is lower. Docker is a monolithic system that has all the feature bits required, which increases the overhead. Although the network performance between the two is not very different, either can be chosen if that is an important factor.Features: Since ContainerD is a lightweight system, it does not always have all the features if that is an important consideration, whereas Docker has a large feature set. When comparing ContainerD to CRI-O, CRI-O has a smaller feature set since it only targets Kubernetes.Defaults: A lot of the cloud providers have recommendations for the managed container runtimes. There are benefits to using them directly since they should have longer support. Why Should You Consider Manual Deployment? Till now, I have talked about managed K8S deployment, which is provided by the major cloud providers such as Amazon, Microsoft, Google, etc. But there is another way of hosting your infrastructure — manage it on your own. This is where manual deployment comes in. You have full control over every single component in your system, giving you the ability to remove unnecessary features. But it introduces the overhead of managing the deployment. Conclusion It becomes vital to jot down the use case that is being tried to achieve while making decisions. For some cases, a manual deployment would be better, whereas in other cases, a managed deployment would win. By understanding these different components and trade-offs, you can make better informed decisions about configuring your high-level container runtime for optimal performance and manageability.
AWS API Gateway is a managed service to create, publish, and manage APIs. It serves as a bridge between your applications and backend services. When creating APIs for our backend services, we tend to open it up using public IPs. Yes, we do authenticate and authorize access. However, oftentimes it is seen that a particular API is meant for internal applications only. In such cases it would be great to declare these as private. Public APIs expose your services to a broader audience over the internet and thus come with risks related to data exposure and unauthorized access. On the other hand, private APIs are meant for internal consumption only. This provides an additional layer of security and eliminates the risk of potential data theft and unauthorized access. AWS API Gateway supports private APIs. If an API is only by internal applications only it should be declared as private in API Gateway. This ensures that your data remains protected while still allowing teams to leverage the API for developing applications. The Architecture So, how does a private API really work? The first step is to mark the API as private when creating one in the API gateway. Once done, it will not have any public IP attached to it, which means that it will not be accessible over the Internet. Next, proceed with the API Gateway configuration. Define your resources and methods according to your application’s requirements. For each method, consider implementing appropriate authorization mechanisms such as IAM roles or resource policies to enforce strict access controls. Setting up the private access involves creating an interface VPC endpoint. The consumer applications would typically be running in a private subnet of a VPC. These applications would be able to access the api through the VPC endpoint. As an example, let us suppose that we are building an application using ECS as the compute service. The ECS cluster would run within a private subnet of a VPC. The application would need to access some common services of the application. These services are a set of microservices developed on Lambda and exposed through API Gateway. This is a perfect scenario and a pretty common one where it makes sense to declare these APIs as private. Key Benefits A private API can significantly increase the performance and security of an application. In this age of cybercrime, protecting data should be of utmost importance. Unscrupulous actors on the internet are always on the lookout for vulnerabilities, and any leak in the system poses a potential threat of data theft. Data security use cases are becoming incredibly important. This is where a private API is so advantageous. All interactions between services are within a private network, and since the services are not publicly exposed, there is no chance of data theft over the internet. Private APIs allow a secure method of data exchange, and the less exposed your data is, the better. Private APIs allow you to manage the overall data security aspects of your enterprise solution by letting you control access to sensitive data and ensuring it’s only exposed in the secure environments you’ve approved. The requests and responses don’t need to travel over the internet. Interactions are within a closed network. Resources in a VPC can interact with the API over private AWS network. This goes a long way in reducing latencies and optimizing network traffic. As a result private API can ensure better performance and for applications with quick processing needs can be a go to option. Moreover, private APIs make it easy to implement strong access control. You can determine, with near capability, who can access what from where, and what certain conditions need to be in place to do so, while providing custom access level groups as your organization sees fit. With the thoroughness of access control being signed off, not only is security improved, but you can also increase the flow to get things done. Finally, there is the element of cost that many do not consider when using private APIs in the AWS API Gateway as a benefit. Utilizing private APIs can significantly reduce the costs that flow when dealing with public traffic costs or resources that rely on perfect utilization in the transformed environment with the VPC. While you could think of this as a potential variable, and save you significant amounts of cost over time, if achieved. In addition to the benefits above, private APIs give your business the opportunity to develop an enterprise solution that meets your development needs. Building internal applications for your own use can help further customize your workflows or tailor customer experience, by allowing unique steps and experiences to be developed for customer journeys. Private APIs allow your organization to be dynamic and replicate tools or services quickly, while maintaining control of your technology platform. This allows your business to apply ideas and processes for future growth while remaining competitive in an evolving marketplace. Deploying private APIs within the AWS API Gateway is not solely a technical move — it is a means of investing in the reliability, future-proofing, and capability of your system architecture. The Importance of Making APIs Private In the modern digital world, securing your APIs has never been more important. If you don’t require public access to your APIs by clients, the best option is to make them private. By doing so, you can reduce the opportunity for threats and vulnerabilities to exist where they may compromise your data and systems. Public APIs become targets for anyone with malicious intent who wants to find and exploit openings. By keeping your APIs private and limiting access, you protect sensitive information and improve performance by removing unnecessary traffic. Additionally, utilizing best practices for secure APIs — using authentication protocols, testing for rate limiting, and encrypting your sensitive information — adds stronger front-line defenses. Making your APIs private is not just a defensive action, but a proactive strategy to secure the organization and their assets. In a world where breaches can result in catastrophic consequences, a responsible developer or organization should take every preemptive measure necessary to protect their digital environment. Best Practices The implementation of private APIs requires following best practices to achieve strong security, together with regulated access and efficient versioning. Safety needs to be your number one priority at all times. Your data protection against unauthorized access becomes possible through the implementation of OAuth or API keys authentication methods. Implementing a private API doesn’t mean that unauthorized access will not happen, and adequate protection should be in place. API integrity depends heavily on proper access control mechanisms. Role-based access control (RBAC) should be used to ensure users receive permissions that exactly match their needs. The implementation of this system protects sensitive endpoints from exposure while providing authorized users with smooth API interaction. The sustainable operation of your private API depends on proper management of its versioning system to satisfy users. A versioning system based on URL paths or request headers enables you to introduce new features and updates without disrupting existing integrations. The approach delivers a better user experience while establishing trust in API reliability. Conclusion In conclusion, private APIs aren't a passing fad; they are a deliberate initiative to help you maximize your applications with regard to supercharged security and efficiency. When you embrace private APIs, you are creating a method to protect sensitive data within a security-first framework, while enabling its use on other internal systems. In the environment of constant data breaches, that safeguard is paramount. The value of private APIs will undoubtedly improve not only the security posture of your applications but also the performance of your applications overall.
Event-driven architectures (EDA) have been a cornerstone in designing cloud systems that are future-proofed, scalable, resilient, and sustainable in nature. EDA is interested in generation, capture, and response to events and nothing more, not even in traditional systems of request-response. The paradigm is most suitable to systems that require high decoupling, elasticity, and fault tolerance. In this article, I'll be discussing the technical details of event-driven architectures, along with snippets of code, patterns, and practical strategies of implementation. Let's get started! Core Principles of Event-Driven Architecture Event-driven architecture (EDA) is a way of designing systems where different services communicate by responding to events as they happen. At its core, EDA relies on key principles that enable seamless interaction, scalability, and responsiveness across applications. They can be summarized as: 1. Event Producers, Consumers, and Brokers Event producers: Systems that produce events, i.e., the action of a user, sensor readings of Internet of Things (IoT) devices, or system events.Event consumers: Process or services that process events and take some action.Event brokers: Middleware components that manage communication between producers and consumers using event dissemination (e.g., Kafka, RabbitMQ, Amazon SNS). 2. Event Types Discrete events: Single events, i.e., logon of a user.Stream events: Streams of events, i.e., telemetry readings of an IoT sensor. 3. Asynchronous Communication EDA is asynchronous in nature, in which producers are decoupled from consumers. Systems can be evolved and scaled independently. 4. Eventual Consistency For distributed systems, EDA prefers eventual consistency over consistency, offering higher throughput and scalability. Benefits of event-driven architectures include: Scalability: Decoupled components can be scaled separately.Resilience: Failure in one component would not impact other components.Flexibility: One can plug in or replace pieces without a gigantic amount of reengineering.Real-time processing: EDA is a natural fit for processing in real time, analysis, monitoring, and alarming. Using EDA in Cloud Solutions To appreciate EDA in action, suppose you have a sample e-commerce cloud application that processes orders, maintains stock up to date, and notifies users in real time. Let's build this system ground up using contemporary cloud technologies and software design principles. The tech stack we'll be using in this tutorial: Event broker: Apache Kafka or Amazon EventBridgeConsumers/producers: Python microservicesCloud infrastructure: AWS Lambda, S3, DynamoDB Step 1: Define Events Decide on events that are driving your system. In an e-commerce application, events that you would generally find are something like these: OrderPlaced PaymentProcessed InventoryUpdated UserNotified Step 2: Event Schema Design an event schema to allow components to send events to each other in a standardized manner. Assuming you use JSON as the events structure, here's what a sample structure would look like (feel free to write your own format): JSON { "eventId": "12345", "eventType": "OrderPlaced", "timestamp": "2025-01-01T12:00:00Z", "data": { "orderId": "67890", "userId": "abc123", "totalAmount": 150.75 } } Step 3: Producer Implementation An OrderService produces events when a new order is created by a customer. Here's what it looks like: Python from kafka import KafkaProducer import json def produce_event(event_type, data): producer = KafkaProducer( bootstrap_servers='localhost:9092', value_serializer=lambda v: json.dumps(v).encode('utf-8')) event = { "eventId": "12345", "eventType": event_type, "timestamp": "2025-01-01T12:00:00Z", "data": data } producer.send('order_events', value=event) producer.close() # Example usage order_data = { "orderId": "67890", "userId": "abc123", "totalAmount": 150.75 } produce_event("OrderPlaced", order_data) Step 4: Event Consumer An OrderPlaced event is processed by a NotificationService to notify the user. Let's quickly write up a Python script to consume the events: Python from kafka import KafkaConsumer import json def consume_events(): consumer = KafkaConsumer( 'order_events', bootstrap_servers='localhost:9092', value_deserializer=lambda v: json.loads(v.decode('utf-8')) ) for message in consumer: event = message.value if event['eventType'] == "OrderPlaced": send_notification(event['data']) def send_notification(order_data): print(f"Sending notification for Order ID: {order_data['orderId']} to User ID: {order_data['userId']}") # Example usage consume_events() Step 5: Event Broker Configuration Create Kafka or a cloud-native event broker like Amazon EventBridge to route events to their destinations. In Kafka, create a topic named order_events. Shell kafka-topics --create --topic order_events --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1 We'll use this as a for storing and organizing data. Topics are similar to folders in a file system, where events are the files. Fault Tolerance and Scaling Fault tolerance and scalability are achieved by decoupling components in a manner that each of them fails without jeopardizing the system, making it convenient to scale horizontally by adding or deleting components to accommodate different workloads to support different demands; such a design is highly resilient and scalable to different demands. Some of the methods are: 1. Dead Letter Queues (DLQs) Queue failed events to retry later using DLQs. As a sample, in case of failure in processing events in the NotificationService, it can be sent to a DLQ to be retried. 2. Horizontal Scaling Horizontally scale up consumers to process more events in parallel. Kafka consumer groups are provided out of the box to distribute messages across multiple consumers. 3. Retry Mechanism Use exponential backoff retry in case of failure. Here's an example: Python import time def process_event_with_retries(event, max_retries=3): for attempt in range(max_retries): try: send_notification(event['data']) break except Exception as e: print(f"Attempt {attempt + 1} failed: {e}") time.sleep(2 ** attempt) Advanced Patterns in EDA Let's now explore some advanced patterns that are essential for event-driven architecture (EDA). Buckle up! 1. Event Sourcing "Event Sourcing Pattern" refers to a design approach where every change to an application's state is captured and stored as a sequence of events. Here's an example to save all events to be able to retrieve the system state at any given point in time. Helpful for audit trails and debugging. Here's a sample Python program example: Python # Save event to a persistent store import boto3 dynamodb = boto3.resource('dynamodb') event_table = dynamodb.Table('EventStore') def save_event(event): event_table.put_item(Item=event) 2. CQRS (Command Query Responsibility Segregation) The command query responsibility segregation (CQRS) pattern separates the data mutation, or the command part of a system, from the query part. You can use the CQRS pattern to separate updates and queries if they have different requirements for throughput, latency, or consistency. This allows each model to be optimized independently and can improve performance, scalability, and security of an application. 3. Streaming Analytics Use Apache Flink or AWS Kinesis Data Analytics to process streams of events in real-time to get insights and send alarms. To deploy and run the streaming ETL pipeline, the architecture relies on Kinesis Data Analytics. Kinesis Data Analytics enables you to run Flink applications in a fully managed environment. The service provisions and manages the required infrastructure, scales the Flink application in response to changing traffic patterns, and automatically recovers from infrastructure and application failures. You can combine the expressive Flink API for processing streaming data with the advantages of a managed service by using Kinesis Data Analytics to deploy and run Flink applications. It allows you to build robust streaming ETL pipelines and reduces the operational overhead of provisioning and operating infrastructure. Conclusion Event-driven architectures are a strongly compelling paradigm for building scalable and resilient systems in the cloud. With asynchronous communication, eventual consistency, and advanced patterns such as event sourcing and CQRS, developers can build resilient systems that can cope with changing requirements. Such tools of today, such as Kafka, AWS EventBridge, and microservices, enable one to use EDA easily in a multi-cloud environment. This article, loaded with practical application use cases, is just the start of applying event-driven architecture to your next cloud project. With EDA, companies can enjoy the complete benefits of real-time processing, scalability, and fault tolerance.
The Go SDK for Azure Cosmos DB is built on top of the core Azure Go SDK package, which implements several patterns that are applied throughout the SDK. The core SDK is designed to be quite customizable, and its configurations can be applied with the ClientOptions struct when creating a new Cosmos DB client object using NewClient (and other similar functions). If you peek inside the azcore.ClientOptions struct, you will notice that it has many options for configuring the HTTP client, retry policies, timeouts, and other settings. In this blog, we will cover how to make use of (and extend) these common options when building applications with the Go SDK for Cosmos DB. I have provided code snippets throughout this blog. Refer to this GitHub repository for runnable examples. Retry Policies Common retry scenarios are handled in the SDK. You can dig into cosmos_client_retry_policy.go for more info. Here is a summary of errors for which retries are attempted: Error Type / Status CodeRetry LogicNetwork Connection ErrorsRetry after marking endpoint unavailable and waiting for defaultBackoff.403 Forbidden (with specific substatuses)Retry after marking endpoint unavailable and updating the endpoint manager.404 Not Found (specific substatus)Retry by switching to another session or endpoint.503 Service UnavailableRetry by switching to another preferred location. Let's see some of these in action. Non-Retriable Errors For example, here is a function that tries to read a database that does not exist. Go func retryPolicy1() { c, err := auth.GetClientWithDefaultAzureCredential("https://demodb.documents.azure.com:443/", nil) if err != nil { log.Fatal(err) } azlog.SetListener(func(cls azlog.Event, msg string) { // Log retry-related events switch cls { case azlog.EventRetryPolicy: fmt.Printf("Retry Policy Event: %s\n", msg) } }) // Set logging level to include retries azlog.SetEvents(azlog.EventRetryPolicy) db, err := c.NewDatabase("i_dont_exist") if err != nil { log.Fatal("NewDatabase call failed", err) } _, err = db.Read(context.Background(), nil) if err != nil { log.Fatal("Read call failed: ", err) } The azcore logging implementation is configured using SetListener and SetEvents to write retry policy event logs to standard output. See Logging section in azcosmos package README for details. Let's look at the logs generated when this code is run: Plain Text //.... Retry Policy Event: exit due to non-retriable status code Retry Policy Event: =====> Try=1 for GET https://demodb.documents.azure.com:443/dbs/i_dont_exist Retry Policy Event: response 404 Retry Policy Event: exit due to non-retriable status code Read call failed: GET https://demodb-region.documents.azure.com:443/dbs/i_dont_exist -------------------------------------------------------------------------------- RESPONSE 404: 404 Not Found ERROR CODE: 404 Not Found When a request is made to read a non-existent database, the SDK gets a 404 (not found) response for the database. This is recognized as a non-retriable error, and the SDK stops retrying. Retries are only performed for retriable errors (like network issues or certain status codes). The operation failed because the database does not exist. Retriable Errors - Invalid Account This function tries to create a Cosmos DB client using an invalid account endpoint. It sets up logging for retry policy events and attempts to create a database. Go func retryPolicy2() { c, err := auth.GetClientWithDefaultAzureCredential("https://iamnothere.documents.azure.com:443/", nil) if err != nil { log.Fatal(err) } azlog.SetListener(func(cls azlog.Event, msg string) { // Log retry-related events switch cls { case azlog.EventRetryPolicy: fmt.Printf("Retry Policy Event: %s\n", msg) } }) // Set logging level to include retries azlog.SetEvents(azlog.EventRetryPolicy) _, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil) if err != nil { log.Fatal(err) } } Let's look at the logs generated when this code is run, and see how the SDK handles retries when the endpoint is unreachable: Plain Text //.... Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #1, Delay=682.644105ms Retry Policy Event: =====> Try=2 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #2, Delay=2.343322179s Retry Policy Event: =====> Try=3 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #3, Delay=7.177314269s Retry Policy Event: =====> Try=4 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: MaxRetries 3 exceeded failed to retrieve account properties: Get "https://iamnothere.docume Each failed attempt is logged, and the SDK retries the operation several times (three times to be specific), with increasing delays between attempts. After exceeding the maximum number of retries, the operation fails with an error indicating the host could not be found - the SDK automatically retries transient network errors before giving up. But you don't have to stick to the default retry policy. You can customize the retry policy by setting the azcore.ClientOptions when creating the Cosmos DB client. Configurable Retries Let's say you want to set a custom retry policy with a maximum of two retries and a delay of one second between retries. You can do this by creating a policy.RetryOptions struct and passing it to the azcosmos.ClientOptions when creating the client. Go func retryPolicy3() { retryPolicy := policy.RetryOptions{ MaxRetries: 2, RetryDelay: 1 * time.Second, } opts := azcosmos.ClientOptions{ ClientOptions: policy.ClientOptions{ Retry: retryPolicy, }, } c, err := auth.GetClientWithDefaultAzureCredential("https://iamnothere.documents.azure.com:443/", &opts) if err != nil { log.Fatal(err) } log.Println(c.Endpoint()) azlog.SetListener(func(cls azlog.Event, msg string) { // Log retry-related events switch cls { case azlog.EventRetryPolicy: fmt.Printf("Retry Policy Event: %s\n", msg) } }) azlog.SetEvents(azlog.EventRetryPolicy) _, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil) if err != nil { log.Fatal(err) } } Each failed attempt is logged, and the SDK retries the operation according to the custom policy — only two retries, with a 1-second delay after the first attempt and a longer delay after the second. After reaching the maximum number of retries, the operation fails with an error indicating the host could not be found. Plain Text Retry Policy Event: =====> Try=1 for GET https://iamnothere.documents.azure.com:443/ //.... Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #1, Delay=1.211970493s Retry Policy Event: =====> Try=2 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #2, Delay=3.300739653s Retry Policy Event: =====> Try=3 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: MaxRetries 2 exceeded failed to retrieve account properties: Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host exit status 1 Note: The first attempt is not counted as a retry, so the total number of attempts is three (1 initial + 2 retries). You can customize this further by implementing fault injection policies. This allows you to simulate various error scenarios for testing purposes. Fault Injection For example, you can create a custom policy that injects a fault into the request pipeline. Here, we use a custom policy (FaultInjectionPolicy) that simulates a network error on every request. Go type FaultInjectionPolicy struct { failureProbability float64 // e.g., 0.3 for 30% chance to fail } // Implement the Policy interface func (f *FaultInjectionPolicy) Do(req *policy.Request) (*http.Response, error) { if rand.Float64() < f.failureProbability { // Simulate a network error return nil, &net.OpError{ Op: "read", Net: "tcp", Err: errors.New("simulated network failure"), } } // no failure - continue with the request return req.Next() } This can be used to inject custom failures into the request pipeline. The function configures the Cosmos DB client to use this policy, sets up logging for retry events, and attempts to create a database. Go func retryPolicy4() { opts := azcosmos.ClientOptions{ ClientOptions: policy.ClientOptions{ PerRetryPolicies: []policy.Policy{&FaultInjectionPolicy{failureProbability: 0.6}, }, } c, err := auth.GetClientWithDefaultAzureCredential("https://ACCOUNT_NAME.documents.azure.com:443/", &opts) // Updated to use opts if err != nil { log.Fatal(err) } azlog.SetListener(func(cls azlog.Event, msg string) { // Log retry-related events switch cls { case azlog.EventRetryPolicy: fmt.Printf("Retry Policy Event: %s\n", msg) } }) // Set logging level to include retries azlog.SetEvents(azlog.EventRetryPolicy) _, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test_1"}, nil) if err != nil { log.Fatal(err) } } Take a look at the logs generated when this code is run. Each request attempt fails due to the simulated network error. The SDK logs each retry, with increasing delays between attempts. After reaching the maximum number of retries (default = 3), the operation fails with an error indicating a simulated network failure. Note: This can change depending on the failure probability you set in the FaultInjectionPolicy. In this case, we set it to 0.6 (60% chance to fail), so you may see different results each time you run the code. Plain Text Retry Policy Event: =====> Try=1 for GET https://ACCOUNT_NAME.documents.azure.com:443/ //.... Retry Policy Event: MaxRetries 0 exceeded Retry Policy Event: error read tcp: simulated network failure Retry Policy Event: End Try #1, Delay=794.018648ms Retry Policy Event: =====> Try=2 for GET https://ACCOUNT_NAME.documents.azure.com:443/ Retry Policy Event: error read tcp: simulated network failure Retry Policy Event: End Try #2, Delay=2.374693498s Retry Policy Event: =====> Try=3 for GET https://ACCOUNT_NAME.documents.azure.com:443/ Retry Policy Event: error read tcp: simulated network failure Retry Policy Event: End Try #3, Delay=7.275038434s Retry Policy Event: =====> Try=4 for GET https://ACCOUNT_NAME.documents.azure.com:443/ Retry Policy Event: error read tcp: simulated network failure Retry Policy Event: MaxRetries 3 exceeded Retry Policy Event: =====> Try=1 for GET https://ACCOUNT_NAME.documents.azure.com:443/ Retry Policy Event: error read tcp: simulated network failure Retry Policy Event: End Try #1, Delay=968.457331ms 2025/05/05 19:53:50 failed to retrieve account properties: read tcp: simulated network failure exit status 1 Do take a look at Custom HTTP pipeline policies in the Azure SDK for Go documentation for more information on how to implement custom policies. HTTP-Level Customizations There are scenarios where you may need to customize the HTTP client used by the SDK. For example, when using the Cosmos DB emulator locally, you want to skip certificate verification to connect without SSL errors during development or testing. TLSClientConfig allows you to customize TLS settings for the HTTP client and setting InsecureSkipVerify: true disables certificate verification – useful for local testing but insecure for production. Go func customHTTP1() { // Create a custom HTTP client with a timeout client := &http.Client{ Transport: &http.Transport{ TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, }, } clientOptions := &azcosmos.ClientOptions{ ClientOptions: azcore.ClientOptions{ Transport: client, }, } c, err := auth.GetEmulatorClientWithAzureADAuth("http://localhost:8081", clientOptions) if err != nil { log.Fatal(err) } _, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil) if err != nil { log.Fatal(err) } } All you need to do is pass the custom HTTP client to the ClientOptions struct when creating the Cosmos DB client. The SDK will use this for all requests. Another scenario is when you want to set a custom header for all requests to track requests or add metadata. All you need to do is implement the Do method of the policy.Policy interface and set the header in the request: Go type CustomHeaderPolicy struct{} func (c *CustomHeaderPolicy) Do(req *policy.Request) (*http.Response, error) { correlationID := uuid.New().String() req.Raw().Header.Set("X-Correlation-ID", correlationID) return req.Next() } Looking at the logs, notice the custom header X-Correlation-ID is added to each request: Plain Text //... Request Event: ==> OUTGOING REQUEST (Try=1) GET https://ACCOUNT_NAME.documents.azure.com:443/ Authorization: REDACTED User-Agent: azsdk-go-azcosmos/v1.3.0 (go1.23.6; darwin) X-Correlation-Id: REDACTED X-Ms-Cosmos-Sdk-Supportedcapabilities: 1 X-Ms-Date: Tue, 06 May 2025 04:27:37 GMT X-Ms-Version: 2020-11-05 Request Event: ==> OUTGOING REQUEST (Try=1) POST https://ACCOUNT_NAME-region.documents.azure.com:443/dbs Authorization: REDACTED Content-Length: 27 Content-Type: application/query+json User-Agent: azsdk-go-azcosmos/v1.3.0 (go1.23.6; darwin) X-Correlation-Id: REDACTED X-Ms-Cosmos-Sdk-Supportedcapabilities: 1 X-Ms-Date: Tue, 06 May 2025 04:27:37 GMT X-Ms-Documentdb-Query: True X-Ms-Version: 2020-11-05 OpenTelemetry Support The Azure Go SDK supports distributed tracing via OpenTelemetry. This allows you to collect, export, and analyze traces for requests made to Azure services, including Cosmos DB. The azotel package is used to connect an instance of OpenTelemetry's TracerProvider to an Azure SDK client (in this case, Cosmos DB). You can then configure the TracingProvider in azcore.ClientOptions to enable automatic propagation of trace context and emission of spans for SDK operations. Go func getClientOptionsWithTracing() (*azcosmos.ClientOptions, *trace.TracerProvider) { exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint()) if err != nil { log.Fatalf("failed to initialize stdouttrace exporter: %v", err) } tp := trace.NewTracerProvider(trace.WithBatcher(exporter)) otel.SetTracerProvider(tp) op := azcosmos.ClientOptions{ ClientOptions: policy.ClientOptions{ TracingProvider: azotel.NewTracingProvider(tp, nil), }, } return &op, tp } The above function creates a stdout exporter for OpenTelemetry (prints traces to the console). It sets up a TracerProvider, registers this as the global tracer, and returns a ClientOptions struct with the TracingProvider set, ready to be used with the Cosmos DB client. Go func tracing() { op, tp := getClientOptionsWithTracing() defer func() { _ = tp.Shutdown(context.Background()) }() c, err := auth.GetClientWithDefaultAzureCredential("https://ACCOUNT_NAME.documents.azure.com:443/", op) //.... container, err := c.NewContainer("existing_db", "existing_container") if err != nil { log.Fatal(err) } //ctx := context.Background() tracer := otel.Tracer("tracer_app1") ctx, span := tracer.Start(context.Background(), "query-items-operation") defer span.End() query := "SELECT * FROM c" pager := container.NewQueryItemsPager(query, azcosmos.NewPartitionKey(), nil) for pager.More() { queryResp, err := pager.NextPage(ctx) if err != nil { log.Fatal("query items failed:", err) } for _, item := range queryResp.Items { log.Printf("Queried item: %+v\n", string(item)) } } } The above function calls getClientOptionsWithTracing to get tracing-enabled options and a tracer provider, and ensures the tracer provider is shut down at the end (flushes traces). It creates a Cosmos DB client with tracing enabled, executes an operation to query items in a container. The SDK call is traced automatically, and exported to stdout in this case. You can plug in any OpenTelemetry-compatible tracer provider and traces can be exported to various backend. Here is a snippet for Jaeger exporter. The traces are quite large, so here is a small snippet of the trace output. Check the query_items_trace.txt file in the repo for the full trace output: Go //... { "Name": "query_items democontainer", "SpanContext": { "TraceID": "39a650bcd34ff70d48bbee467d728211", "SpanID": "f2c892bec75dbf5d", "TraceFlags": "01", "TraceState": "", "Remote": false }, "Parent": { "TraceID": "39a650bcd34ff70d48bbee467d728211", "SpanID": "b833d109450b779b", "TraceFlags": "01", "TraceState": "", "Remote": false }, "SpanKind": 3, "StartTime": "2025-05-06T17:59:30.90146+05:30", "EndTime": "2025-05-06T17:59:36.665605042+05:30", "Attributes": [ { "Key": "db.system", "Value": { "Type": "STRING", "Value": "cosmosdb" } }, { "Key": "db.cosmosdb.connection_mode", "Value": { "Type": "STRING", "Value": "gateway" } }, { "Key": "db.namespace", "Value": { "Type": "STRING", "Value": "demodb-gosdk3" } }, { "Key": "db.collection.name", "Value": { "Type": "STRING", "Value": "democontainer" } }, { "Key": "db.operation.name", "Value": { "Type": "STRING", "Value": "query_items" } }, { "Key": "server.address", "Value": { "Type": "STRING", "Value": "ACCOUNT_NAME.documents.azure.com" } }, { "Key": "az.namespace", "Value": { "Type": "STRING", "Value": "Microsoft.DocumentDB" } }, { "Key": "db.cosmosdb.request_charge", "Value": { "Type": "STRING", "Value": "2.37" } }, { "Key": "db.cosmosdb.status_code", "Value": { "Type": "INT64", "Value": 200 } } ], //.... Refer to Semantic Conventions for Microsoft Cosmos DB. What About Other Metrics? When executing queries, you can get basic metrics about the query execution. The Go SDK provides a way to access these metrics through the QueryResponse struct in the QueryItemsResponse object. This includes information about the query execution, including the number of documents retrieved, etc. Plain Text func queryMetrics() { //.... container, err := c.NewContainer("existing_db", "existing_container") if err != nil { log.Fatal(err) } query := "SELECT * FROM c" pager := container.NewQueryItemsPager(query, azcosmos.NewPartitionKey(), nil) for pager.More() { queryResp, err := pager.NextPage(context.Background()) if err != nil { log.Fatal("query items failed:", err) } log.Println("query metrics:\n", *queryResp.QueryMetrics) //.... } } The query metrics are provided as a simple raw string in a key-value format (semicolon-separated), which is very easy to parse. Here is an example: Plain Text totalExecutionTimeInMs=0.34;queryCompileTimeInMs=0.04;queryLogicalPlanBuildTimeInMs=0.00;queryPhysicalPlanBuildTimeInMs=0.02;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.07;indexLookupTimeInMs=0.00;instructionCount=41;documentLoadTimeInMs=0.04;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=9;retrievedDocumentSize=1251;outputDocumentCount=9;outputDocumentSize=2217;writeOutputTimeInMs=0.02;indexUtilizationRatio=1.00 Here is a breakdown of the metrics you can obtain from the query response: Plain Text | Metric | Unit | Description | | ------------------------------ | ----- | ------------------------------------------------------------ | | totalExecutionTimeInMs | ms | Total time taken to execute the query, including all phases. | | queryCompileTimeInMs | ms | Time spent compiling the query. | | queryLogicalPlanBuildTimeInMs | ms | Time spent building the logical plan for the query. | | queryPhysicalPlanBuildTimeInMs | ms | Time spent building the physical plan for the query. | | queryOptimizationTimeInMs | ms | Time spent optimizing the query. | | VMExecutionTimeInMs | ms | Time spent executing the query in the Cosmos DB VM. | | indexLookupTimeInMs | ms | Time spent looking up indexes. | | instructionCount | count | Number of instructions executed for the query. | | documentLoadTimeInMs | ms | Time spent loading documents from storage. | | systemFunctionExecuteTimeInMs | ms | Time spent executing system functions in the query. | | userFunctionExecuteTimeInMs | ms | Time spent executing user-defined functions in the query. | | retrievedDocumentCount | count | Number of documents retrieved by the query. | | retrievedDocumentSize | bytes | Total size of documents retrieved. | | outputDocumentCount | count | Number of documents returned as output. | | outputDocumentSize | bytes | Total size of output documents. | | writeOutputTimeInMs | ms | Time spent writing the output. | | indexUtilizationRatio | ratio | Ratio of index utilization (1.0 means fully utilized). | Conclusion In this blog, we covered how to configure and customize the Go SDK for Azure Cosmos DB. We looked at retry policies, HTTP-level customizations, OpenTelemetry support, and how to access query metrics. The Go SDK for Azure Cosmos DB is designed to be flexible and customizable, allowing you to tailor it to your specific needs. For more information, refer to the package documentation and the GitHub repository. I hope you find this useful! Resources Go SDK for Azure Cosmos DBCore Azure Go SDK packageClientOptionsNewClient
Lots of businesses use Oracle databases to keep their important stuff. These databases mostly work fine, but yeah, sometimes they run into issues. Anyone who's worked with Oracle knows the feeling when things go wrong. Don't worry, though — these problems happen to everyone. Most fixes are actually pretty easy once you know what you are doing. I'll show you the usual Oracle headaches and how to fix them. 1. The "Snapshot Too Old" Error (ORA-01555) What's Happening Oracle is basically saying, "I can't remember what that data looked like anymore" when your query runs too long. Why It Happens Oracle already overwrote the old data it was keeping for reference.Your query is taking longer than Oracle is set to remember things.You are committing changes too often in a loop. How to Fix It Tell Oracle to remember things longer. SQL ALTER SYSTEM SET UNDO_RETENTION = 2000; Don't put COMMIT statements inside loops.Improve the performance of the queries by adding appropriate indexes. 2. The "Resource Busy" Error (ORA-00054) What's Happening You are trying to update something that someone/process is already using. Why It Happens Another user or process has locked the table or row you want. How to Fix It Find out who is blocking it. SQL SELECT s1.sid AS blocked_session_id, s1.serial# AS blocked_serial_num, s1.username AS blocked_user, s1.machine AS blocked_machine, s1.blocking_session AS blocking_session_id, s1.sql_id AS blocked_sql_id FROM v$session s1 WHERE s1.blocking_session IS NOT NULL ORDER BY s1.blocking_session; If needed, kill the process or just tell Oracle to wait a bit instead of giving up right away. SQL ALTER SYSTEM KILL SESSION 'sid, serial#' IMMEDIATE; 3. Sudden Disconnection Error (ORA-03113) What's Happening Your connection to the database dropped unexpectedly. Why It Happens Network issuesThe database crashed or restartedSoftware bugs How to Fix It Keep an eye on logs for alerts and make sure the database has not crashed.Look at the trace files for clues.Make sure your network is stable.Update Oracle if it's a known bug. 4. Permission Denied Error (ORA-01031) What's Happening Oracle won't let you do something because you don't have permission. Why It Happens Your user account doesn't have the right privileges. How to Fix It Get the permission you need. SQL GRANT CREATE TABLE TO username; For looking at someone else's data. SQL GRANT SELECT ON hr.employees TO app_user; 5. Password Expired Error (ORA-28001) What's Happening The user's password has expired. Why It Happens Oracle is enforcing password expiration rules. How to Fix It Reset the password: SQL ALTER USER username IDENTIFIED BY new_password; Or stop passwords from expiring. SQL ALTER PROFILE DEFAULT LIMIT PASSWORD_LIFE_TIME UNLIMITED; 6. Can't Connect Error (ORA-12154) What's Happening Oracle doesn't understand how to connect to the database you are asking for. Why It Happens The connection info is wrong or missing in your setup files. How to Fix It Check your tnsnames.ora file for mistakes.Make sure the service name matches.Try the simple connection format instead. SQL sqlplus user/password@//host:port/service_name 7. No More Space Error (ORA-01653) What's Happening You can't add more data because you are out of space. Why It Happens Your database file is full and cannot grow automatically. How to Fix It Add another data file: SQL ALTER TABLESPACE users ADD DATAFILE '/path/users10.dbf' SIZE 250M AUTOEXTEND ON MAXSIZE 500M; Let your existing file grow automatically. SQL ALTER DATABASE DATAFILE '/path/users11.dbf' RESIZE 500M; Keep an eye on your space. SQL SELECT tablespace_name, used_space, tablespace_size FROM dba_tablespace_usage_metrics; 8. Internal Error (ORA-00600) What's Happening Oracle came across a problem that it doesn’t know how to handle. Why It Happens Memory or data corruptionHardware failuresIncompatible parameter settings How to Fix It Run DBVERIFY or ANALYZE commands to check if the database is corrupted; if so, then it has to be restored from the backup.Work with Oracle support and share the logs and errors to help debug. 9. Super Slow Queries What's Happening This is a common problem where the query performance degrades with an increase in data. Why It Happens Poorly written SQLMissing indexesOutdated statisticsRunning the queries without filters How to Fix It See how Oracle is running your query. SQL EXPLAIN PLAN FOR type_your_query_here; SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY); Add indexes where needed.WHERE clauses should be used properly to return the correct result.Update the statistics. SQL EXEC DBMS_STATS.GATHER_TABLE_STATS('schema', 'table_name'); 10. Corrupted Data What's Happening Part of your database file got corrupted or damaged. Why It Happens Hardware failuresSudden shutdownsSoftware bugs How to Fix It Find the bad blocks. SQL SELECT * FROM v$database_block_corruption; Use RMANto repair them. SQL RMAN> BLOCKRECOVER DATAFILE 5 BLOCK 233 TO 245; Mark blocks that can't be fixed. SQL EXEC DBMS_REPAIR.ADMIN_TABLES(TRUE, FALSE, 'REPAIR_TABLE'); 11. High CPU Usage by Oracle Applications What's Happening Oracle is using too much CPU and slowing everything down. Why It Happens Inefficient queriesMissing indexesToo many background processes How to Fix It Find the queries that consume high CPU. SQL SELECT * FROM v$sql ORDER BY cpu_time DESC FETCH FIRST 20 ROWS ONLY; Fix those queries.Run performance reports if you have the license.Consider moving old data to archives. Tips to Avoid Problems Always have backups. This is the key to any database management as the system can always be reverted to avoid any data loss.Update statistics regularly. Oracle needs current info to work wellCheck logs often. Catch problems early by analyzing the logs periodicallyTest before production. Try changes in a test/stage environment first so that the majority of issues are caught before promoting the code to prod.Set up automatic health checks. Schedules can help keep everything aligned by running the processes on time. Conclusion Working with Oracle databases gets easier with practice. A lot of the problems you will run into are just the same few issues that everyone deals with. The more you work with these issues, the faster you'll spot and fix them. Hopefully, this article makes your database work a little easier and less stressful.
Core dumps play a key role in debugging programs that exit abnormally. They preserve the state of a program at failure, and with them, programmers can view and identify the causes of failures. In this article, a walkthrough is taken through a step-by-step exercise of enabling, creating, and checking out core dumps in Linux, and touches on high-end tools and techniques for debugging sophisticated failures, enabling quick diagnoses and resolution. 1. Enabling Core Dumps in Linux Check and Set ulimit Check for current value for core dumps: Shell ulimit -c Output: Shell 0 A value of 0 informs us that core dumps have been disabled.. Enable core dumps temporarily: Shell ulimit -c unlimited Check again: Shell ulimit -c Output: Shell unlimited To make it persistent, add the following in /etc/security/limits.conf: Shell * soft core unlimited * hard core unlimited Configure location for core dumps: Shell sudo sysctl -w kernel.core_pattern=/var/dumps/core.%e.%p %e: Program name %p: Process ID Make it persistent: Add the following in /etc/sysctl.conf: Shell kernel.core_pattern=/var/dumps/core.%e.%p Reload configuration: Shell sudo sysctl -p 2. Generate Core Dumps for Testing C Program to cause Segfault Create a testing program: Shell #include <stdio.h> int main() { int *ptr = NULL; // Null pointer *ptr = 1; // Segmentation fault return 0; } Make a build of it: Shell gcc -g -o crash_test crash_test.c Execute the program: Shell ./crash_test Output: Shell Segmentation fault (core dumped) Check for location of the core dump: Shell ls /var/dumps Example: Shell core.crash_test.12345 3. Analyze with GDB Load the Core Dump Shell gdb ./crash_test /var/dumps/core.crash_test.12345 Basic Analysis Get Backtrace Shell bt Output: Shell #0 0x0000000000401132 in main () at crash_test.c:4 4 *ptr = 1; The backtrace identifies that the crash happened at line 4. Examine Variables Shell info locals Output: Shell ptr = (int *) 0x0 The variable ptr is a null pointer, and a segmentation fault is confirmed. Disassemble the Code Shell disassemble main Output: Shell Dump of assembler code for function main: 0x000000000040112a <+0>: mov %rsp,%rbp 0x000000000040112d <+3>: movl $0x0,-0x4(%rbp) 0x0000000000401134 <+10>: movl $0x1,(%rax) Displays actual assembly instruction at location of crash. Check Registers Shell info registers Output: Shell rax 0x0 0 rbx 0x0 0 rcx 0x0 0 Register rax is 0, showing the null pointer dereference. 4. Debugging Multithreaded Applications Check Threads Shell info threads Output: Shell Id Target Id Frame * 1 Thread 0x7f64c56 (LWP 12345) "crash_test" main () at crash_test.c:4 Switch to a Specific Thread Shell thread 1 Get Backtrace for All Threads Shell thread apply all bt 5. Using Advanced Tools Valgrind – Analysis of Memory Issue Shell valgrind --tool=memcheck ./crash_test Output: Shell Invalid write of size 4 at 0x401132: main (crash_test.c:4) Address 0x0 is not stack'd, malloc'd or (recently) free'd Confirms an invalid access in memory. ELFutils for Symbol Inspection Shell eu-readelf -a /var/dumps/core.crash_test.12345 Output: Shell Program Headers: LOAD 0x000000 0x004000 0x004000 0x1234 bytes Displays sections and symbol information in the core file. Crash Utility for Kernel Dumps Shell sudo crash /usr/lib/debug/vmlinux /var/crash/core Use for kernel-space core dumps. Generate Core Dumps for Running Processes Shell gcore <PID> 6. Debugging Specific Issues Segmentation Faults Shell info frame x/16xw $sp Check near stack pointer in memory. Heap Corruption Check for heap corruption with Valgrind or AddressSanitizer: Shell gcc -fsanitize=address -g -o crash_test crash_test.c ./crash_test Shared Libraries Mismatch Shell info shared ldd ./crash_test Check shared libraries loaded in a proper manner. 7. Best Practices for Core Dump Debugging Save Symbols Independently: Deploy stripped binaries for production and store debug symbols in a secure manner.Automate Dump: Employ systemd-coredump for efficient dump management.Analyze Logs: Enable full application logs for tracking run-time faults.Redact Sensitive Information: Remove sensitive information in shared core dumps.Test in Debugging: Employ debug build with full symbols for in-depth debugging. 8. Conclusion Debugging a core dump in Linux is a logical progression and utilizes tools such as GDB, Valgrind, and Crash Utility. By careful examination of backtraces, state of memory, and register values, developers can pinpoint root causes and remedy them in no time. Best practices will yield more efficient diagnostics and rapid resolution for production-critical environments.
Abhishek Gupta
Principal PM, Azure Cosmos DB,
Microsoft
Yitaek Hwang
Software Engineer,
NYDIG