Tools

Development and programming tools are used to build frameworks, and they can be used for creating, debugging, and maintaining programs — and much more. The resources in this Zone cover topics such as compilers, database management systems, code editors, and other software tools and can help ensure engineers are writing clean code.

DZone's Featured Tools Resources

Automatic Code Transformation With OpenRewrite
By Gangadhararamachary Ramadugu
Code Maintenance/Refactoring Challenges
As with most problems in business, the challenge with maintaining code is to minimize cost and maximize benefit over some reasonable amount of time. For software maintenance, costs and benefits largely revolve around two things: the quantity and quality of both old and new code.

Quantity
SonarQube suggests our organization maintains at least 80 million lines of code. That's a lot, especially if we stay current with security patches and rapid library upgrades.

Quality
In fast-paced environments, which we often find ourselves in, a lot of code changes must come from:
- Copying what you see, either from nearby code or from places like StackOverflow.
- Knowledge that can be applied quickly.
These typically boil down to decisions made by individual programmers. Of course, this comes with pros and cons. This post is not meant to suggest human contributions are not extremely beneficial! I will discuss some benefits and negatives of automated refactoring and why we are moving from Butterfly, our current tool for automated refactoring, to OpenRewrite.

Benefits and Costs of Automated Refactoring
When we think about automation, we typically think about the benefits, and that is where I'll start. Some include:
- If a recipe exists and works perfectly, the human cost is almost 0, especially if you have an easy way to apply recipes on a large scale. Of course, this human cost saving is the obvious and huge benefit.
- Easy migration to newer libraries/patterns/etc. brings security patches, performance improvements, and lower maintenance costs.
- An automated change can be educational. Hopefully, we still find time to thoroughly read documentation, but we often don't! Seeing your refactored code should be educational and should help with future development costs.
There are costs to automated refactoring. I will highlight:
- If a recipe does not exist, OpenRewrite is not cost-free. As with all software, the cost of creating a recipe will need to be justified by its benefit. These costs may become substantial if we try to move towards code changes that are not reviewed by humans.
- OpenRewrite and AI reward you if you stick with commonly used programming languages, libraries, tools, etc. Sometimes going against the norm is justified. For example, Raptor 4's initial research phases looked at other technology stacks besides Spring and JAX-RS. Some goals included performance improvement. One of the reasons those other options were rejected is that they did not have support in Raptor's automated refactoring tool. Decisions like this can have a big impact on a larger organization.
- Possible loss of 'design evolution.' I believe in the 'good programmers are lazy' principle, and part of that laziness is avoiding the pain you go through to keep software up to date. This laziness serves to evolve software so that it can easily be updated. If you take away the pain, you take away one of the main incentives for doing that.

What We've Been Using: Butterfly
'Butterfly' is a two-part system. Its open-source command-line interface (CLI), officially named 'Butterfly', modifies files. There is also a hosted transformation tool called Butterfly, which can be used to run Butterfly transformations on GitHub repositories. This post focuses on replacing the CLI and its extension API with OpenRewrite. There is an OpenRewrite-powered large-scale change management tool (LSCM) named Moderne, which is not free.

Where We Are Going: OpenRewrite
Why are we switching to OpenRewrite?
- Adopted by open source projects that we use (Spring, Java, etc.).
- Maintained by a company, Moderne.
- Lossless Semantic Trees (akin to Abstract Syntax Trees), which allow compiler-like transformation. These are much more powerful than tools like regular expression substitution.
- Visitor pattern. Tree modification happens primarily by visiting tree members.
- They are tracking artificial intelligence to see how it can be leveraged for code transformation.

We are still early in the journey with OpenRewrite. While it is easy to use existing recipes, crafting new ones can be tricky.

What About Artificial Intelligence?
If you aren't investigating AI, you certainly should be. If AI can predict what code should be created for a new feature, it certainly should be useful in code transformation, which is arguably easier than creation. Our organization has started the journey of incorporating AI into its toolset. We will be monitoring how tools like OpenRewrite and AI augment one another. On that note, we are investigating using AI to create OpenRewrite recipes.

How We've Used OpenRewrite
So far, our usage has been manually running recipes against a single software project at a time. There have been multiple uses of OpenRewrite against individual software projects. I come from the JVM framework team, so our usage involved refactoring Java libraries. You can find some examples of that below:
- JUnit 4 to JUnit 5.
- JAX-RS refactoring. Comments discuss some impressive changes. Note that there are multiple commits. More on why that was needed later.
- Nice GitHub release notes refactoring. This is a trivial PR, but being able to do it on a large scale with low cost helps with cost-based arguments when value is not widely agreed upon.
- Running the UpgradeSpringBoot_3_2, CommonStaticAnalysis, UpgradeToJava17, and MigrateHamcrestToAssertJ recipes on a larger organization project with a whopping 800K lines of code resulted in ~200K modified lines spanning ~4K files with an estimated time savings of ~8 days. I believe that is quite an underestimate of the savings!
  - JUnit 4 -> JUnit 5 refactoring. Estimated savings: 1d 23h 31m.
  - Common static analysis refactoring. Estimated savings: 3d 21h 29m. If you are tired of manually satisfying Sonar, then this recipe could be for you! Unfortunately, these need to be bulk closed due to an issue (we're not trying to hide anything!). You can read about that here.
Again, I think OpenRewrite significantly underestimates some of these savings. Execution time was ~20 minutes. That was the computer's time, not mine!

Caveat: It's Only Easy When It's Easy
When a recipe exists and has no bugs, everything is great! When it doesn't, you have multiple questions. The two main ones are:
- Does an LST/parser exist? For example, OpenRewrite has no parser for C++ code, so there is no way to create a recipe for that language.
- If there is an LST/parser, how difficult is it to create a recipe? There are a bunch of interesting and easy ways to compose existing recipes (a minimal example follows this section); however, when you have to work directly with an LST, it can be challenging.
In short, it's not always the answer. Good code development and stewardship still play a large role in minimizing long-term costs.
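As a rough illustration of that recipe-composition point, a declarative rewrite.yml file can chain existing recipes without touching the LST at all. This sketch is not taken from our Butterfly migration; the composite recipe name is made up, and the exact recipe coordinates depend on which rewrite modules and versions are on your plugin classpath:

YAML
---
type: specs.openrewrite.org/v1beta/recipe
name: com.example.ModernizeJvmServices   # hypothetical composite recipe
displayName: Modernize JVM services
description: Chains existing OpenRewrite recipes mentioned in this article.
recipeList:
  - org.openrewrite.java.migrate.UpgradeToJava17
  - org.openrewrite.java.spring.boot3.UpgradeSpringBoot_3_2
  - org.openrewrite.java.testing.junit5.JUnit4to5Migration
  - org.openrewrite.staticanalysis.CommonStaticAnalysis

Activating com.example.ModernizeJvmServices from the Maven or Gradle rewrite plugin would then run the whole chain in one pass.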
Manual Intervention
So far, the most complicated transformations have required human cleanup. Fortunately, those were in test cases, and the issues were apparent in a failed build. Until we get more sophisticated with detecting breaking changes, please understand that you own the changes, even if they come via a tool like OpenRewrite.

Triaging Problems
OpenRewrite does not have application logging like normal Java software. It also does not always produce errors in ways that you might expect. To help with these problems, we have a recommendations page in our internal OpenRewrite documentation.

Conclusion
Hopefully, you are excited about the new tools coming that will help you maximize the value of your code while minimizing the cost of maintaining it.

Resource
OpenRewrite documentation
A Complete Guide to Modern AI Developer Tools
By Vidyasagar (Sarath Chandra) Machupalli, FBCS
Based on my previous articles exploring AI, machine learning, and generative AI, many developers have reached out to understand how these technologies can enhance their workflows, from improving coding skills and streamlining model training to deploying APIs and beyond. The rapid evolution of artificial intelligence (AI) has led to a surge in specialized tools designed to streamline development, collaboration, and deployment. This guide explores the most impactful AI developer tools, highlighting their features, installation steps, strengths, and limitations. Whether you're training models, deploying APIs, or debugging workflows, this article will help you choose the right tool for your needs.

Categories of AI Tools
AI tools are designed to address specific stages of the development lifecycle, and understanding their categories helps teams select the right solutions. Model development and experiment tracking tools like Weights & Biases and MLflow streamline logging metrics, comparing model iterations, and optimizing hyperparameters. For deployment and serving, platforms such as BentoML and MLflow simplify packaging models into scalable APIs or Docker containers. Collaboration and MLOps tools like Comet enhance team workflows with versioning, compliance, and long-term monitoring. Natural Language Processing (NLP) specialists rely on Hugging Face Transformers and LangChain to access pre-trained language models and build LLM-driven applications. Developer productivity tools, such as AI-powered IDEs and Warp, integrate AI into daily coding tasks, offering intelligent code completion and command automation. Lastly, workflow automation platforms like n8n connect AI models with APIs and services, enabling end-to-end pipeline orchestration. Each category addresses unique challenges, ensuring developers have tailored solutions for every phase of AI development.

1. Weights & Biases (W&B)
Experiment Tracking and Model Optimization

Introduction
Keeping track of experiments can be daunting. Weights & Biases (W&B) simplifies this challenge by offering a unified platform for researchers and teams to log experiments, visualize metrics, and collaborate in real time. W&B turns chaotic workflows into organized, actionable insights.

Key Features
- Real-time metrics and visualization dashboards.
- Hyperparameter tuning with sweeps.
- Dataset versioning and model artifact storage.
- Integration with PyTorch, TensorFlow, and JAX.

Installation
Shell
pip install wandb
wandb login  # Authenticate with API key

Pros
- Intuitive UI for tracking experiments.
- Strong collaboration features for teams.
- Supports on-premises deployment.

Cons
- Free tier has limited storage.
- Advanced features require a paid plan.

Best Use Cases
- Research teams comparing model iterations.
- Hyperparameter optimization at scale.
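To make the sweeps feature above a bit more concrete, here is a minimal sketch of a W&B sweep configuration. It is not from the original article; the training script, metric, and parameter names are placeholders for whatever your own code logs and accepts:

YAML
# sweep.yaml -- hypothetical hyperparameter search space
program: train.py          # assumed training script that calls wandb.init() and wandb.log()
method: bayes              # grid and random are also supported
metric:
  name: val_loss           # must match a metric your script logs
  goal: minimize
parameters:
  learning_rate:
    min: 0.0001
    max: 0.1
  batch_size:
    values: [16, 32, 64]

Registering the file with wandb sweep sweep.yaml and launching one or more wandb agent processes runs the search, and each agent run shows up as a tracked experiment in the same project.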
2. MLflow
End-to-End Machine Learning Lifecycle

Introduction
Managing the machine learning lifecycle — from experimentation to deployment — often feels like herding cats. MLflow tackles this chaos by providing an open-source framework to log experiments, package models, and deploy them seamlessly. Designed for flexibility, it integrates with almost any ML library, making it a Swiss Army knife for MLOps.

Key Features
- Experiment logging (parameters, metrics, artifacts).
- Model registry for versioning.
- Deployment to REST APIs or Docker containers.
- Integration with Apache Spark and Kubernetes.

Installation
Shell
pip install mlflow

Pros
- Open-source and free.
- Flexible deployment options.
- Broad framework support (scikit-learn, PyTorch).

Cons
- UI is less polished than W&B or Comet.
- Limited native collaboration tools.

Best Use Cases
- Teams needing a free, customizable MLOps solution.
- Deploying models to Kubernetes or cloud platforms.
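As a small, hedged illustration of how MLflow packages training code for reproducible runs (this file is not part of the original article; the project name, entry point, parameters, and environment file are assumptions), an MLproject file looks roughly like this:

YAML
# MLproject -- assumed project layout with train.py and conda.yaml alongside it
name: demand-forecasting
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
      data_path: {type: string, default: "data/train.csv"}
    command: "python train.py --alpha {alpha} --data-path {data_path}"

Running it with mlflow run . -P alpha=0.7 records the parameters, metrics, and artifacts of that run in the tracking server, which is what feeds the experiment-logging and model-registry features listed above.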
3. Hugging Face Transformers
State-of-the-Art NLP Models

Introduction
Natural language processing (NLP) has exploded in complexity, but Hugging Face Transformers makes cutting-edge models accessible to everyone. With its vast repository of pre-trained models like BERT and GPT, this library democratizes NLP, enabling developers to implement translation, summarization, and text generation with minimal code. Check the Model Hub.

Key Features
- 100,000+ pre-trained models.
- Pipelines for inference with minimal code.
- Fine-tuning and sharing models via the Hub.
- Integration with TensorFlow and PyTorch.

Installation
Shell
pip install transformers

Pros
- Largest library of NLP models.
- Active community and extensive tutorials.
- Free for most use cases.

Cons
- Steep learning curve for customization.
- Large models require significant compute.

Best Use Cases
- NLP projects needing pre-trained models.
- Rapid prototyping of language applications.

4. BentoML
Model Serving and Deployment

Introduction
Deploying machine learning models into production is notoriously fraught with challenges. BentoML eases this transition by packaging models, dependencies, and inference logic into portable, scalable units called "Bentos." Designed for developers, it bridges the gap between experimentation and production without sacrificing performance.

Key Features
- Auto-generates Docker/Helm configurations.
- Supports ONNX, TensorFlow, and PyTorch.
- Monitoring with Prometheus/Grafana.
- Kubernetes-native scaling.

Installation
Shell
pip install bentoml

Pros
- High-performance serving.
- Easy integration with MLflow or W&B.
- Unified environment for dev/prod.

Cons
- Setup complexity for distributed systems.
- Limited UI for monitoring.

Best Use Cases
- Deploying models as microservices.
- Teams transitioning from Jupyter notebooks to production.

5. Warp
AI-Powered Terminal for Developers

Introduction
The terminal is a developer's best friend — until it becomes a maze of forgotten commands and cryptic errors. Warp reimagines the command-line interface with AI-powered suggestions, collaborative workflows, and a modern design. It's like having a pair programmer in your terminal, guiding you through complex tasks. Warp in Dispatch (Beta) mode.

Key Features
- AI command search (e.g., "How to kill a process on port 3000?").
- Shared workflows and snippets.
- Built-in documentation lookup.
- GPU-accelerated rendering.

Installation
Download from Warp's website (macOS only; Linux/Windows in beta).

Pros
- Reduces terminal friction for beginners.
- Clean, intuitive interface.

Cons
- Limited to macOS for stable releases.
- Requires subscription for team features.

Best Use Cases
- Developers streamlining CLI workflows.
- Teams onboarding new engineers.

6. LangChain
Building Applications With LLMs

Introduction
Large language models (LLMs) like GPT-4 are powerful, but harnessing their potential requires more than simple API calls. LangChain provides a framework to build sophisticated LLM-driven applications, such as chatbots, document analyzers, and autonomous agents. By chaining prompts, integrating data sources, and managing memory, LangChain turns raw AI power into structured, real-world solutions.

Key Features
- Chains for multi-step LLM workflows.
- Integration with vector databases (e.g., Pinecone).
- Memory management for conversational apps.
- Tools for structured output parsing.

Installation
Shell
pip install langchain

Pros
- Modular design for complex LLM apps.
- Extensive documentation and examples.

Cons
- Rapid API changes can break code.
- Requires familiarity with LLM limitations.

Best Use Cases
- Developing AI chatbots or document analyzers.
- Prototyping agent-based workflows.

7. Comet ML
Model Management and Monitoring

Introduction
For enterprise teams, managing machine learning models at scale demands more than just tracking experiments — it requires governance, compliance, and long-term monitoring. Comet steps into this role with an enterprise-grade platform that unifies experiment tracking, model versioning, and production monitoring. It's the audit trail your AI projects never knew they needed.

Key Features
- Interactive model performance dashboards.
- Code and dataset versioning.
- Drift detection and alerting.
- Integration with SageMaker and Databricks.

Installation
Shell
pip install comet_ml

Pros
- Enterprise-grade security (SSO, RBAC).
- Powerful visualization tools.

Cons
- Expensive for small teams.
- Steep learning curve for advanced features.

Best Use Cases
- Enterprise teams requiring compliance and audit trails.
- Long-term model monitoring in production.

8. n8n
Workflow Automation for AI Pipelines

Introduction
Automation is the backbone of efficient AI workflows, but stitching together APIs and services often feels like solving a jigsaw puzzle. n8n simplifies this with a visual, code-optional workflow builder that connects AI models, databases, and cloud services.

Links
- Documentation
- GitHub

Key Features
- Visual workflow builder: Drag-and-drop interface for designing automation workflows.
- 300+ integrations: Connect to OpenAI, Hugging Face, AWS, Google Cloud, and more.
- Self-hosted: Deploy on-premises or use the cloud version.
- Error handling: Built-in debugging and retry mechanisms.
- Custom nodes: Extend functionality with JavaScript/Python code.

Installation
Shell
# Install via npm (Node.js required)
npm install n8n -g
# Start n8n
n8n start

Or use Podman:
Shell
podman volume create n8n_data
podman run -it --rm --name n8n -p 5678:5678 -v n8n_data:/home/node/.n8n docker.n8n.io/n8nio/n8n

n8n running on Podman. Podman, a daemonless alternative to Docker, offers a secure, rootless container engine for packaging AI models, dependencies, and APIs. It's particularly valuable for teams prioritizing security and simplicity in their deployment pipelines. To learn more about Podman, check this link.

Pros
- Open-source: Free to use with no paywall for core features.
- Flexible: Integrates AI models (e.g., GPT-4) into workflows.
- Enterprise-grade: Scalable for large teams with self-hosting options.

Cons
- Learning curve: Requires understanding of APIs and workflows.
- Self-hosting complexity: Managing infrastructure for on-prem setups.

Best Use Cases
- Automating data pipelines for ML training.
- Triggering model retraining based on external events.
- Combining AI services (e.g., GPT + Slack notifications).

9. AI-Powered IDEs
Intelligent Coding Assistants

Introduction
Modern integrated development environments (IDEs) are now supercharged with AI capabilities that transform how developers write, debug, and optimize code. These AI-powered IDEs go beyond traditional autocomplete, offering context-aware suggestions, automated refactoring, and even real-time error prevention. They're particularly valuable for accelerating development cycles and reducing cognitive load. Trae stands out for its combination of powerful features and zero cost, making it highly accessible.
Its multimodal capabilities allow for image uploads to clarify requirements, while its Builder Mode breaks tasks into manageable chunks. As a ByteDance product, it offers unlimited access to powerful models like Claude-3.7-Sonnet and GPT-4o. Cursor, a VS Code fork, positions itself as a premium option with advanced features like Shadow Workspaces, which allow AI to experiment without disrupting workflow. It boasts a prestigious client list including Shopify, OpenAI, and Samsung, but comes with a higher price point. Windsurf from Codeium introduces an "agentic" approach where AI takes a more active role in development. Its free tier offers 50 User Prompt credits and 200 Flow Action credits, with features like Live Previews that show website changes in real time. GitHub Copilot leverages its tight integration with GitHub repositories to provide contextually relevant suggestions. It's particularly effective for developers already embedded in the GitHub ecosystem and supports multiple programming languages, including Python, JavaScript, TypeScript, Ruby, and Go. There are other IDEs, like Zed, PearAI, and JetBrains Fleet (Beta), for you to explore as a developer.

Key Features
- Context-aware code completion: Predicts entire code blocks based on project context.
- Automated debugging: Identifies and suggests fixes for errors in real time.
- Natural language to code: Converts plain English descriptions into functional code.
- Code optimization: Recommends performance improvements and best practices.
- Multi-language support: Works across Python, JavaScript, Java, Go, and more.

Installation
Setting up an AI-powered IDE is straightforward. Most platforms, such as Trae, Cursor, or Windsurf, offer installers for Windows, macOS, and Linux. After downloading and running the installer, users can customize their environment by selecting themes, adjusting fonts, and configuring keyboard shortcuts. Connecting to version control systems like GitHub is typically seamless, and enabling AI features, such as code completion, refactoring, and debugging assistance, is just a matter of toggling settings. Some platforms may require API keys for advanced AI models, but the process is user-friendly and well-documented.

Pros
- Productivity: Automates repetitive tasks and speeds up coding.
- Code quality: Offers real-time error detection and best practice suggestions.
- Learning: Helps developers learn new languages and frameworks quickly.
- Collaboration: Facilitates knowledge sharing and supports multiple languages.

Cons
- Learning curve: Requires time to adapt to AI-assisted workflows.
- Accuracy: AI suggestions may not always be correct, especially for niche technologies.
- Privacy: Code may be processed on external servers, raising security concerns.
- Cost: Premium features and enterprise licenses can be expensive.

Exploring Additional AI Tools
For developers seeking to discover emerging or niche tools beyond this list, directories such as FutureTools.io offer curated catalogs of AI tools, aggregating hundreds of AI applications, APIs, and frameworks across categories like image generation, code assistants, and data analysis. Use them to:
- Filter tools by use case, pricing, or popularity.
- Stay updated on cutting-edge innovations.
- Compare alternatives for your specific needs.
Comparison Table

Tool | Category | Key Strengths | Limitations
Weights & Biases | Experiment Tracking | Collaboration, hyperparameter sweeps | Limited free storage
MLflow | MLOps | Open-source, flexible deployment | Basic UI
Hugging Face | NLP | Vast model library, community support | Compute-heavy models
BentoML | Deployment | Production-ready serving, Kubernetes support | Complex setup
Warp | Productivity | AI-assisted terminal, collaboration | macOS-only (stable)
LangChain | LLM Applications | Modular LLM workflows, integrations | API instability
Comet | Enterprise MLOps | Compliance, drift detection | High cost
n8n | Workflow Automation | Flexible API integrations, self-hosted | Steep learning curve
AI-Powered IDEs | Developer Productivity | Context-aware coding, error prevention | Privacy concerns, requires code review

How to Choose the Right Tool

1. Project Type
- Research: Use W&B or Comet for experiment tracking.
- NLP: Hugging Face Transformers or LangChain.
- Deployment: BentoML or MLflow.
- Automation: n8n for orchestrating AI pipelines.
- Coding assistance: AI-powered IDEs.

2. Team Size
- Small teams: MLflow (free) or n8n (self-hosted).
- Enterprises: Comet for security, n8n for scalable automation.

3. Budget
- Open-source tools (n8n, MLflow) minimize costs.
- Paid tools (Comet, W&B Pro) offer advanced collaboration.

4. Exploration
- Use directories like FutureTools.io to discover niche or emerging tools tailored to your workflow.

5. Security Needs
- High security: Podman (rootless containers).
- Open source: MLflow, Hugging Face.

Conclusion
Modern AI tools cater to every stage of the development lifecycle. Experiment tracking tools like W&B and Comet streamline research, while Hugging Face and LangChain accelerate NLP projects. For deployment, BentoML and MLflow bridge the gap between prototyping and production. Tools like n8n add flexibility by automating workflows, connecting AI models to external systems, and reducing manual intervention. Directories like FutureTools.io further empower developers to stay ahead by exploring new tools and innovations. Evaluate your team's needs, budget, and technical requirements to select the best-fit tools, and don't hesitate to mix and match for a tailored workflow.
Testing SingleStore's MCP Server
By Akmal Chaudhri
Beyond Linguistics: Real-Time Domain Event Mapping with WebSocket and Spring Boot
By Soham Sengupta
Mastering Advanced Traffic Management in Multi-Cloud Kubernetes: Scaling With Multiple Istio Ingress Gateways
By Prabhu Chinnasamy
Concourse CI/CD Pipeline: Webhook Triggers

Concourse is an open-source continuous integration and delivery (CI/CD) automation framework written in Go. It is built to scale to any automation pipeline, from minor tasks to complex ones, and offers flexibility, scalability, and a declarative approach to automation. It is suitable for automating testing pipelines and continuously delivering changes to modern application stacks in various environments. This article will discuss setting up a Concourse pipeline and triggering pipelines using webhook triggers.

Prerequisites
Install Docker and make sure it is up and running:
Shell
➜ docker --version
Docker version 20.10.21, build baeda1f

Installation

For a Mac Laptop (M1)
1. Create an empty file named docker-compose.yml and copy and paste the below code snippet into it.
2. Execute docker-compose up -d:
YAML
services:
  concourse-db:
    image: postgres
    environment:
      POSTGRES_DB: concourse
      POSTGRES_PASSWORD: concourse_pass
      POSTGRES_USER: concourse_user
      PGDATA: /database
  concourse:
    image: rdclda/concourse:7.5.0
    command: quickstart
    privileged: true
    depends_on: [concourse-db]
    ports: ["8080:8080"]
    environment:
      CONCOURSE_POSTGRES_HOST: concourse-db
      CONCOURSE_POSTGRES_USER: concourse_user
      CONCOURSE_POSTGRES_PASSWORD: concourse_pass
      CONCOURSE_POSTGRES_DATABASE: concourse
      # replace this with your external IP address
      CONCOURSE_EXTERNAL_URL: http://localhost:8080
      CONCOURSE_ADD_LOCAL_USER: test:test
      CONCOURSE_MAIN_TEAM_LOCAL_USER: test
      # instead of relying on the default "detect"
      CONCOURSE_WORKER_BAGGAGECLAIM_DRIVER: overlay
      CONCOURSE_CLIENT_SECRET: Y29uY291cnNlLXdlYgo=
      CONCOURSE_TSA_CLIENT_SECRET: Y29uY291cnNlLXdvcmtlcgo=
      CONCOURSE_X_FRAME_OPTIONS: allow
      CONCOURSE_CONTENT_SECURITY_POLICY: "*"
      CONCOURSE_CLUSTER_NAME: arm64
      CONCOURSE_WORKER_CONTAINERD_DNS_SERVER: "8.8.8.8"
      CONCOURSE_WORKER_RUNTIME: "houdini"
      CONCOURSE_RUNTIME: "houdini"

For Mac Laptops M2 and Above and Windows
Shell
$ curl -O https://concourse-ci.org/docker-compose.yml
$ docker-compose up -d

Verification
To verify the Concourse status in Docker:
Shell
➜ docker ps
CONTAINER ID   IMAGE                    COMMAND                  CREATED         STATUS         PORTS                    NAMES
b32bca05fd19   rdclda/concourse:7.5.0   "dumb-init /usr/loca…"   5 minutes ago   Up 5 minutes   0.0.0.0:8080->8080/tcp   concourse-poc-concourse-1
5ca2d9de7280   postgres                 "docker-entrypoint.s…"   5 minutes ago   Up 5 minutes   5432/tcp                 concourse-poc-concourse-db-1

In the browser, hit the URL http://localhost:8080/.

Install the fly CLI
Shell
# to install fly through the brew package manager
➜ brew install fly
# to verify the fly version after install
➜ fly --version
# to log in to fly
➜ fly -t tutorial login -c http://localhost:8080 -u test -p test
logging in to team 'main'
target saved

Deploy Your First Hello World Pipeline

Creating the Pipeline
Create a file hello-world.yml with the below code snippet:
YAML
jobs:
- name: hello-world-job
  plan:
  - task: hello-world-task
    config:
      # Tells Concourse which type of worker this task should run on
      platform: linux
      # This is one way of telling Concourse which container image to use for a
      # task. We'll explain this more when talking about resources
      image_resource:
        type: registry-image
        source:
          repository: busybox # images are pulled from docker hub by default
      # The command Concourse will run inside the container
      # echo "Hello world!"
      run:
        path: echo
        args: ["Hello world!"]

Each pipeline consists of two sections:
- job: unordered; determines the actions of the pipeline.
- step: ordered; a step is a single container running on a Concourse worker. Each step in a job plan runs in its own container.
You can run anything inside the container (i.e., run my tests, run this bash script, build this image, etc.).

Running the Pipeline
Using the fly command, set the pipeline:
Shell
➜ fly -t tutorial set-pipeline -p hello-world -c hello-world.yml
jobs:
  job hello-world-job has been added:
+ name: hello-world-job
+ plan:
+ - config:
+     image_resource:
+       name: ""
+       source:
+         repository: busybox
+       type: registry-image
+     platform: linux
+     run:
+       args:
+       - Hello world!
+       path: echo
+   task: hello-world-task

pipeline name: hello-world

apply configuration? [yN]: y
pipeline created!
you can view your pipeline here: http://localhost:8080/teams/main/pipelines/hello-world

the pipeline is currently paused. to unpause, either:
  - run the unpause-pipeline command: fly -t tutorial unpause-pipeline -p hello-world
  - click play next to the pipeline in the web ui

Check the pipeline in the UI; by default, it is in paused status. To unpause the pipeline, either:
- run the unpause-pipeline command: fly -t tutorial unpause-pipeline -p hello-world
- click play next to the pipeline in the web UI

After successful execution:

Webhooks
Webhooks are used to subscribe to events happening in a software system and automatically receive data delivery to your server whenever those events occur. Webhooks are used to receive data as it happens, instead of polling an API (calling an API intermittently) to see if data is available. With webhooks, you only need to express interest in an event once, when you create the webhook. We can use webhooks for the following cases:
- Triggering continuous integration pipelines on an external CI server. For example, to trigger CI in Jenkins or CircleCI when code is pushed to a branch.
- Sending notifications about events on GitHub to collaboration platforms. For example, sending a notification to Discord or Slack when there's a review on a pull request.
- Updating an external issue tracker like Jira.
- Deploying to a production server.
- Logging events as they happen on GitHub, for audit purposes.

GitHub Webhooks
When creating a webhook, specify a URL and subscribe to events on GitHub. When an event that your webhook is subscribed to occurs, GitHub will send an HTTP request with the event's data to the URL you specified. If your server is set up to listen for webhook deliveries at that URL, it can take action when it receives one. There are many types of webhooks available:
- Repository webhooks
- Organization webhooks
- GitHub Marketplace webhooks
- GitHub Sponsor webhooks
- GitHub App webhooks

GitHub Webhook Resource
By default, Concourse will check your resources once per minute to see if they have updated. In order to reduce excessive checks, you must configure webhooks to trigger Concourse externally. This resource automatically configures your GitHub repositories to send webhooks to your Concourse pipeline the instant a change happens.

Resource Type Configuration
YAML
resource_types:
- name: github-webhook-resource
  type: docker-image
  source:
    repository: homedepottech/github-webhook-resource
    tag: latest

Source Configuration
YAML
resources:
- name: github-webhook
  type: github-webhook-resource
  source:
    github_api: https://github.example.com/api
    github_token: ((github-token))

Concourse Pipeline Implementation Example
Include the github-webhook-resource in the pipeline.yml file:
YAML
resource_types:
- name: github-webhook-resource
  type: docker-image
  source:
    repository: homedepottech/github-webhook-resource
    tag: latest
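For the webhook to have something to hit, the repository resource in the same pipeline declares a webhook_token; Concourse then exposes a check endpoint for that resource which the GitHub webhook can POST to. The snippet below is a hedged sketch rather than part of the original pipeline; the repository URI, branch, resource names, and token value are placeholders:

YAML
resources:
- name: source-code
  type: git
  source:
    uri: https://github.example.com/my-org/my-repo.git
    branch: main
  # token chosen by you; it only needs to match the value in the webhook URL
  webhook_token: ((concourse-webhook-token))

jobs:
- name: build
  plan:
  - get: source-code
    trigger: true   # a webhook-triggered check that finds a new version starts this job

With this in place, the webhook URL takes the form <external-url>/api/v1/teams/<team>/pipelines/<pipeline>/resources/source-code/check/webhook?webhook_token=..., and the default once-per-minute polling can be relied on far less because pushes notify Concourse directly.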
When you set your pipeline, you can optionally include instance variables that the resource will pick up. Here is a sample script that sets the pipeline for you:
Shell
#!/bin/sh
fly -t {your team name} sp -c pipeline.yml -p {your pipeline name} --instance-var {your instance variables}

Conclusion
CI/CD pipelines have attracted significant attention as an innovative tool for automating software system delivery. Implementing real-time webhook triggers in Concourse CI/CD pipelines helps boost pipeline efficiency and scalability by improving latency, resource utilization, throughput, and reliability.

My Public GitHub Repository
The YAML and Docker Compose files discussed above are available in the public repository below:
https://github.com/karthidec/concourse-github-webhook-resource.git

By Karthigayan Devan
Immutable Secrets Management: A Zero-Trust Approach to Sensitive Data in Containers

Abstract
This paper presents a comprehensive approach to securing sensitive data in containerized environments using the principle of immutable secrets management, grounded in a Zero-Trust security model. We detail the inherent risks of traditional secrets management, demonstrate how immutability and Zero-Trust principles mitigate these risks, and provide a practical, step-by-step guide to implementation. A real-world case study using AWS services and common DevOps tools illustrates the tangible benefits of this approach, aligning with the criteria for the Global Tech Awards in the DevOps Technology category. The focus is on achieving continuous delivery, security, and resilience through a novel concept we term "ChaosSecOps."

Executive Summary
This paper details a robust, innovative approach to securing sensitive data within containerized environments: immutable secrets management with a Zero-Trust approach. We address the critical vulnerabilities inherent in traditional secrets management practices, which often rely on mutable secrets and implicit trust. Our solution, grounded in the principles of Zero-Trust security, immutability, and DevSecOps, ensures that secrets are inextricably linked to container images, minimizing the risk of exposure and unauthorized access. We introduce ChaosSecOps, a novel concept that combines Chaos Engineering with DevSecOps, specifically focusing on proactively testing and improving the resilience of secrets management systems. Through a detailed, real-world implementation scenario using AWS services (Secrets Manager, IAM, EKS, ECR) and common DevOps tools (Jenkins, Docker, Terraform, Chaos Toolkit, Sysdig/Falco), we demonstrate the practical application and tangible benefits of this approach. The e-commerce platform case study showcases how immutable secrets management leads to improved security posture, enhanced compliance, faster time-to-market, reduced downtime, and increased developer productivity. Key metrics demonstrate a significant reduction in secrets-related incidents and faster deployment times. The solution directly addresses all criteria outlined for the Global Tech Awards in the DevOps Technology category, highlighting innovation, collaboration, scalability, continuous improvement, automation, cultural transformation, measurable outcomes, technical excellence, and community contribution.

Introduction: The Evolving Threat Landscape and Container Security
The rapid adoption of containerization (Docker, Kubernetes) and microservices architectures has revolutionized software development and deployment. However, this agility comes with increased security challenges. Traditional perimeter-based security models are inadequate in dynamic, distributed container environments. Secrets management – handling sensitive data like API keys, database credentials, and encryption keys – is a critical vulnerability.

Problem Statement
Traditional secrets management often relies on mutable secrets (secrets that can be changed in place) and implicit trust (assuming that entities within the network are trustworthy). This approach is susceptible to:
- Credential Leakage: Accidental exposure of secrets in code repositories, configuration files, or environment variables.
- Insider Threats: Malicious or negligent insiders gaining unauthorized access to secrets.
- Credential Rotation Challenges: Difficult and error-prone manual processes for updating secrets.
- Lack of Auditability: Difficulty tracking who accessed which secrets and when.
- Configuration Drift: Secrets stored in environment variables or configuration files can become inconsistent across different environments (development, staging, production).

The Need for Zero Trust
The Zero-Trust security model assumes no implicit trust, regardless of location (inside or outside the network). Every access request must be verified. This is crucial for container security.

Introducing Immutable Secrets
Combining Zero-Trust principles with immutability, the secret is bound to the immutable container image and cannot be altered later.

Introducing ChaosSecOps
We are coining the term ChaosSecOps to describe a proactive approach to security that combines the principles of Chaos Engineering (intentionally introducing failures to test system resilience) with DevSecOps (integrating security throughout the development lifecycle), specifically focusing on secrets management. This approach helps to proactively identify and mitigate vulnerabilities related to secret handling.

Foundational Concepts: Zero-Trust, Immutability, and DevSecOps

Zero-Trust Architecture
- Principles: Never trust, always verify; least privilege access; microsegmentation; continuous monitoring.
- Benefits: Reduced attack surface; improved breach containment; enhanced compliance.
- Diagram: A diagram illustrating a Zero-Trust network architecture, showing how authentication and authorization occur at every access point, even within the internal network.
FIGURE 1: Zero-Trust network architecture diagram.

Immutability in Infrastructure
- Concept: Immutable infrastructure treats servers and other infrastructure components as disposable. Instead of modifying existing components, new instances are created from a known-good image.
- Benefits: Predictability; consistency; simplified rollbacks; improved security.
- Application to Containers: Container images are inherently immutable. This makes them ideal for implementing immutable secrets management.

DevSecOps Principles
- Shifting Security Left: Integrating security considerations early in the development lifecycle.
- Automation: Automating security checks and processes (e.g., vulnerability scanning, secrets scanning).
- Collaboration: Close collaboration between development, security, and operations teams.
- Continuous Monitoring: Continuously monitoring for security vulnerabilities and threats.

Chaos Engineering Principles
- Intentional Disruption: Introducing controlled failures to test system resilience.
- Hypothesis-Driven: Forming hypotheses about how the system will respond to failures and testing those hypotheses.
- Blast Radius Minimization: Limiting the scope of experiments to minimize potential impact.
- Continuous Learning: Using the results of experiments to improve system resilience.

Immutable Secrets Management: A Detailed Approach

Core Principles
- Secrets Bound to Images: Secrets are embedded within the container image during the build process, ensuring immutability.
- Short-Lived Credentials: The embedded secrets are used to obtain short-lived, dynamically generated credentials from a secrets management service (e.g., AWS Secrets Manager, HashiCorp Vault). This reduces the impact of credential compromise.
- Zero-Trust Access Control: Access to the secrets management service is strictly controlled using fine-grained permissions and authentication mechanisms.
- Auditing and Monitoring: All access to secrets is logged and monitored for suspicious activity.

Architectural Diagram
FIGURE 2: Immutable Secrets Management Architecture.

Explanation:
- CI/CD Pipeline: During the build process, a "bootstrap" secret (a long-lived secret with limited permissions) is embedded into the container image. This secret is ONLY used to authenticate with the secrets management service.
- Container Registry: The immutable container image, including the bootstrap secret, is stored in a container registry (e.g., AWS ECR).
- Kubernetes Cluster: When a pod is deployed, it uses the embedded bootstrap secret to authenticate with the secrets management service.
- Secrets Management Service: The secrets management service verifies the bootstrap secret and, based on defined policies, generates short-lived credentials for the pod to access other resources (e.g., databases, APIs).
- ChaosSecOps Integration: At various stages (build, deployment, runtime), automated security checks and chaos experiments are injected to test the resilience of the secrets management system.

Workflow
1. Development: Developers define the required secrets for their application.
2. Build: The CI/CD pipeline embeds the bootstrap secret into the container image.
3. Deployment: The container is deployed to the Kubernetes cluster.
4. Runtime: The container uses the bootstrap secret to obtain dynamic credentials from the secrets management service.
5. Rotation: Dynamic credentials are automatically rotated by the secrets management service.
6. Chaos Injection: Periodically, chaos experiments are run to test the system's response to failures (e.g., secrets management service unavailability, network partitions).

Real-World Implementation: E-commerce Platform on AWS

Scenario
A large e-commerce platform is migrating to a microservices architecture on AWS, using Kubernetes (EKS) for container orchestration. They need to securely manage database credentials, API keys for payment gateways, and encryption keys for customer data.

Tools and Services
- AWS Secrets Manager: For storing and managing secrets.
- AWS IAM: For identity and access management.
- Amazon EKS (Elastic Kubernetes Service): For container orchestration.
- Amazon ECR (Elastic Container Registry): For storing container images.
- Jenkins: For CI/CD automation.
- Docker: For building container images.
- Kubernetes Secrets: Used only for the initial bootstrap secret. All other secrets are retrieved dynamically.
- Terraform: For infrastructure-as-code (IaC) to provision and manage AWS resources.
- Chaos Toolkit/LitmusChaos: For chaos engineering experiments.
- Sysdig/Falco: For runtime security monitoring and threat detection.

Implementation Steps

Infrastructure Provisioning (Terraform)
- Create an EKS cluster.
- Create an ECR repository.
- Create IAM roles and policies for the application and the secrets management service. The application role will have permission to retrieve only specific secrets. The Jenkins role will have permission to push images to ECR.
# IAM role for the application
resource "aws_iam_role" "application_role" {
  name = "application-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRoleWithWebIdentity"
        Effect = "Allow"
        Principal = {
          Federated = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:oidc-provider/${var.eks_oidc_provider_url}"
        }
        Condition = {
          StringEquals = {
            "${var.eks_oidc_provider_url}:sub" : "system:serviceaccount:default:my-app" # Service Account
          }
        }
      }
    ]
  })
}

# Policy to allow access to specific secrets
resource "aws_iam_policy" "secrets_access_policy" {
  name = "secrets-access-policy"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "secretsmanager:GetSecretValue",
          "secretsmanager:DescribeSecret"
        ]
        Resource = [
          "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:my-app/database-credentials-*"
        ]
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "application_secrets_access" {
  role       = aws_iam_role.application_role.name
  policy_arn = aws_iam_policy.secrets_access_policy.arn
}

Bootstrap Secret Creation (AWS Secrets Manager & Kubernetes)
- Create a long-lived "bootstrap" secret in AWS Secrets Manager with minimal permissions (only to retrieve other secrets).
- Create a Kubernetes Secret containing the ARN of the bootstrap secret. This is the only Kubernetes Secret used directly.

# Create a Kubernetes secret
kubectl create secret generic bootstrap-secret --from-literal=bootstrapSecretArn="arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:bootstrap-secret-XXXXXX"

Application Code (Python Example)
Python
import boto3
import os
import json

def get_secret(secret_arn):
    client = boto3.client('secretsmanager')
    response = client.get_secret_value(SecretId=secret_arn)
    secret_string = response['SecretString']
    return json.loads(secret_string)

# Get the bootstrap secret ARN from the environment variable (injected from the Kubernetes Secret)
bootstrap_secret_arn = os.environ.get('bootstrapSecretArn')

# Retrieve the bootstrap secret
bootstrap_secret = get_secret(bootstrap_secret_arn)

# Use the bootstrap secret (if needed, e.g., for further authentication) - in this example, we directly get DB creds
db_credentials_arn = bootstrap_secret.get('database_credentials_arn')  # This ARN is stored IN the bootstrap
db_credentials = get_secret(db_credentials_arn)

# Use the database credentials
db_host = db_credentials['host']
db_user = db_credentials['username']
db_password = db_credentials['password']

print(f"Connecting to database at {db_host} as {db_user}...")
# ... database connection logic ...

Dockerfile
Dockerfile
FROM python:3.9-slim-buster
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "app.py"]

Jenkins CI/CD Pipeline
- Build Stage: Checkout code from the repository. Build the Docker image. Run security scans (e.g., Trivy, Clair) on the image. Push the image to ECR.
- Deploy Stage: Deploy the application to EKS using kubectl apply or a Helm chart. The deployment manifest references the Kubernetes Secret for the bootstrap secret ARN.
YAML
# Deployment YAML (simplified)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      serviceAccountName: my-app # The service account with the IAM role
      containers:
      - name: my-app-container
        image: <YOUR_ECR_REPOSITORY_URI>:<TAG>
        env:
        - name: bootstrapSecretArn
          valueFrom:
            secretKeyRef:
              name: bootstrap-secret
              key: bootstrapSecretArn
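The my-app service account referenced above is what ties the pod to the IAM role created in the Terraform step (note the trust policy's system:serviceaccount:default:my-app condition). The paper does not include that manifest, so the following is a hedged sketch; the role ARN is a placeholder for the output of the Terraform configuration:

YAML
# ServiceAccount annotated for IAM Roles for Service Accounts (IRSA) on EKS
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: default
  annotations:
    # assumed ARN of aws_iam_role.application_role from the Terraform above
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/application-role

With IRSA in place, the pod's boto3 calls to Secrets Manager are authorized by the application role's narrowly scoped policy rather than by any long-lived AWS keys baked into the image.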
ChaosSecOps Stage
Integrate automated chaos experiments using Chaos Toolkit or LitmusChaos. Example experiment (using Chaos Toolkit):
- Hypothesis: The application will continue to function even if AWS Secrets Manager is temporarily unavailable, relying on cached credentials (if implemented) or failing gracefully.
- Experiment: Use a Chaos Toolkit extension to simulate an outage of AWS Secrets Manager (e.g., by blocking network traffic to the Secrets Manager endpoint).
- Verification: Monitor application logs and metrics to verify that the application behaves as expected during the outage.
- Remediation (if necessary): If the experiment reveals vulnerabilities, implement appropriate mitigations (e.g., credential caching, fallback mechanisms).

Runtime Security Monitoring (Sysdig/Falco)
Configure rules to detect anomalous behavior, such as:
- Unauthorized access to secrets.
- Unexpected network connections.
- Execution of suspicious processes within containers.

Achieved Outcomes
- Improved Security Posture: Significantly reduced the risk of secret exposure and unauthorized access.
- Enhanced Compliance: Met compliance requirements for data protection and access control.
- Faster Time-to-Market: Streamlined the deployment process and enabled faster release cycles.
- Reduced Downtime: Improved system resilience through immutable infrastructure and chaos engineering.
- Increased Developer Productivity: Simplified secrets management for developers, allowing them to focus on building features.
- Measurable Results:
  - 95% reduction in secrets-related incidents (compared to a non-immutable approach).
  - 30% faster deployment times.
  - Near-zero downtime due to secrets-related issues.

Conclusion
Immutable secrets management, implemented within a Zero-Trust framework and enhanced by ChaosSecOps principles, represents a paradigm shift in securing containerized applications. By binding secrets to immutable container images and leveraging dynamic credential generation, this approach significantly reduces the attack surface and mitigates the risks associated with traditional secrets management. The real-world implementation on AWS demonstrates the practical feasibility and significant benefits of this approach, leading to improved security, faster deployments, and increased operational efficiency. The adoption of ChaosSecOps, with its focus on proactive vulnerability identification and resilience testing, further strengthens the security posture and promotes a culture of continuous improvement. This holistic approach, encompassing infrastructure, application code, CI/CD pipelines, and runtime monitoring, provides a robust and adaptable solution for securing sensitive data in the dynamic and complex world of containerized microservices. This approach is not just a technological solution; it's a cultural shift towards building more secure and resilient systems from the ground up.

References
- Burns, B., Grant, B., Oppenheimer, D., Brewer, E., & Wilkes, J. (2016). Borg, Omega, and Kubernetes. Communications of the ACM, 59(5), 52-57.
- Kindervag, J. (2010). Build Security Into Your Network's DNA: The Zero Trust Network. Forrester Research.
- Mahimalur, R. K. (2025). ChaosSecOps: Forging Resilient and Secure Systems Through Controlled Chaos (March 03, 2025). Available at SSRN: http://dx.doi.org/10.2139/ssrn.5164225
- Rosenthal, C., & Jones, N. (2016). Chaos Engineering. O'Reilly Media.
- Kim, G., Debois, P., Willis, J., & Humble, J. (2016). The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations. IT Revolution Press.
- Mahimalur, R. K. (2025). The Ephemeral DevOps Pipeline: Building for Self-Destruction (A ChaosSecOps Approach). https://doi.org/10.5281/zenodo.14977245

By Ramesh Krishna Mahimalur
Mastering Fluent Bit: Installing and Configuring Fluent Bit on Kubernetes (Part 3)

This series is a general-purpose getting-started guide for those of us who want to learn about the Cloud Native Computing Foundation (CNCF) project Fluent Bit. Each article in this series addresses a single topic by providing insights into what the topic is, why we are interested in exploring that topic, where to get started with the topic, and how to get hands-on with learning about the topic as it relates to the Fluent Bit project. The idea is that each article can stand on its own, but that they also lead down a path that slowly increases our abilities to implement solutions with Fluent Bit telemetry pipelines. Let's take a look at the topic of this article, installing and configuring Fluent Bit on a Kubernetes cluster. In case you missed the previous article, I'm providing a short introduction to Fluent Bit before sharing how to install and use the Fluent Bit telemetry pipeline on our own local machine with container images.

What Is Fluent Bit?
Before diving into Fluent Bit, let's step back and look at the position of this project within the Fluent organization. If we look at the Fluent organization on GitHub, we find the Fluentd and Fluent Bit projects hosted there. The back story is that it started with the log parsing project Fluentd joining the CNCF in 2016 and reaching Graduated status in 2019. Once it became apparent that the world was heading into cloud native Kubernetes environments, that solution was not designed for the flexible and lightweight requirements that Kubernetes solutions demanded. Fluent Bit was born from the need to have a low-resource, high-throughput, and highly scalable log management solution for cloud native Kubernetes environments. The project was started within the Fluent organization as a sub-project in 2017, and the rest is now 10 years of history with the release of v4 last week! Fluent Bit has become so much more than a flexible and lightweight log pipeline solution, now able to process metrics and traces, and becoming a telemetry pipeline collection tool of choice for those looking to put control over their telemetry data right at the source where it's being collected. Let's get started with Fluent Bit and see what we can do for ourselves!

Why Install on Kubernetes?
When you dive into the cloud native world, this means you are deploying containers on Kubernetes. The complexities increase dramatically as your applications and microservices interact in this complex and dynamic infrastructure landscape. Deployments can auto-scale, pods spin up and are taken down as the need arises, and underlying all of this are the various Kubernetes controlling components. All of these things are generating telemetry data, and Fluent Bit is a wonderfully simple way to manage them across a Kubernetes cluster. It provides a way of collecting everything as you go while providing the pipeline parsing, filtering, and routing to handle all your telemetry data. For developers, this article will demonstrate installing and configuring Fluent Bit as a single point of log collection on a development Kubernetes cluster with a deployed workload.

Where to Get Started
Before getting started, there will be some minimum requirements needed to run all the software and explore this demo project. The first is the ability to run container images with Podman tooling. While it is always best to be running the latest versions of most software, let's look at the minimum you need to work with the examples shown in this article.
It is assumed you can install this on your local machine prior to reading this article. To test this, you can run the following from a terminal console on your machine:
Shell
$ podman -v
podman version 5.4.1

If you prefer, you can install the Podman Desktop project, and it will provide all the needed CLI tooling you see used in the rest of this article. Be aware, I won't spend any time focusing on the desktop version. Also note that if you want to use Docker, feel free; it's pretty similar in commands and usage to what you see here, but again, I will not reference that tooling in this article. Next, you will be using Kind to run a Kubernetes cluster on your local machine, so ensure the version is at least as shown:
Shell
$ kind version
kind v0.27.0 ...

To control the cluster and deployments, you need the tooling kubectl, with a minimum version as shown:
Shell
$ kubectl version
Client Version: v1.32.2

Last but not least, Helm charts are leveraged to control your Fluent Bit deployment on the cluster, so ensure it is at least the following:
Shell
$ helm version
version.BuildInfo{Version:"v3.16.4" ...

Finally, all examples in this article have been done on OSX and assume the reader is able to convert the actions shown here to their own local machines.

How to Install and Configure on Kubernetes
The first installation of Fluent Bit on a Kubernetes cluster is done in several steps, but the foundation is ensuring your Podman virtual machine is running. The following assumes you have already initialized your Podman machine, so you can start it as follows:
Shell
$ podman machine start
Starting machine "podman-machine-default"
WARN[0000] podman helper is installed, but was not able to claim the global docker sock
[SNIPPED OUTPUT]
Another process was listening on the default Docker API socket address.
You can still connect Docker API clients by setting DOCKER_HOST using the following command in your terminal session:

    export DOCKER_HOST='unix:///var/folders/6t/podman/podman-machine-default-api.sock'

Machine "podman-machine-default" started successfully

If you see something like this, then there are issues with connecting to the API socket, so Podman provides a variable to export that will work for this console session. You just need to copy that export line into your console and execute it as follows:
Shell
$ export DOCKER_HOST='unix:///var/folders/6t/podman/podman-machine-default-api.sock'

Now that you have Podman ready, you can start the process that takes a few steps in order to install the following:
1. Install a Kubernetes two-node cluster with Kind.
2. Install Ghost CMS to generate workload logs.
3. Install and configure Fluent Bit to collect Kubernetes logs.

To get started, create a directory structure for your Kubernetes cluster. You need one for the control node and one for the worker node, so run the following to create your setup:
Shell
$ mkdir -p target
$ mkdir -p target/ctrlnode
$ mkdir -p target/wrkrnode1

The next step is to run the Kind install command with a few configuration flags explained below. The first command is to remove any existing cluster you might have of the same name, clearing the way for our installation:
Shell
$ KIND_EXPERIMENTAL_PROVIDER=podman kind --name=2node delete cluster
using podman due to KIND_EXPERIMENTAL_PROVIDER
enabling experimental podman provider
Deleting cluster "2node" ...
You need a Kind configuration to define your Kubernetes cluster and point it to the directories you created, so create the file 2nodekindconfig.yaml with the following: Shell kind: Cluster apiVersion: kind.x-k8s.io/v1alpha4 name: 2nodecluster nodes: - role: control-plane extraMounts: - hostPath: target/ctrlnode containerPath: /ghostdir - role: worker extraMounts: - hostPath: target/wrkrnode1 containerPath: /ghostdir With this file, you can create a new cluster with the following definitions and configuration to spin up a two-node Kubernetes cluster called 2node: Shell $ KIND_EXPERIMENTAL_PROVIDER=podman kind create cluster --name=2node --config="2nodekindconfig.yaml" --retain using podman due to KIND_EXPERIMENTAL_PROVIDER enabling experimental podman provider Creating cluster "2node" ... ✓ Ensuring node image (kindest/node:v1.32.2) ✓ Preparing nodes ✓ Writing configuration ✓ Starting control-plane ✓ Installing CNI ✓ Installing StorageClass ✓ Joining worker nodes Set kubectl context to "kind-2node" You can now use your cluster with: kubectl cluster-info --context kind-2node Have a nice day! The Kubernetes cluster spins up, and you can view it with kubectl tooling as follows: Shell $ kubectl config view apiVersion: v1 clusters: - cluster: certificate-authority-data: DATA+OMITTED server: https://127.0.0.1:58599 name: kind-2node contexts: - context: cluster: kind-2node user: kind-2node name: kind-2node current-context: kind-2node kind: Config preferences: {} users: - name: kind-2node user: client-certificate-data: DATA+OMITTED client-key-data: DATA+OMITTED To make use of this cluster, you can set the context for your kubectl tooling as follows: Shell $ kubectl config use-context kind-2node Switched to context "kind-2node". Time to deploy a workload on this cluster to start generating real telemetry data for Fluent Bit. To prepare for this installation, we need to create the persistent volume storage for our workload, a Ghost CMS. The following needs to be put into the file ghost-static-pvs.yaml: Shell --- apiVersion: v1 kind: PersistentVolume metadata: name: ghost-content-volume labels: type: local spec: storageClassName: "" claimRef: name: data-my-ghost-mysql-0 namespace: ghost capacity: storage: 8Gi accessModes: - ReadWriteMany hostPath: path: "/ghostdir" --- apiVersion: v1 kind: PersistentVolume metadata: name: ghost-database-volume labels: type: local spec: storageClassName: "" claimRef: name: my-ghost namespace: ghost capacity: storage: 8Gi accessModes: - ReadWriteMany hostPath: path: "/ghostdir" With this file, you can now use kubectl to create it on your cluster as follows: Shell $ kubectl create -f ghost-static-pvs.yaml --validate=false persistentvolume/ghost-content-volume created persistentvolume/ghost-database-volume created With the foundations laid for using Ghost CMS as our workload, we need to add the Helm chart to our local repository before using it to install anything: Shell $ helm repo add bitnami https://charts.bitnami.com/bitnami "bitnami" has been added to your repositories The next step is to use this repository to install Ghost CMS, configuring it by supplying parameters as follows: Shell $ helm upgrade --install ghost-dep bitnami/ghost --version "21.1.15" --namespace=ghost --create-namespace --set ghostUsername="adminuser" --set ghostEmail="admin@example.com" --set service.type=ClusterIP --set service.ports.http=2368 Release "ghost-dep" does not exist. Installing it now.
NAME: ghost-dep LAST DEPLOYED: Thu May 1 16:28:26 2025 NAMESPACE: ghost STATUS: deployed REVISION: 1 TEST SUITE: None NOTES: CHART NAME: ghost CHART VERSION: 21.1.15 APP VERSION: 5.86.2 ** Please be patient while the chart is being deployed ** 1. Get the Ghost URL by running: echo Blog URL : http://127.0.0.1:2368/ echo Admin URL : http://127.0.0.1:2368/ghost kubectl port-forward --namespace ghost svc/ghost-dep 2368:2368 2. Get your Ghost login credentials by running: echo Email: admin@example.com echo Password: $(kubectl get secret --namespace ghost ghost-dep -o jsonpath="{.data.ghost-password}" | base64 -d) This command completes pretty quickly, but in the background, your cluster is spinning up the Ghost CMS nodes, and this takes some time. To ensure your installation is ready to proceed, run the following command that waits for the workload to finish spinning up before proceeding: Shell $ kubectl wait --for=condition=Ready pod --all --timeout=200s --namespace ghost pod/ghost-dep-74f8f646b-96d59 condition met pod/ghost-dep-mysql-0 condition met If this command times out due to your local machine taking too long, just restart it until it finishes with the two condition met statements. This means your Ghost CMS is up and running, but needs a bit of configuration to reach it on your cluster from the local machine. Run the following commands, noting the first one is put into the background with the ampersand sign: Shell $ kubectl port-forward --namespace ghost svc/ghost-dep 2368:2368 & Forwarding from 127.0.0.1:2368 -> 2368 Forwarding from [::1]:2368 -> 2368 [1] 6997 This completes the installation and configuration of our workload, which you can validate is up and running at http://localhost:2368. This should show you a Users Blog landing page on your Ghost CMS instance; nothing more is needed for this article than to have it running. The final step is to install Fluent Bit and start collecting cluster logs. Start by adding the Fluent Bit Helm chart to your local repository as follows: Shell $ helm repo add fluent https://fluent.github.io/helm-charts "fluent" has been added to your repositories The installation will need some configuration parameters that you need to put into a file passed to the helm chart during installation. Add the following to the file fluentbit-helm.yaml: Shell args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*.log multiline.parser: docker, cri filters: - name: grep match: '*' outputs: - name: stdout match: '*' With this file, you can now install Fluent Bit on your cluster as follows: Shell $ helm upgrade --install fluent-bit fluent/fluent-bit --set image.tag="4.0.0" --namespace=logging --create-namespace --values="support/fluentbit-helm.yaml" Release "fluent-bit" does not exist. Installing it now. 
NAME: fluent-bit LAST DEPLOYED: Thu May 1 16:50:04 2025 NAMESPACE: logging STATUS: deployed REVISION: 1 NOTES: Get Fluent Bit build information by running these commands: export POD_NAME=$(kubectl get pods --namespace logging -l "app.kubernetes.io/name=fluent-bit,app.kubernetes.io/instance=fluent-bit" -o jsonpath="{.items[0].metadata.name}") kubectl --namespace logging port-forward $POD_NAME 2020:2020 curl http://127.0.0.1:2020 This starts the installation of Fluent Bit, and again, you will need to wait until it completes with the help of the following commands: Shell $ kubectl wait --for=condition=Ready pod --all --timeout=100s --namespace logging pod/fluent-bit-58vs8 condition met Now you can verify that your Fluent Bit instance is running and collecting all Kubernetes cluster logs, from the control node, the worker node, and from the workloads on the cluster, with the following: Shell $ kubectl config set-context --current --namespace logging Context "kind-2node" modified. $ kubectl get pods NAME READY STATUS RESTARTS AGE fluent-bit-58vs8 1/1 Running 0 6m56s $ kubectl logs fluent-bit-58vs8 [DUMPS-ALL-CLUSTER-LOGS-TO-CONSOLE] Now you have a fully running Kubernetes cluster, with two nodes, a workload in the form of a Ghost CMS, and finally, you've installed Fluent Bit as your telemetry pipeline, configured to collect all cluster logs. If you want to do this without each step done manually, I've provided a Logs Control Easy Install project repository that you can download, unzip, and run with one command to automate the above setup on your local machine. More in the Series In this article, you learned how to install and configure Fluent Bit on a Kubernetes cluster to collect telemetry from the cluster. This article is based on this online free workshop. There will be more in this series as you continue to learn how to configure, run, manage, and master the use of Fluent Bit in the wild. Next up, controlling your logs with Fluent Bit on a Kubernetes cluster.
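One optional extra before moving on: the Helm values above enable Fluent Bit's built-in HTTP server on port 2020, and the chart notes show how to port-forward to it. If you prefer to check the pipeline from a script rather than with curl, here is a minimal sketch using Python and the requests library. It assumes the port-forward from the Helm notes is running locally; the /api/v1/uptime and /api/v1/metrics paths come from Fluent Bit's documented monitoring API, so adjust them if your version differs.
Python
import requests

# Assumes the port-forward from the Helm chart notes is running, for example:
#   kubectl --namespace logging port-forward <fluent-bit-pod> 2020:2020
BASE_URL = "http://127.0.0.1:2020"

# Uptime and build information from the built-in HTTP server.
uptime = requests.get(f"{BASE_URL}/api/v1/uptime", timeout=5).json()
print("Fluent Bit uptime:", uptime)

# Per-plugin counters, handy for confirming the tail input is reading
# container logs and the stdout output is emitting records.
metrics = requests.get(f"{BASE_URL}/api/v1/metrics", timeout=5).json()
print("Pipeline metrics:", metrics)
Nothing here is required for the rest of the series; it is just a quick way to confirm the pipeline is alive without digging through the pod logs.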

By Eric D. Schabell
Zero Trust for AWS NLBs: Why It Matters and How to Do It

Introduction to AWS Network Load Balancer AWS has several critical services that drive the internet. If you have ever built any application on top of AWS and need a high throughput or volume of traffic, the chances are that you’ve leaned on an AWS Network Load Balancer (NLB) at some point in the discussion. AWS NLB is nothing but a Layer 4 load balancer, and its consistency helps with low-latency forwarding of massive amounts of TCP, UDP, and even TLS traffic. NLBs, being operational at Layer 4 of the OSI model, support a host of features. You get features like static IPs and support for long-lived connections out of the box, and they can be configured to your requirements. In my projects, I’ve used NLBs for use cases ranging from being the front end for low-latency database requests to hosting an entire backend of an application. NLB helps in all these use cases by giving us a consistent latency, and it holds up its end every time. There are alternatives to NLBs like the AWS Application Load Balancers, but they operate at a higher level of the OSI model and are not always the choice for developers looking for a high-throughput, no-nonsense load balancer. Introduction to Zero Trust Architecture (ZTA) Zero trust is a concept that has been around for a while, and the original term was coined back in 2010. Given the growth in conversations about moving applications to the cloud, zero trust has been thrust into the forefront of conversations. In a traditional sense, there is an assumption that when a device or user is inside your network, it is considered safe. But, in a cloud-based network, this doesn’t really hold up anymore. Zero trust is built on the idea that you cannot trust anything, and you need to verify everything. Whether it’s a person, device, or another service, it has to prove that it belongs. It extends the concept of least privilege, stressing the need to validate identity at every step and never let your guard down. Why Zero Trust Architecture Matters for AWS Network Load Balancers In an AWS architecture, NLBs are usually the first line of interaction with the user, sitting right at the edge. They can take traffic from the internet, other services, or both. This is the reason NLBs are a prime spot for security enforcement. Let’s consider why NLBs and zero trust make sense: They’re the gatekeepers: Taking an analogy from your personal home, NLBs act as a front door. If your door is unlocked, you have the possibility of letting strangers wander into your home.Layer 4 simplicity: Unlike an ALB, NLB doesn’t have visibility into HTTP headers or cookies. This simplicity is why NLBs are fast, but also need extra effort to lock in security. Zero trust helps you lock things down with TLS, identity-aware proxies, and traffic filtering.They serve the heavy hitters: NLBs are often used by organizations and applications to front latency-sensitive apps like financial APIs, gaming, or streaming services. This calls for security at the NLB without sacrificing performance. Zero trust gives you a blueprint for that.The perimeter is blurry: More often than not, NLBs aren’t directly public-facing; rather, they are the backend for stuff like PrivateLink and multi-account setups. Traffic in these cases could be coming from anywhere, and we cannot classify these applications as internal just because they are coming in from internal AWS services. ZTA asks you to treat every such connection with suspicion.
Core Zero Trust Principles Applied to NLBs Now that we know what zero trust is and what the AWS NLB use case is, let’s actually put zero trust into action with the NLBs. This is what it looks like in practice: Never Trust, Always Verify NLBs don’t have the capability to look deep into packets; they see just the headers. However, NLBs can still enforce TLS with a valid certificate. If you need further security, we can even insert a service like OAuth that can help authenticate users and services. More often than not, we will need every TCP connection to prove its identity before anything moves forward. Least Privilege Access A common issue we all run into when building out a network is opening things up to a big CIDR block because it’s very convenient. Although convenient, it goes against the concept of zero trust. It’s better to have control over this and lock things down. Some of the ways we can achieve this are by using tightly scoped security groups, IAM policies, and target access controls. This way, we only let traffic that truly needs access get through. Micro-Segmentation A big monolithic NLB is always a problem in terms of security. It’s good to split services across different NLBs and VPCs. This helps mitigate compromise of a single entity. A single NLB being compromised doesn’t necessarily mean your entire network path is compromised. This way, we can ensure the blast radius is small. Continuous Monitoring An important part of operating NLBs is monitoring. AWS supports VPC flow logs extensively, and they also capture traffic involving NLBs. In addition to VPC flow logs, NLB access logs are an important auditing tool as well. AWS CloudWatch is by far one of the best log visualization services out there, and we can use it to watch the metrics that NLB publishes and add monitoring to those accordingly. Public vs. Private NLBs: Same Principles, Different Playbooks Whether your NLB is public-facing or internal-only, zero trust applies to both. It’s just that you will implement them a bit differently. Public NLBs: These provide public endpoints, and anyone can access them. A common way to lock down public NLBs is to use TLS. We can also add CloudFront or a third-party edge provider, all while keeping IP filtering and aggressive throttling to mitigate DDoS attacks.Private NLBs: These don’t have a public-facing endpoint and are often used along with other AWS networking. For this kind of NLB, it’s preferable to use PrivateLink in the network infrastructure. We need to make sure the IAM permissions are restrictive and use CloudWatch and logs to monitor everything. We have to treat even internal traffic like it might be hostile, because sometimes, it is. Implementation Steps for Zero Trust With AWS NLB Here’s a playbook to bring zero trust to life around your NLBs: Start with private subnets: Make sure the NLBs are placed in private subnets where possible. Use security groups for further restrictions on who can even see them.TLS termination: A secure communication line is vital in a zero-trust environment. Consider using TLS for an NLB and terminating it at the NLB using AWS Certificate Manager.Layer in auth: In many use cases, the traffic will be from another AWS service. For such service-to-service calls, always use IAM. For user-facing use cases, put something like Cognito, OAuth, or an API Gateway in front.Monitor everything: AWS prides itself on its monitoring capability.
Use NLB access logs, VPC flow logs, and CloudWatch metrics, and make sure these logs and metrics land in a place where owners can review them. Whether it’s AWS Security Hub or third-party services like Splunk or Datadog, the key is to have centralized visibility.Use PrivateLink: From a security standpoint, if the communications are between AWS services or VPCs, PrivateLink helps keep the traffic off the public internet, and this will let you enforce strict access controls. (A minimal code sketch of the TLS termination and access-log steps appears at the end of this article.) Advanced NLB Security Configurations If you want more advanced protection on your NLBs, there are other considerations to look at: Client IP preservation: NLBs can keep the original source IP if needed. When it comes to monitoring, it is an added benefit as you can get more details from the client IP, including geolocation, and enforce IP-based access control.DDoS protection: AWS Shield Standard is available by default, but if your NLBs are handling critical workloads, look into Shield Advanced. If your use case needs application-layer protections, add CloudFront + WAF in front of the NLB.Cross-zone consistency: AWS allows you to have cross-zone NLBs, and if you are using cross-zone enabled NLBs, make sure your security settings, including groups, logs, and IAM roles, are consistent in all the available zones.PrivateLink endpoint controls: When exposing services through PrivateLink, unless the use case prevents it, require manual connection approvals.Cryptographic hygiene: Enforce newer TLS ciphers and use ECDSA certs where you can. It’s faster and more secure. Final Thoughts Here’s the final parting thought: Zero trust isn’t just a feature you toggle on, but rather it’s a way of thinking or a mindset. When you apply this mindset to an AWS NLB setup, you go from simply routing packets to actually securing them in real-world, meaningful ways. AWS gives you the building blocks for zero trust, like static IPs, TLS, PrivateLink, IAM, and logs. It’s up to you to stitch them together. By ensuring zero trust practices are followed for an NLB, you make it not just fast, scalable, and reliable, but also smart and secure. And in today’s threat landscape, that’s what matters most.
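To make the TLS termination and access-log guidance above concrete, here is a minimal sketch using Python and boto3. It is not a complete setup: the load balancer, target group, and certificate ARNs and the S3 bucket name are placeholders from an assumed environment, and the SSL policy shown is simply one of the current TLS 1.3-capable policies, so swap in whatever your organization has standardized on.
Python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARNs and bucket name: substitute values from your own account.
NLB_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/net/my-nlb/abc123"
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/my-targets/def456"
CERT_ARN = "arn:aws:acm:us-east-1:111122223333:certificate/example-cert-id"
LOG_BUCKET = "my-nlb-access-logs"

# TLS termination at the NLB, using an ACM certificate and a TLS 1.3-capable policy.
elbv2.create_listener(
    LoadBalancerArn=NLB_ARN,
    Protocol="TLS",
    Port=443,
    SslPolicy="ELBSecurityPolicy-TLS13-1-2-2021-06",
    Certificates=[{"CertificateArn": CERT_ARN}],
    DefaultActions=[{"Type": "forward", "TargetGroupArn": TARGET_GROUP_ARN}],
)

# Turn on access logging so every connection lands somewhere owners can review.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn=NLB_ARN,
    Attributes=[
        {"Key": "access_logs.s3.enabled", "Value": "true"},
        {"Key": "access_logs.s3.bucket", "Value": LOG_BUCKET},
    ],
)
One detail worth remembering: NLB access logs are only produced for TLS listeners, which is one more reason terminating TLS at the NLB pairs nicely with the monitoring steps above.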

By Sathish Holla
A Guide to Container Runtimes

Kubernetes, also known as K8S, is an open-source container orchestration system that is used for automating the deployment, scaling, and management of containerized workloads. Containers are at the heart of the Kubernetes ecosystem and are the building blocks of the services built and managed by K8S. Understanding how containers are run is key to optimizing your Kubernetes environment. What Are Containers and Container Runtimes? Containers bundle applications and their dependencies in a lightweight, efficient way. Many people think Docker runs containers directly, but that's not quite accurate. Docker is actually a collection of tools sitting on top of a high-level container runtime, which in turn uses a low-level runtime implementation. Runtime systems dictate how the image being deployed by a container is managed. Container runtimes are the actual operators that run the containers in the most efficient way, and they affect how resources such as network, disk, and I/O are managed, and therefore performance. So while Kubernetes orchestrates the containers, deciding, for example, where to run them, it is the runtime that executes those decisions. Picking a container runtime thus influences the application performance. Container runtimes themselves come in two flavors: high-level container runtimes that handle image management and container lifecycle, and low-level OCI-compliant runtimes that actually create and run the containers. Low-level runtimes are basically libraries that developers of high-level runtimes can build on to make use of the low-level features. A high-level runtime receives instructions, manages the necessary image, and then calls a low-level runtime to create and run the actual container process. What High-Level Container Runtime Should You Choose? There are various studies that compare the low-level runtimes, but it is also important that high-level container runtimes are chosen carefully. Docker: This is a container runtime that includes container creation, packing, sharing, and execution. Docker was created as a monolithic daemon, dockerd, and the docker client program, and features a client/server design. The daemon handles the majority of the logic for creating containers, managing images, and operating containers, as well as providing an API.ContainerD: This was created to be used by Docker and Kubernetes, as well as any other container technology that wants to abstract out syscalls and OS-specific functionality in order to operate containers on Linux, Windows, SUSE, and other operating systems.CRI-O: This was created specifically as a lightweight runtime only for Kubernetes and can handle only those kinds of operations. The runtimes mentioned are popular and are being offered by every major cloud provider. While Docker, as the high-level container runtime, is on its way out, the other two are here to stay. Parameters to Consider Performance: ContainerD or CRI-O is generally known to have better performance since the overhead of operations is lower. Docker is a monolithic system that has all the feature bits required, which increases the overhead. Although the network performance between the two is not very different, either can be chosen if that is an important factor.Features: Since ContainerD is a lightweight system, it does not always have all the features if that is an important consideration, whereas Docker has a large feature set.
When comparing ContainerD to CRI-O, CRI-O has a smaller feature set since it only targets Kubernetes.Defaults: A lot of the cloud providers have recommendations for the managed container runtimes. There are benefits to using them directly since they should have longer support. Why Should You Consider Manual Deployment? Until now, I have talked about managed K8S deployment, which is provided by the major cloud providers such as Amazon, Microsoft, Google, etc. But there is another way of hosting your infrastructure — manage it on your own. This is where manual deployment comes in. You have full control over every single component in your system, giving you the ability to remove unnecessary features. But it introduces the overhead of managing the deployment. Conclusion It is vital to write down the use case you are trying to achieve while making these decisions. For some cases, a manual deployment would be better, whereas in other cases, a managed deployment would win. By understanding these different components and trade-offs, you can make better-informed decisions about configuring your high-level container runtime for optimal performance and manageability.
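As a small, practical companion to this comparison, it can help to see which high-level runtime your existing nodes are running before you decide whether a change is even worth it. The sketch below uses the official Kubernetes Python client and assumes a working kubeconfig; the containerRuntimeVersion field it prints reports values like containerd://1.7.x or cri-o://1.29.x.
Python
from kubernetes import client, config

# Load credentials from the local kubeconfig (the same one kubectl uses).
config.load_kube_config()

v1 = client.CoreV1Api()

# Print the container runtime reported by the kubelet on each node,
# e.g. "containerd://1.7.18" or "cri-o://1.29.4".
for node in v1.list_node().items:
    info = node.status.node_info
    print(f"{node.metadata.name}: {info.container_runtime_version} (kubelet {info.kubelet_version})")
This is the same information that kubectl get nodes -o wide shows in its CONTAINER-RUNTIME column; having it in a script just makes it easier to audit larger fleets.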

By Rachit Jain
Unlocking the Benefits of a Private API in AWS API Gateway

AWS API Gateway is a managed service to create, publish, and manage APIs. It serves as a bridge between your applications and backend services. When creating APIs for our backend services, we tend to open them up using public IPs. Yes, we do authenticate and authorize access. However, oftentimes it is seen that a particular API is meant for internal applications only. In such cases, it would be great to declare these as private. Public APIs expose your services to a broader audience over the internet and thus come with risks related to data exposure and unauthorized access. On the other hand, private APIs are meant for internal consumption only. This provides an additional layer of security and eliminates the risk of potential data theft and unauthorized access. AWS API Gateway supports private APIs. If an API is used only by internal applications, it should be declared as private in API Gateway. This ensures that your data remains protected while still allowing teams to leverage the API for developing applications. The Architecture So, how does a private API really work? The first step is to mark the API as private when creating one in API Gateway. Once done, it will not have any public IP attached to it, which means that it will not be accessible over the Internet. Next, proceed with the API Gateway configuration. Define your resources and methods according to your application’s requirements. For each method, consider implementing appropriate authorization mechanisms such as IAM roles or resource policies to enforce strict access controls. Setting up the private access involves creating an interface VPC endpoint. The consumer applications would typically be running in a private subnet of a VPC. These applications would be able to access the API through the VPC endpoint. As an example, let us suppose that we are building an application using ECS as the compute service. The ECS cluster would run within a private subnet of a VPC. The application would need to access some common services. These services are a set of microservices developed on Lambda and exposed through API Gateway. This is a perfect scenario, and a pretty common one, where it makes sense to declare these APIs as private (a code sketch of this setup appears a little further below). Key Benefits A private API can significantly increase the performance and security of an application. In this age of cybercrime, protecting data should be of utmost importance. Unscrupulous actors on the internet are always on the lookout for vulnerabilities, and any leak in the system poses a potential threat of data theft. Data security use cases are becoming incredibly important. This is where a private API is so advantageous. All interactions between services are within a private network, and since the services are not publicly exposed, there is no chance of data theft over the internet. Private APIs allow a secure method of data exchange, and the less exposed your data is, the better. Private APIs allow you to manage the overall data security aspects of your enterprise solution by letting you control access to sensitive data and ensuring it’s only exposed in the secure environments you’ve approved. The requests and responses don’t need to travel over the internet. Interactions are within a closed network. Resources in a VPC can interact with the API over the private AWS network. This goes a long way in reducing latencies and optimizing network traffic. As a result, a private API can ensure better performance and can be a go-to option for applications with quick processing needs.
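Before going further into the benefits, here is a rough sketch of the architecture described above, using Python and boto3 to create a REST API with the PRIVATE endpoint type and attach a resource policy that only allows invocations arriving through a specific interface VPC endpoint. The API name and the VPC endpoint ID are placeholders, and a real setup would still add resources, methods, authorizers, and a deployment on top of this.
Python
import json
import boto3

apigw = boto3.client("apigateway")

# Placeholder interface VPC endpoint ID for the consumers' VPC.
VPC_ENDPOINT_ID = "vpce-0123456789abcdef0"

# Resource policy: allow invocations, but deny anything that does not arrive
# through the expected interface VPC endpoint.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "execute-api:Invoke",
            "Resource": "execute-api:/*",
        },
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "execute-api:Invoke",
            "Resource": "execute-api:/*",
            "Condition": {"StringNotEquals": {"aws:SourceVpce": VPC_ENDPOINT_ID}},
        },
    ],
}

# Create the REST API with the PRIVATE endpoint type so it never gets a public endpoint.
api = apigw.create_rest_api(
    name="internal-orders-api",
    endpointConfiguration={"types": ["PRIVATE"]},
    policy=json.dumps(policy),
)
print("Created private API:", api["id"])
Marking the endpoint type as PRIVATE removes the public endpoint, while the resource policy controls which VPC endpoints may invoke the API.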
Moreover, private APIs make it easy to implement strong access control. You can determine, with fine-grained precision, who can access what from where, and what conditions need to be in place to do so, while providing custom access-level groups as your organization sees fit. With thorough access control in place, not only is security improved, but work also flows more smoothly. Finally, there is the element of cost, a benefit many do not consider when using private APIs in the AWS API Gateway. Utilizing private APIs can significantly reduce the costs that come with handling public traffic, since requests stay within the VPC. The savings will vary from workload to workload, but over time they can be significant. In addition to the benefits above, private APIs give your business the opportunity to develop an enterprise solution that meets your development needs. Building internal applications for your own use can help further customize your workflows or tailor customer experience, by allowing unique steps and experiences to be developed for customer journeys. Private APIs allow your organization to be dynamic and replicate tools or services quickly, while maintaining control of your technology platform. This allows your business to apply ideas and processes for future growth while remaining competitive in an evolving marketplace. Deploying private APIs within the AWS API Gateway is not solely a technical move — it is a means of investing in the reliability, future-proofing, and capability of your system architecture. The Importance of Making APIs Private In the modern digital world, securing your APIs has never been more important. If you don’t require public access to your APIs by clients, the best option is to make them private. By doing so, you can reduce the opportunity for threats and vulnerabilities to exist where they may compromise your data and systems. Public APIs become targets for anyone with malicious intent who wants to find and exploit openings. By keeping your APIs private and limiting access, you protect sensitive information and improve performance by removing unnecessary traffic. Additionally, utilizing best practices for secure APIs — using authentication protocols, testing for rate limiting, and encrypting your sensitive information — adds stronger front-line defenses. Making your APIs private is not just a defensive action, but a proactive strategy to secure the organization and its assets. In a world where breaches can result in catastrophic consequences, a responsible developer or organization should take every preemptive measure necessary to protect their digital environment. Best Practices The implementation of private APIs requires following best practices to achieve strong security, together with regulated access and efficient versioning. Security needs to be your number one priority at all times. Protecting data against unauthorized access becomes possible through authentication methods such as OAuth or API keys. Implementing a private API doesn’t mean that unauthorized access will not happen, and adequate protection should be in place. API integrity depends heavily on proper access control mechanisms. Role-based access control (RBAC) should be used to ensure users receive permissions that exactly match their needs.
The implementation of this system protects sensitive endpoints from exposure while providing authorized users with smooth API interaction. The sustainable operation of your private API also depends on proper management of its versioning so that consumers are not broken by change. A versioning scheme based on URL paths or request headers enables you to introduce new features and updates without disrupting existing integrations. The approach delivers a better user experience while establishing trust in API reliability. Conclusion In conclusion, private APIs aren't a passing fad; they are a deliberate initiative to strengthen the security and efficiency of your applications. When you embrace private APIs, you are creating a method to protect sensitive data within a security-first framework, while enabling its use in other internal systems. In an environment of constant data breaches, that safeguard is paramount. Private APIs will undoubtedly improve not only the security posture of your applications but also their overall performance.

By Satrajit Basu
Event-Driven Architectures: Designing Scalable and Resilient Cloud Solutions

Event-driven architectures (EDA) have been a cornerstone in designing cloud systems that are future-proofed, scalable, resilient, and sustainable in nature. EDA centers on the generation, capture, and handling of events, rather than the traditional request-response model. The paradigm is most suitable for systems that require high decoupling, elasticity, and fault tolerance. In this article, I'll be discussing the technical details of event-driven architectures, along with snippets of code, patterns, and practical implementation strategies. Let's get started! Core Principles of Event-Driven Architecture Event-driven architecture (EDA) is a way of designing systems where different services communicate by responding to events as they happen. At its core, EDA relies on key principles that enable seamless interaction, scalability, and responsiveness across applications. They can be summarized as: 1. Event Producers, Consumers, and Brokers Event producers: Systems that produce events, i.e., the action of a user, sensor readings of Internet of Things (IoT) devices, or system events.Event consumers: Processes or services that process events and take some action.Event brokers: Middleware components that manage communication between producers and consumers using event dissemination (e.g., Kafka, RabbitMQ, Amazon SNS). 2. Event Types Discrete events: Single events, i.e., logon of a user.Stream events: Streams of events, i.e., telemetry readings of an IoT sensor. 3. Asynchronous Communication EDA is asynchronous in nature, in which producers are decoupled from consumers. Systems can be evolved and scaled independently. 4. Eventual Consistency For distributed systems, EDA prefers eventual consistency over strong consistency, offering higher throughput and scalability. Benefits of event-driven architectures include: Scalability: Decoupled components can be scaled separately.Resilience: Failure in one component would not impact other components.Flexibility: One can plug in or replace pieces without a gigantic amount of reengineering.Real-time processing: EDA is a natural fit for processing in real time, analysis, monitoring, and alarming. Using EDA in Cloud Solutions To appreciate EDA in action, suppose you have a sample e-commerce cloud application that processes orders, keeps stock up to date, and notifies users in real time. Let's build this system from the ground up using contemporary cloud technologies and software design principles. The tech stack we'll be using in this tutorial: Event broker: Apache Kafka or Amazon EventBridgeConsumers/producers: Python microservicesCloud infrastructure: AWS Lambda, S3, DynamoDB Step 1: Define Events Decide on the events that drive your system. In an e-commerce application, events that you would generally find are something like these: OrderPlaced PaymentProcessed InventoryUpdated UserNotified Step 2: Event Schema Design an event schema to allow components to send events to each other in a standardized manner. Assuming you use JSON as the events structure, here's what a sample structure would look like (feel free to write your own format): JSON { "eventId": "12345", "eventType": "OrderPlaced", "timestamp": "2025-01-01T12:00:00Z", "data": { "orderId": "67890", "userId": "abc123", "totalAmount": 150.75 } } Step 3: Producer Implementation An OrderService produces events when a new order is created by a customer.
Here's what it looks like: Python from kafka import KafkaProducer import json def produce_event(event_type, data): producer = KafkaProducer( bootstrap_servers='localhost:9092', value_serializer=lambda v: json.dumps(v).encode('utf-8')) event = { "eventId": "12345", "eventType": event_type, "timestamp": "2025-01-01T12:00:00Z", "data": data } producer.send('order_events', value=event) producer.close() # Example usage order_data = { "orderId": "67890", "userId": "abc123", "totalAmount": 150.75 } produce_event("OrderPlaced", order_data) Step 4: Event Consumer An OrderPlaced event is processed by a NotificationService to notify the user. Let's quickly write up a Python script to consume the events: Python from kafka import KafkaConsumer import json def consume_events(): consumer = KafkaConsumer( 'order_events', bootstrap_servers='localhost:9092', value_deserializer=lambda v: json.loads(v.decode('utf-8')) ) for message in consumer: event = message.value if event['eventType'] == "OrderPlaced": send_notification(event['data']) def send_notification(order_data): print(f"Sending notification for Order ID: {order_data['orderId']} to User ID: {order_data['userId']}") # Example usage consume_events() Step 5: Event Broker Configuration Set up Kafka or a cloud-native event broker like Amazon EventBridge to route events to their destinations. In Kafka, create a topic named order_events. Shell kafka-topics --create --topic order_events --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1 We'll use this topic for storing and organizing events. Topics are similar to folders in a file system, where events are the files. Fault Tolerance and Scaling Fault tolerance and scalability are achieved by decoupling components so that any one of them can fail without jeopardizing the system, and by making it convenient to scale horizontally, adding or removing components to accommodate different workloads. Such a design is highly resilient and adapts well to changing demands. Some of the methods are: 1. Dead Letter Queues (DLQs) Queue failed events to retry later using DLQs. For example, if the NotificationService fails to process an event, the event can be sent to a DLQ to be retried later (a sketch of this pattern appears at the end of this article). 2. Horizontal Scaling Horizontally scale up consumers to process more events in parallel. Kafka consumer groups are provided out of the box to distribute messages across multiple consumers. 3. Retry Mechanism Use exponential backoff retry in case of failure. Here's an example: Python import time def process_event_with_retries(event, max_retries=3): for attempt in range(max_retries): try: send_notification(event['data']) break except Exception as e: print(f"Attempt {attempt + 1} failed: {e}") time.sleep(2 ** attempt) Advanced Patterns in EDA Let's now explore some advanced patterns that are essential for event-driven architecture (EDA). Buckle up! 1. Event Sourcing "Event Sourcing Pattern" refers to a design approach where every change to an application's state is captured and stored as a sequence of events. Saving every event lets you retrieve the system state at any given point in time, which is helpful for audit trails and debugging. Here's a sample Python snippet: Python # Save event to a persistent store import boto3 dynamodb = boto3.resource('dynamodb') event_table = dynamodb.Table('EventStore') def save_event(event): event_table.put_item(Item=event) 2.
CQRS (Command Query Responsibility Segregation) The command query responsibility segregation (CQRS) pattern separates the data mutation, or the command part of a system, from the query part. You can use the CQRS pattern to separate updates and queries if they have different requirements for throughput, latency, or consistency. This allows each model to be optimized independently and can improve the performance, scalability, and security of an application. 3. Streaming Analytics Use Apache Flink or AWS Kinesis Data Analytics to process streams of events in real time to get insights and send alarms. To deploy and run a streaming ETL pipeline, such an architecture can rely on Kinesis Data Analytics. Kinesis Data Analytics enables you to run Flink applications in a fully managed environment. The service provisions and manages the required infrastructure, scales the Flink application in response to changing traffic patterns, and automatically recovers from infrastructure and application failures. You can combine the expressive Flink API for processing streaming data with the advantages of a managed service by using Kinesis Data Analytics to deploy and run Flink applications. It allows you to build robust streaming ETL pipelines and reduces the operational overhead of provisioning and operating infrastructure. Conclusion Event-driven architectures are a compelling paradigm for building scalable and resilient systems in the cloud. With asynchronous communication, eventual consistency, and advanced patterns such as event sourcing and CQRS, developers can build resilient systems that can cope with changing requirements. Today's tools, such as Kafka and AWS EventBridge, together with microservices, make it easy to adopt EDA in a multi-cloud environment. This article, loaded with practical application use cases, is just the start of applying event-driven architecture to your next cloud project. With EDA, companies can enjoy the complete benefits of real-time processing, scalability, and fault tolerance.
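To round this out, here is a minimal sketch of the dead letter queue and consumer group ideas from the fault-tolerance section, building on the earlier NotificationService example. It assumes an order_events_dlq topic has been created next to order_events, and the error handling is intentionally simplified.
Python
from kafka import KafkaConsumer, KafkaProducer
import json

BROKER = "localhost:9092"

# Consumers that share a group_id form a consumer group, so running more
# instances of this service spreads the order_events partitions across them.
consumer = KafkaConsumer(
    "order_events",
    bootstrap_servers=BROKER,
    group_id="notification-service",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

dlq_producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def send_notification(order_data):
    print(f"Notifying user {order_data['userId']} about order {order_data['orderId']}")

for message in consumer:
    event = message.value
    try:
        if event["eventType"] == "OrderPlaced":
            send_notification(event["data"])
    except Exception as exc:
        # Park the failing event on the DLQ topic so it can be retried or
        # inspected later without blocking the rest of the stream.
        event["error"] = str(exc)
        dlq_producer.send("order_events_dlq", value=event)
A separate consumer can then read order_events_dlq and replay events using the exponential backoff helper shown earlier, keeping failures out of the hot path.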

By Srinivas Chippagiri
How to Configure and Customize the Go SDK for Azure Cosmos DB

The Go SDK for Azure Cosmos DB is built on top of the core Azure Go SDK package, which implements several patterns that are applied throughout the SDK. The core SDK is designed to be quite customizable, and its configurations can be applied with the ClientOptions struct when creating a new Cosmos DB client object using NewClient (and other similar functions). If you peek inside the azcore.ClientOptions struct, you will notice that it has many options for configuring the HTTP client, retry policies, timeouts, and other settings. In this blog, we will cover how to make use of (and extend) these common options when building applications with the Go SDK for Cosmos DB. I have provided code snippets throughout this blog. Refer to this GitHub repository for runnable examples. Retry Policies Common retry scenarios are handled in the SDK. You can dig into cosmos_client_retry_policy.go for more info. Here is a summary of errors for which retries are attempted: Error Type / Status CodeRetry LogicNetwork Connection ErrorsRetry after marking endpoint unavailable and waiting for defaultBackoff.403 Forbidden (with specific substatuses)Retry after marking endpoint unavailable and updating the endpoint manager.404 Not Found (specific substatus)Retry by switching to another session or endpoint.503 Service UnavailableRetry by switching to another preferred location. Let's see some of these in action. Non-Retriable Errors For example, here is a function that tries to read a database that does not exist. Go func retryPolicy1() { c, err := auth.GetClientWithDefaultAzureCredential("https://demodb.documents.azure.com:443/", nil) if err != nil { log.Fatal(err) } azlog.SetListener(func(cls azlog.Event, msg string) { // Log retry-related events switch cls { case azlog.EventRetryPolicy: fmt.Printf("Retry Policy Event: %s\n", msg) } }) // Set logging level to include retries azlog.SetEvents(azlog.EventRetryPolicy) db, err := c.NewDatabase("i_dont_exist") if err != nil { log.Fatal("NewDatabase call failed", err) } _, err = db.Read(context.Background(), nil) if err != nil { log.Fatal("Read call failed: ", err) } The azcore logging implementation is configured using SetListener and SetEvents to write retry policy event logs to standard output. See Logging section in azcosmos package README for details. Let's look at the logs generated when this code is run: Plain Text //.... Retry Policy Event: exit due to non-retriable status code Retry Policy Event: =====> Try=1 for GET https://demodb.documents.azure.com:443/dbs/i_dont_exist Retry Policy Event: response 404 Retry Policy Event: exit due to non-retriable status code Read call failed: GET https://demodb-region.documents.azure.com:443/dbs/i_dont_exist -------------------------------------------------------------------------------- RESPONSE 404: 404 Not Found ERROR CODE: 404 Not Found When a request is made to read a non-existent database, the SDK gets a 404 (not found) response for the database. This is recognized as a non-retriable error, and the SDK stops retrying. Retries are only performed for retriable errors (like network issues or certain status codes). The operation failed because the database does not exist. Retriable Errors - Invalid Account This function tries to create a Cosmos DB client using an invalid account endpoint. It sets up logging for retry policy events and attempts to create a database. 
Go func retryPolicy2() { c, err := auth.GetClientWithDefaultAzureCredential("https://iamnothere.documents.azure.com:443/", nil) if err != nil { log.Fatal(err) } azlog.SetListener(func(cls azlog.Event, msg string) { // Log retry-related events switch cls { case azlog.EventRetryPolicy: fmt.Printf("Retry Policy Event: %s\n", msg) } }) // Set logging level to include retries azlog.SetEvents(azlog.EventRetryPolicy) _, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil) if err != nil { log.Fatal(err) } } Let's look at the logs generated when this code is run, and see how the SDK handles retries when the endpoint is unreachable: Plain Text //.... Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #1, Delay=682.644105ms Retry Policy Event: =====> Try=2 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #2, Delay=2.343322179s Retry Policy Event: =====> Try=3 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #3, Delay=7.177314269s Retry Policy Event: =====> Try=4 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: MaxRetries 3 exceeded failed to retrieve account properties: Get "https://iamnothere.docume Each failed attempt is logged, and the SDK retries the operation several times (three times to be specific), with increasing delays between attempts. After exceeding the maximum number of retries, the operation fails with an error indicating the host could not be found - the SDK automatically retries transient network errors before giving up. But you don't have to stick to the default retry policy. You can customize the retry policy by setting the azcore.ClientOptions when creating the Cosmos DB client. Configurable Retries Let's say you want to set a custom retry policy with a maximum of two retries and a delay of one second between retries. You can do this by creating a policy.RetryOptions struct and passing it to the azcosmos.ClientOptions when creating the client. Go func retryPolicy3() { retryPolicy := policy.RetryOptions{ MaxRetries: 2, RetryDelay: 1 * time.Second, } opts := azcosmos.ClientOptions{ ClientOptions: policy.ClientOptions{ Retry: retryPolicy, }, } c, err := auth.GetClientWithDefaultAzureCredential("https://iamnothere.documents.azure.com:443/", &opts) if err != nil { log.Fatal(err) } log.Println(c.Endpoint()) azlog.SetListener(func(cls azlog.Event, msg string) { // Log retry-related events switch cls { case azlog.EventRetryPolicy: fmt.Printf("Retry Policy Event: %s\n", msg) } }) azlog.SetEvents(azlog.EventRetryPolicy) _, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil) if err != nil { log.Fatal(err) } } Each failed attempt is logged, and the SDK retries the operation according to the custom policy — only two retries, with a 1-second delay after the first attempt and a longer delay after the second. 
After reaching the maximum number of retries, the operation fails with an error indicating the host could not be found. Plain Text Retry Policy Event: =====> Try=1 for GET https://iamnothere.documents.azure.com:443/ //.... Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #1, Delay=1.211970493s Retry Policy Event: =====> Try=2 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: End Try #2, Delay=3.300739653s Retry Policy Event: =====> Try=3 for GET https://iamnothere.documents.azure.com:443/ Retry Policy Event: error Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host Retry Policy Event: MaxRetries 2 exceeded failed to retrieve account properties: Get "https://iamnothere.documents.azure.com:443/": dial tcp: lookup iamnothere.documents.azure.com: no such host exit status 1 Note: The first attempt is not counted as a retry, so the total number of attempts is three (1 initial + 2 retries). You can customize this further by implementing fault injection policies. This allows you to simulate various error scenarios for testing purposes. Fault Injection For example, you can create a custom policy that injects a fault into the request pipeline. Here, we use a custom policy (FaultInjectionPolicy) that simulates a network error on every request. Go type FaultInjectionPolicy struct { failureProbability float64 // e.g., 0.3 for 30% chance to fail } // Implement the Policy interface func (f *FaultInjectionPolicy) Do(req *policy.Request) (*http.Response, error) { if rand.Float64() < f.failureProbability { // Simulate a network error return nil, &net.OpError{ Op: "read", Net: "tcp", Err: errors.New("simulated network failure"), } } // no failure - continue with the request return req.Next() } This can be used to inject custom failures into the request pipeline. The function configures the Cosmos DB client to use this policy, sets up logging for retry events, and attempts to create a database. Go func retryPolicy4() { opts := azcosmos.ClientOptions{ ClientOptions: policy.ClientOptions{ PerRetryPolicies: []policy.Policy{&FaultInjectionPolicy{failureProbability: 0.6}, }, } c, err := auth.GetClientWithDefaultAzureCredential("https://ACCOUNT_NAME.documents.azure.com:443/", &opts) // Updated to use opts if err != nil { log.Fatal(err) } azlog.SetListener(func(cls azlog.Event, msg string) { // Log retry-related events switch cls { case azlog.EventRetryPolicy: fmt.Printf("Retry Policy Event: %s\n", msg) } }) // Set logging level to include retries azlog.SetEvents(azlog.EventRetryPolicy) _, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test_1"}, nil) if err != nil { log.Fatal(err) } } Take a look at the logs generated when this code is run. Each request attempt fails due to the simulated network error. The SDK logs each retry, with increasing delays between attempts. After reaching the maximum number of retries (default = 3), the operation fails with an error indicating a simulated network failure. Note: This can change depending on the failure probability you set in the FaultInjectionPolicy. In this case, we set it to 0.6 (60% chance to fail), so you may see different results each time you run the code. 
Plain Text Retry Policy Event: =====> Try=1 for GET https://ACCOUNT_NAME.documents.azure.com:443/ //.... Retry Policy Event: MaxRetries 0 exceeded Retry Policy Event: error read tcp: simulated network failure Retry Policy Event: End Try #1, Delay=794.018648ms Retry Policy Event: =====> Try=2 for GET https://ACCOUNT_NAME.documents.azure.com:443/ Retry Policy Event: error read tcp: simulated network failure Retry Policy Event: End Try #2, Delay=2.374693498s Retry Policy Event: =====> Try=3 for GET https://ACCOUNT_NAME.documents.azure.com:443/ Retry Policy Event: error read tcp: simulated network failure Retry Policy Event: End Try #3, Delay=7.275038434s Retry Policy Event: =====> Try=4 for GET https://ACCOUNT_NAME.documents.azure.com:443/ Retry Policy Event: error read tcp: simulated network failure Retry Policy Event: MaxRetries 3 exceeded Retry Policy Event: =====> Try=1 for GET https://ACCOUNT_NAME.documents.azure.com:443/ Retry Policy Event: error read tcp: simulated network failure Retry Policy Event: End Try #1, Delay=968.457331ms 2025/05/05 19:53:50 failed to retrieve account properties: read tcp: simulated network failure exit status 1 Do take a look at Custom HTTP pipeline policies in the Azure SDK for Go documentation for more information on how to implement custom policies. HTTP-Level Customizations There are scenarios where you may need to customize the HTTP client used by the SDK. For example, when using the Cosmos DB emulator locally, you want to skip certificate verification to connect without SSL errors during development or testing. TLSClientConfig allows you to customize TLS settings for the HTTP client and setting InsecureSkipVerify: true disables certificate verification – useful for local testing but insecure for production. Go func customHTTP1() { // Create a custom HTTP client with a timeout client := &http.Client{ Transport: &http.Transport{ TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, }, } clientOptions := &azcosmos.ClientOptions{ ClientOptions: azcore.ClientOptions{ Transport: client, }, } c, err := auth.GetEmulatorClientWithAzureADAuth("http://localhost:8081", clientOptions) if err != nil { log.Fatal(err) } _, err = c.CreateDatabase(context.Background(), azcosmos.DatabaseProperties{ID: "test"}, nil) if err != nil { log.Fatal(err) } } All you need to do is pass the custom HTTP client to the ClientOptions struct when creating the Cosmos DB client. The SDK will use this for all requests. Another scenario is when you want to set a custom header for all requests to track requests or add metadata. All you need to do is implement the Do method of the policy.Policy interface and set the header in the request: Go type CustomHeaderPolicy struct{} func (c *CustomHeaderPolicy) Do(req *policy.Request) (*http.Response, error) { correlationID := uuid.New().String() req.Raw().Header.Set("X-Correlation-ID", correlationID) return req.Next() } Looking at the logs, notice the custom header X-Correlation-ID is added to each request: Plain Text //... 
Request Event: ==> OUTGOING REQUEST (Try=1) GET https://ACCOUNT_NAME.documents.azure.com:443/ Authorization: REDACTED User-Agent: azsdk-go-azcosmos/v1.3.0 (go1.23.6; darwin) X-Correlation-Id: REDACTED X-Ms-Cosmos-Sdk-Supportedcapabilities: 1 X-Ms-Date: Tue, 06 May 2025 04:27:37 GMT X-Ms-Version: 2020-11-05 Request Event: ==> OUTGOING REQUEST (Try=1) POST https://ACCOUNT_NAME-region.documents.azure.com:443/dbs Authorization: REDACTED Content-Length: 27 Content-Type: application/query+json User-Agent: azsdk-go-azcosmos/v1.3.0 (go1.23.6; darwin) X-Correlation-Id: REDACTED X-Ms-Cosmos-Sdk-Supportedcapabilities: 1 X-Ms-Date: Tue, 06 May 2025 04:27:37 GMT X-Ms-Documentdb-Query: True X-Ms-Version: 2020-11-05 OpenTelemetry Support The Azure Go SDK supports distributed tracing via OpenTelemetry. This allows you to collect, export, and analyze traces for requests made to Azure services, including Cosmos DB. The azotel package is used to connect an instance of OpenTelemetry's TracerProvider to an Azure SDK client (in this case, Cosmos DB). You can then configure the TracingProvider in azcore.ClientOptions to enable automatic propagation of trace context and emission of spans for SDK operations. Go func getClientOptionsWithTracing() (*azcosmos.ClientOptions, *trace.TracerProvider) { exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint()) if err != nil { log.Fatalf("failed to initialize stdouttrace exporter: %v", err) } tp := trace.NewTracerProvider(trace.WithBatcher(exporter)) otel.SetTracerProvider(tp) op := azcosmos.ClientOptions{ ClientOptions: policy.ClientOptions{ TracingProvider: azotel.NewTracingProvider(tp, nil), }, } return &op, tp } The above function creates a stdout exporter for OpenTelemetry (prints traces to the console). It sets up a TracerProvider, registers this as the global tracer, and returns a ClientOptions struct with the TracingProvider set, ready to be used with the Cosmos DB client. Go func tracing() { op, tp := getClientOptionsWithTracing() defer func() { _ = tp.Shutdown(context.Background()) }() c, err := auth.GetClientWithDefaultAzureCredential("https://ACCOUNT_NAME.documents.azure.com:443/", op) //.... container, err := c.NewContainer("existing_db", "existing_container") if err != nil { log.Fatal(err) } //ctx := context.Background() tracer := otel.Tracer("tracer_app1") ctx, span := tracer.Start(context.Background(), "query-items-operation") defer span.End() query := "SELECT * FROM c" pager := container.NewQueryItemsPager(query, azcosmos.NewPartitionKey(), nil) for pager.More() { queryResp, err := pager.NextPage(ctx) if err != nil { log.Fatal("query items failed:", err) } for _, item := range queryResp.Items { log.Printf("Queried item: %+v\n", string(item)) } } } The above function calls getClientOptionsWithTracing to get tracing-enabled options and a tracer provider, and ensures the tracer provider is shut down at the end (flushes traces). It creates a Cosmos DB client with tracing enabled, executes an operation to query items in a container. The SDK call is traced automatically, and exported to stdout in this case. You can plug in any OpenTelemetry-compatible tracer provider and traces can be exported to various backend. Here is a snippet for Jaeger exporter. The traces are quite large, so here is a small snippet of the trace output. Check the query_items_trace.txt file in the repo for the full trace output: Go //... 
{ "Name": "query_items democontainer", "SpanContext": { "TraceID": "39a650bcd34ff70d48bbee467d728211", "SpanID": "f2c892bec75dbf5d", "TraceFlags": "01", "TraceState": "", "Remote": false }, "Parent": { "TraceID": "39a650bcd34ff70d48bbee467d728211", "SpanID": "b833d109450b779b", "TraceFlags": "01", "TraceState": "", "Remote": false }, "SpanKind": 3, "StartTime": "2025-05-06T17:59:30.90146+05:30", "EndTime": "2025-05-06T17:59:36.665605042+05:30", "Attributes": [ { "Key": "db.system", "Value": { "Type": "STRING", "Value": "cosmosdb" } }, { "Key": "db.cosmosdb.connection_mode", "Value": { "Type": "STRING", "Value": "gateway" } }, { "Key": "db.namespace", "Value": { "Type": "STRING", "Value": "demodb-gosdk3" } }, { "Key": "db.collection.name", "Value": { "Type": "STRING", "Value": "democontainer" } }, { "Key": "db.operation.name", "Value": { "Type": "STRING", "Value": "query_items" } }, { "Key": "server.address", "Value": { "Type": "STRING", "Value": "ACCOUNT_NAME.documents.azure.com" } }, { "Key": "az.namespace", "Value": { "Type": "STRING", "Value": "Microsoft.DocumentDB" } }, { "Key": "db.cosmosdb.request_charge", "Value": { "Type": "STRING", "Value": "2.37" } }, { "Key": "db.cosmosdb.status_code", "Value": { "Type": "INT64", "Value": 200 } } ], //.... Refer to Semantic Conventions for Microsoft Cosmos DB. What About Other Metrics? When executing queries, you can get basic metrics about the query execution. The Go SDK provides a way to access these metrics through the QueryResponse struct in the QueryItemsResponse object. This includes information about the query execution, including the number of documents retrieved, etc. Plain Text func queryMetrics() { //.... container, err := c.NewContainer("existing_db", "existing_container") if err != nil { log.Fatal(err) } query := "SELECT * FROM c" pager := container.NewQueryItemsPager(query, azcosmos.NewPartitionKey(), nil) for pager.More() { queryResp, err := pager.NextPage(context.Background()) if err != nil { log.Fatal("query items failed:", err) } log.Println("query metrics:\n", *queryResp.QueryMetrics) //.... } } The query metrics are provided as a simple raw string in a key-value format (semicolon-separated), which is very easy to parse. Here is an example: Plain Text totalExecutionTimeInMs=0.34;queryCompileTimeInMs=0.04;queryLogicalPlanBuildTimeInMs=0.00;queryPhysicalPlanBuildTimeInMs=0.02;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.07;indexLookupTimeInMs=0.00;instructionCount=41;documentLoadTimeInMs=0.04;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=9;retrievedDocumentSize=1251;outputDocumentCount=9;outputDocumentSize=2217;writeOutputTimeInMs=0.02;indexUtilizationRatio=1.00 Here is a breakdown of the metrics you can obtain from the query response: Plain Text | Metric | Unit | Description | | ------------------------------ | ----- | ------------------------------------------------------------ | | totalExecutionTimeInMs | ms | Total time taken to execute the query, including all phases. | | queryCompileTimeInMs | ms | Time spent compiling the query. | | queryLogicalPlanBuildTimeInMs | ms | Time spent building the logical plan for the query. | | queryPhysicalPlanBuildTimeInMs | ms | Time spent building the physical plan for the query. | | queryOptimizationTimeInMs | ms | Time spent optimizing the query. | | VMExecutionTimeInMs | ms | Time spent executing the query in the Cosmos DB VM. | | indexLookupTimeInMs | ms | Time spent looking up indexes. 
Here is a breakdown of the metrics you can obtain from the query response:

Plain Text

| Metric                         | Unit  | Description                                                   |
| ------------------------------ | ----- | ------------------------------------------------------------- |
| totalExecutionTimeInMs         | ms    | Total time taken to execute the query, including all phases.  |
| queryCompileTimeInMs           | ms    | Time spent compiling the query.                               |
| queryLogicalPlanBuildTimeInMs  | ms    | Time spent building the logical plan for the query.           |
| queryPhysicalPlanBuildTimeInMs | ms    | Time spent building the physical plan for the query.          |
| queryOptimizationTimeInMs      | ms    | Time spent optimizing the query.                              |
| VMExecutionTimeInMs            | ms    | Time spent executing the query in the Cosmos DB VM.           |
| indexLookupTimeInMs            | ms    | Time spent looking up indexes.                                |
| instructionCount               | count | Number of instructions executed for the query.                |
| documentLoadTimeInMs           | ms    | Time spent loading documents from storage.                    |
| systemFunctionExecuteTimeInMs  | ms    | Time spent executing system functions in the query.           |
| userFunctionExecuteTimeInMs    | ms    | Time spent executing user-defined functions in the query.     |
| retrievedDocumentCount         | count | Number of documents retrieved by the query.                   |
| retrievedDocumentSize          | bytes | Total size of documents retrieved.                            |
| outputDocumentCount            | count | Number of documents returned as output.                       |
| outputDocumentSize             | bytes | Total size of output documents.                               |
| writeOutputTimeInMs            | ms    | Time spent writing the output.                                |
| indexUtilizationRatio          | ratio | Ratio of index utilization (1.0 means fully utilized).        |

Conclusion

In this blog, we covered how to configure and customize the Go SDK for Azure Cosmos DB. We looked at retry policies, HTTP-level customizations, OpenTelemetry support, and how to access query metrics. The Go SDK for Azure Cosmos DB is designed to be flexible and customizable, allowing you to tailor it to your specific needs. For more information, refer to the package documentation and the GitHub repository. I hope you find this useful!

Resources

  • Go SDK for Azure Cosmos DB
  • Core Azure Go SDK package
  • ClientOptions
  • NewClient

By Abhishek Gupta DZone Core CORE
Fixing Common Oracle Database Problems

Lots of businesses use Oracle databases to store their important data. These databases mostly work fine, but sometimes they run into issues. Anyone who's worked with Oracle knows the feeling when things go wrong. Don't worry, though — these problems happen to everyone, and most fixes are actually pretty easy once you know what you are doing. I'll show you the usual Oracle headaches and how to fix them.

1. The "Snapshot Too Old" Error (ORA-01555)

What's Happening

Oracle is basically saying, "I can't remember what that data looked like anymore" when your query runs too long.

Why It Happens

  • Oracle already overwrote the old undo data it was keeping for reference.
  • Your query is taking longer than Oracle is set to remember things.
  • You are committing changes too often in a loop.

How to Fix It

Tell Oracle to remember things longer.

SQL

ALTER SYSTEM SET UNDO_RETENTION = 2000;

  • Don't put COMMIT statements inside loops.
  • Improve the performance of the queries by adding appropriate indexes.

2. The "Resource Busy" Error (ORA-00054)

What's Happening

You are trying to update something that another session or process is already using.

Why It Happens

Another user or process has locked the table or row you want.

How to Fix It

Find out who is blocking it.

SQL

SELECT s1.sid AS blocked_session_id,
       s1.serial# AS blocked_serial_num,
       s1.username AS blocked_user,
       s1.machine AS blocked_machine,
       s1.blocking_session AS blocking_session_id,
       s1.sql_id AS blocked_sql_id
FROM v$session s1
WHERE s1.blocking_session IS NOT NULL
ORDER BY s1.blocking_session;

If needed, kill the blocking session, or just tell Oracle to wait a bit instead of giving up right away.

SQL

ALTER SYSTEM KILL SESSION 'sid, serial#' IMMEDIATE;

3. Sudden Disconnection Error (ORA-03113)

What's Happening

Your connection to the database dropped unexpectedly.

Why It Happens

  • Network issues
  • The database crashed or restarted
  • Software bugs

How to Fix It

  • Keep an eye on the alert log and make sure the database has not crashed.
  • Look at the trace files for clues.
  • Make sure your network is stable.
  • Update Oracle if it's a known bug.

4. Permission Denied Error (ORA-01031)

What's Happening

Oracle won't let you do something because you don't have permission.

Why It Happens

Your user account doesn't have the right privileges.

How to Fix It

Grant the permission you need.

SQL

GRANT CREATE TABLE TO username;

To look at someone else's data:

SQL

GRANT SELECT ON hr.employees TO app_user;

5. Password Expired Error (ORA-28001)

What's Happening

The user's password has expired.

Why It Happens

Oracle is enforcing password expiration rules.

How to Fix It

Reset the password:

SQL

ALTER USER username IDENTIFIED BY new_password;

Or stop passwords from expiring.

SQL

ALTER PROFILE DEFAULT LIMIT PASSWORD_LIFE_TIME UNLIMITED;

6. Can't Connect Error (ORA-12154)

What's Happening

Oracle doesn't understand how to connect to the database you are asking for.

Why It Happens

The connection info is wrong or missing in your setup files.

How to Fix It

  • Check your tnsnames.ora file for mistakes.
  • Make sure the service name matches.
  • Try the easy connect (EZConnect) format instead.

SQL

sqlplus user/password@//host:port/service_name

7. No More Space Error (ORA-01653)

What's Happening

You can't add more data because you are out of space.

Why It Happens

Your tablespace is full and cannot grow automatically.

How to Fix It

Add another data file:

SQL

ALTER TABLESPACE users ADD DATAFILE '/path/users10.dbf' SIZE 250M AUTOEXTEND ON MAXSIZE 500M;
Or let an existing file grow by resizing it (or turning on AUTOEXTEND):

SQL

ALTER DATABASE DATAFILE '/path/users11.dbf' RESIZE 500M;

Keep an eye on your space.

SQL

SELECT tablespace_name, used_space, tablespace_size FROM dba_tablespace_usage_metrics;

8. Internal Error (ORA-00600)

What's Happening

Oracle came across a problem that it doesn't know how to handle.

Why It Happens

  • Memory or data corruption
  • Hardware failures
  • Incompatible parameter settings

How to Fix It

  • Run DBVERIFY or ANALYZE to check whether the database is corrupted; if it is, restore it from a backup.
  • Work with Oracle Support and share the logs and errors to help debug.

9. Super Slow Queries

What's Happening

This is a common problem where query performance degrades as the data grows.

Why It Happens

  • Poorly written SQL
  • Missing indexes
  • Outdated statistics
  • Running queries without filters

How to Fix It

See how Oracle is running your query.

SQL

EXPLAIN PLAN FOR type_your_query_here;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

  • Add indexes where needed.
  • Use WHERE clauses properly so only the rows you need are returned.
  • Update the statistics.

SQL

EXEC DBMS_STATS.GATHER_TABLE_STATS('schema', 'table_name');

10. Corrupted Data

What's Happening

Part of your database file got corrupted or damaged.

Why It Happens

  • Hardware failures
  • Sudden shutdowns
  • Software bugs

How to Fix It

Find the bad blocks.

SQL

SELECT * FROM v$database_block_corruption;

Use RMAN to repair them.

SQL

RMAN> BLOCKRECOVER DATAFILE 5 BLOCK 233 TO 245;

Mark blocks that can't be fixed.

SQL

EXEC DBMS_REPAIR.ADMIN_TABLES(TRUE, FALSE, 'REPAIR_TABLE');

11. High CPU Usage by Oracle Applications

What's Happening

Oracle is using too much CPU and slowing everything down.

Why It Happens

  • Inefficient queries
  • Missing indexes
  • Too many background processes

How to Fix It

Find the queries that consume the most CPU.

SQL

SELECT * FROM v$sql ORDER BY cpu_time DESC FETCH FIRST 20 ROWS ONLY;

  • Fix those queries.
  • Run performance reports (such as AWR) if you have the license.
  • Consider moving old data to archives.

Tips to Avoid Problems

  • Always have backups. This is the key to any database management, since the system can always be restored to avoid data loss.
  • Update statistics regularly. Oracle needs current information to work well.
  • Check logs often. Catch problems early by analyzing the logs periodically.
  • Test before production. Try changes in a test/stage environment first so that the majority of issues are caught before promoting the code to prod.
  • Set up automatic health checks. Scheduled jobs keep everything aligned by running routine checks on time.

Conclusion

Working with Oracle databases gets easier with practice. A lot of the problems you will run into are just the same few issues that everyone deals with. The more you work with these issues, the faster you'll spot and fix them. Hopefully, this article makes your database work a little easier and less stressful.

By Dhaval Patolia
Debugging Core Dump Files on Linux - A Detailed Guide

Core dumps play a key role in debugging programs that exit abnormally. They preserve the state of a program at the moment of failure, so programmers can inspect it and identify the cause of the crash. This article walks through enabling, creating, and examining core dumps on Linux, step by step, and touches on advanced tools and techniques for debugging more complex failures so you can diagnose and resolve them quickly.

1. Enabling Core Dumps in Linux

Check and Set ulimit

Check the current limit for core dumps:

Shell

ulimit -c

Output:

Shell

0

A value of 0 tells us that core dumps are disabled.

Enable core dumps temporarily:

Shell

ulimit -c unlimited

Check again:

Shell

ulimit -c

Output:

Shell

unlimited

To make it persistent, add the following to /etc/security/limits.conf:

Shell

* soft core unlimited
* hard core unlimited

Configure the location for core dumps:

Shell

sudo sysctl -w kernel.core_pattern=/var/dumps/core.%e.%p

%e: Program name
%p: Process ID

Make it persistent by adding the following to /etc/sysctl.conf:

Shell

kernel.core_pattern=/var/dumps/core.%e.%p

Reload the configuration:

Shell

sudo sysctl -p

2. Generate Core Dumps for Testing

C Program to Cause a Segfault

Create a test program:

C

#include <stdio.h>

int main() {
    int *ptr = NULL; // Null pointer
    *ptr = 1;        // Segmentation fault
    return 0;
}

Build it:

Shell

gcc -g -o crash_test crash_test.c

Execute the program:

Shell

./crash_test

Output:

Shell

Segmentation fault (core dumped)

Check the location of the core dump:

Shell

ls /var/dumps

Example:

Shell

core.crash_test.12345

3. Analyze with GDB

Load the Core Dump

Shell

gdb ./crash_test /var/dumps/core.crash_test.12345

Basic Analysis

Get Backtrace

Shell

bt

Output:

Shell

#0  0x0000000000401132 in main () at crash_test.c:4
4       *ptr = 1;

The backtrace shows that the crash happened at line 4.

Examine Variables

Shell

info locals

Output:

Shell

ptr = (int *) 0x0

The variable ptr is a null pointer, which confirms the segmentation fault.

Disassemble the Code

Shell

disassemble main

Output:

Shell

Dump of assembler code for function main:
   0x000000000040112a <+0>:  mov  %rsp,%rbp
   0x000000000040112d <+3>:  movl $0x0,-0x4(%rbp)
   0x0000000000401134 <+10>: movl $0x1,(%rax)

This displays the actual assembly instructions at the crash location.

Check Registers

Shell

info registers

Output:

Shell

rax  0x0  0
rbx  0x0  0
rcx  0x0  0

Register rax is 0, confirming the null pointer dereference.

4. Debugging Multithreaded Applications

Check Threads

Shell

info threads

Output:

Shell

Id  Target Id                                   Frame
* 1 Thread 0x7f64c56 (LWP 12345) "crash_test"   main () at crash_test.c:4

Switch to a Specific Thread

Shell

thread 1

Get Backtrace for All Threads

Shell

thread apply all bt

5. Using Advanced Tools

Valgrind – Memory Issue Analysis

Shell

valgrind --tool=memcheck ./crash_test

Output:

Shell

Invalid write of size 4
  at 0x401132: main (crash_test.c:4)
  Address 0x0 is not stack'd, malloc'd or (recently) free'd

This confirms an invalid memory access.

elfutils for Symbol Inspection

Shell

eu-readelf -a /var/dumps/core.crash_test.12345

Output:

Shell

Program Headers:
  LOAD 0x000000 0x004000 0x004000 0x1234 bytes

Displays the sections and symbol information in the core file.

Crash Utility for Kernel Dumps

Shell

sudo crash /usr/lib/debug/vmlinux /var/crash/core

Use this for kernel-space core dumps.

Generate Core Dumps for Running Processes

Shell

gcore <PID>

6. Debugging Specific Issues

Segmentation Faults

Shell

info frame
x/16xw $sp

Examine the memory near the stack pointer.
Heap Corruption

Check for heap corruption with Valgrind or AddressSanitizer:

Shell

gcc -fsanitize=address -g -o crash_test crash_test.c
./crash_test

Shared Library Mismatch

Shell

info shared
ldd ./crash_test

Verify that the expected shared libraries are loaded.

7. Best Practices for Core Dump Debugging

  • Save Symbols Independently: Deploy stripped binaries to production and store the debug symbols securely.
  • Automate Dump Handling: Use systemd-coredump for efficient dump management.
  • Analyze Logs: Enable full application logging to track run-time faults.
  • Redact Sensitive Information: Remove sensitive data from core dumps before sharing them.
  • Use Debug Builds for Testing: Build with full symbols for in-depth debugging.

8. Conclusion

Debugging a core dump in Linux is a logical progression that relies on tools such as GDB, Valgrind, and the Crash utility. By carefully examining backtraces, memory state, and register values, developers can pinpoint root causes and fix them quickly. Following these best practices yields more efficient diagnostics and faster resolution in production-critical environments.

By Srinivas Chippagiri DZone Core CORE

Top Tools Experts

expert thumbnail

Abhishek Gupta

Principal PM, Azure Cosmos DB,
Microsoft

I mostly work on open-source technologies including distributed data systems, Kubernetes and Go
expert thumbnail

Yitaek Hwang

Software Engineer,
NYDIG

The Latest Tools Topics

article thumbnail
AWS to Azure Migration: A Cloudy Journey of Challenges and Triumphs
Migrating from AWS to Azure isn’t a simple swap — it needs planning, testing, and adaptation, from cost benefits to Microsoft integration.
May 15, 2025
by Abhi Sangani
· 2,216 Views
article thumbnail
Vibe Coding With GitHub Copilot: Optimizing API Performance in Fintech Microservices
Can GitHub Copilot optimize fintech APIs? We test its performance, help, vibe coding flow, real-world impact, and limits.
May 15, 2025
by Sibasis Padhi
· 3,532 Views · 3 Likes
article thumbnail
A Simple, Convenience Package for the Azure Cosmos DB Go SDK
Learn about cosmosdb-go-sdk-helper: Simplify Azure Cosmos DB operations with Go. Features auth, queries, error handling, metrics, and Azure Functions support.
May 14, 2025
by Abhishek Gupta DZone Core CORE
· 1,770 Views
article thumbnail
Concourse CI/CD Pipeline: Webhook Triggers
Learn how to set up Concourse CI/CD pipelines, integrate GitHub webhooks, and deploy automation efficiently with step-by-step Docker and YAML examples.
May 13, 2025
by Karthigayan Devan
· 2,394 Views · 2 Likes
article thumbnail
Automatic Code Transformation With OpenRewrite
Learn how OpenRewrite enhances automated refactoring, improves code quality, and tackles maintenance challenges with real-world examples and benefits.
May 9, 2025
by Gangadhararamachary Ramadugu
· 2,777 Views · 1 Like
article thumbnail
A Complete Guide to Modern AI Developer Tools
This guide explores the most impactful AI developer tools, highlighting their features, installation steps, strengths, and limitations.
May 9, 2025
by Vidyasagar (Sarath Chandra) Machupalli FBCS DZone Core CORE
· 3,134 Views · 5 Likes
article thumbnail
Immutable Secrets Management: A Zero-Trust Approach to Sensitive Data in Containers
Immutable secrets and Zero-Trust on Amazon Web Services boost container security, delivery, and resilience, aligning with ChaosSecOps for DevOps awards.
May 9, 2025
by Ramesh Krishna Mahimalur
· 3,478 Views · 4 Likes
article thumbnail
Mastering Advanced Traffic Management in Multi-Cloud Kubernetes: Scaling With Multiple Istio Ingress Gateways
Optimize traffic in multi-cloud Kubernetes with multiple Istio Ingress Gateways. Learn how they enhance scalability, security, and control through best practices.
May 8, 2025
by Prabhu Chinnasamy
· 4,475 Views · 22 Likes
article thumbnail
How to Configure and Customize the Go SDK for Azure Cosmos DB
Explore Azure Cosmos DB Go SDK: Configure retry policies, customize HTTP pipelines, implement OpenTelemetry tracing, and analyze detailed query metrics.
May 8, 2025
by Abhishek Gupta DZone Core CORE
· 3,156 Views · 1 Like
article thumbnail
Testing SingleStore's MCP Server
Learn how to install, configure, and run the SingleStore MCP Server with MCPHost to enable seamless interaction between LLMs and external tools or data.
May 8, 2025
by Akmal Chaudhri DZone Core CORE
· 2,157 Views · 4 Likes
article thumbnail
Event-Driven Architectures: Designing Scalable and Resilient Cloud Solutions
Learn how to enhance scalability, resilience, and efficiency in cloud solutions using event-driven architectures with this step-by-step guide.
May 7, 2025
by Srinivas Chippagiri DZone Core CORE
· 4,195 Views · 4 Likes
article thumbnail
Streamlining Event Data in Event-Driven Ansible
Learn how ansible.eda.json_filter and ansible.eda.normalize_keys streamline event data for Ansible automation. Simplify, standardize, and speed up workflows.
May 6, 2025
by Binoj Melath Nalinakshan Nair DZone Core CORE
· 2,200 Views · 6 Likes
article thumbnail
Beyond Linguistics: Real-Time Domain Event Mapping with WebSocket and Spring Boot
Build a scalable real-time notification system using Spring Boot and WebSocket, focusing on domain event mapping, system design, and more.
May 6, 2025
by Soham Sengupta
· 1,657 Views · 2 Likes
article thumbnail
Unlocking the Benefits of a Private API in AWS API Gateway
Unlock new opportunities with Private APIs while staying vigilant against data exposure and unauthorized access. Learn how to secure your services effectively today.
May 5, 2025
by Satrajit Basu DZone Core CORE
· 3,816 Views · 2 Likes
article thumbnail
Mastering Fluent Bit: Installing and Configuring Fluent Bit on Kubernetes (Part 3)
This intro to mastering Fluent Bit covers what Fluent Bit is, why you want to use it on Kubernetes, and how to get it collecting logs on a cluster in minutes.
May 5, 2025
by Eric D. Schabell DZone Core CORE
· 2,713 Views · 2 Likes
article thumbnail
Microsoft Azure Synapse Analytics: Scaling Hurdles and Limitations
Azure Synapse faces serious challenges limiting its use in the Enterprise Data space, impacting performance and functionality.
May 2, 2025
by Vamshidhar Morusu
· 3,254 Views · 1 Like
article thumbnail
Docker Base Images Demystified: A Practical Guide
This article explores different base image types — scratch, Alpine, and distroless — and shares practical tips for building efficient, secure Docker images.
May 2, 2025
by Istvan Foldhazi
· 6,536 Views · 8 Likes
article thumbnail
Zero Trust for AWS NLBs: Why It Matters and How to Do It
Implementing Zero Trust with NLB helps create robust security for your network while preserving the performance benefits of network load balancing (NLB).
May 1, 2025
by Sathish Holla
· 4,095 Views · 2 Likes
article thumbnail
Docker Model Runner: Streamlining AI Deployment for Developers
Docker Model Runner is a tool introduced to simplify running and testing AI models locally, integrating seamlessly into existing workflows.
April 30, 2025
by Ram Chandra Sachan
· 22,602 Views · 14 Likes
article thumbnail
A Guide to Container Runtimes
The blog explores how to navigate various high-level container runtimes and key parameters to consider when making a choice.
April 30, 2025
by Rachit Jain
· 4,663 Views · 3 Likes
