Monitoring and Observability

Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.

Latest Premium Content

Trend Report: Observability and Performance
Refcard #368: Getting Started With OpenTelemetry
Refcard #293: Getting Started With Prometheus

DZone's Featured Monitoring and Observability Resources

How to Achieve SOC 2 Compliance in AWS Cloud Environments

By Chase Bolt
Did you know that cloud security was one of the most pressing challenges of using cloud solutions in 2023? As businesses increasingly depend on cloud services like Amazon Web Services (AWS) to host their applications, securing sensitive data in the cloud becomes non-negotiable. Organizations must ensure their technology infrastructure meets the highest security standards. One such standard is SOC 2 (Systems and Organization Controls 2) compliance.

SOC 2 is more than a regulatory checkbox. It represents a business's commitment to robust security measures and instills trust in customers and stakeholders. SOC 2 compliance for AWS evaluates how securely an organization's technology setup manages data storage, processing, and transfer. Let's further discuss SOC 2 compliance, its importance in AWS, and how organizations can achieve SOC 2 compliance for AWS.

What Is SOC 2 Compliance?

SOC 2 is an auditing standard developed by the American Institute of CPAs (AICPA). It ensures organizations protect sensitive customer data by securing their systems, processes, and controls. SOC 2 is based on five Trust Services Criteria (TSC), and achieving compliance involves rigorous evaluation against these criteria:

- Security: Ensures an organization's systems and data are protected against unauthorized access, breaches, and cyber threats. It involves implementing security measures such as access controls, encryption, and firewalls.
- Availability: Assesses the organization's ability to keep its systems and services accessible and operational whenever users or stakeholders need them. This includes measures to prevent and mitigate downtime, such as redundancy, failover mechanisms, disaster recovery plans, and proactive monitoring.
- Processing integrity: Evaluates the accuracy, completeness, and reliability of the organization's processes and operations. This involves implementing checks and balances to validate the accuracy of data, and mechanisms to monitor data integrity.
- Confidentiality: Protects sensitive information from unauthorized access, disclosure, or exposure. This includes encryption, data masking, and other measures that prevent unauthorized users or entities from accessing or viewing confidential data.
- Privacy: Ensures customers' personal information is handled in compliance with relevant privacy regulations and standards. This involves implementing policies, procedures, and controls to protect individuals' privacy rights.

SOC 1 vs. SOC 2 vs. SOC 3: Head-to-Head Comparison

Understanding the key differences between SOC 1, SOC 2, and SOC 3 is essential for organizations looking to demonstrate their commitment to security and compliance. Below is a comparison highlighting various aspects of these controls.
Aspect | SOC 1 | SOC 2 | SOC 3
Scope | Financial controls | Operational and security controls | High-level operational controls
Target audience | Auditors, regulators | Customers, business partners | General audience
Focus area | Controls impacting the financial reporting of service organizations | Trust Services Criteria (Security, Availability, Processing Integrity, Confidentiality, Privacy) | Trust Services Criteria (Security, Availability, Processing Integrity, Confidentiality, Privacy)
Evaluation timeline | 6-12 months | 6-12 months | 3-6 months
Who needs to comply | Collection agencies, payroll providers, payment processing companies, etc. | SaaS companies, data hosting or processing providers, and cloud storage providers | Organizations that hold SOC 2 compliance and want to use it in marketing to a general audience

Importance of SOC 2 Compliance in AWS

Understanding AWS's shared responsibility model is important when navigating SOC 2 compliance within AWS. This model outlines the respective responsibilities of AWS and its customers: AWS secures the cloud infrastructure, while customers manage security in the cloud. This means customers are accountable for securing their data, applications, and services hosted on AWS. The model has crucial implications for SOC 2 compliance:

- Data security: As a customer, it is your responsibility to secure your data. This involves ensuring secure data transmission, implementing encryption, and controlling data access.
- Compliance management: You must ensure that your applications, services, and processes comply with SOC 2 requirements, which necessitates continuous monitoring and management.
- User access management: You are responsible for configuring AWS services to meet SOC 2 requirements, including permissions and security settings.
- Staff training: Ensure your team is adequately trained in AWS security best practices and SOC 2 requirements. This prevents non-compliance caused by misunderstanding or misuse of AWS services.

Challenges of Achieving SOC 2 Compliance in AWS

Here are some of the challenges businesses face when working toward SOC 2 compliance on AWS:

- Complexity of AWS environments: Understanding the complex architecture of AWS setups requires in-depth knowledge and expertise. It can be challenging for businesses to ensure that all components are configured securely.
- Data protection and privacy: The dynamic nature of cyber threats and the need for comprehensive measures to prevent unauthorized access can make securing sensitive data in the AWS environment challenging.
- Evolving compliance requirements: Adapting to changing compliance standards requires constant monitoring and updating of policies and procedures, which can strain resources and expertise.
- Training and awareness: Ensuring that all personnel are adequately trained and aware of their roles and responsibilities in maintaining compliance can be difficult, especially in large organizations with diverse teams and skill sets.
- Scalability: As AWS environments grow, ensuring security measures can scale effectively to meet increasing demands becomes complex. Scaling security controls alongside business growth while staying compliant adds another layer of complexity.

How Organizations Can Achieve SOC 2 Compliance for Their AWS Cloud Environments

Achieving SOC 2 compliance in AWS involves a structured approach to ensure the best security practices. Here's a step-by-step guide:

1. Assess Your Current Landscape

Start by conducting a comprehensive assessment of your current AWS environment. Examine existing security processes and controls, and identify potential vulnerabilities and compliance gaps against SOC 2 requirements. This stage includes internal audits, risk assessments, and evaluation of existing policies and procedures.

2. Identify Required Security Controls

Develop a thorough security program detailing all security controls required to meet SOC 2 compliance, including measures for data protection, access controls, system monitoring, and more. You can also access the AWS SOC report via the AWS Artifact tool, which provides a comprehensive list of security controls.

3. Use AWS Tools for SOC 2 Compliance

Leverage the suite of security tools AWS offers to facilitate SOC 2 compliance. These include:

- AWS Identity and Access Management (IAM): Administers access to AWS services and resources.
- AWS Config: Enables you to review, audit, and analyze the configurations of your AWS resources.
- AWS Key Management Service (KMS): Simplifies the creation and administration of cryptographic keys and controls their usage across AWS services and within your applications.
- AWS CloudTrail: Records the AWS API calls made within your account, including activity from the AWS SDKs, the AWS Management Console, command line tools, and other AWS services.

4. Develop Documentation of Security Policies

Document your organization's security policies and procedures in alignment with SOC 2 requirements. This includes creating detailed documentation outlining security controls, processes, and responsibilities.

5. Enable Continuous Monitoring

Implement continuous monitoring mechanisms to track security events and compliance status in real time. Use AWS services like Amazon GuardDuty, AWS Config, and AWS Security Hub to automate monitoring and ensure ongoing compliance with SOC 2 standards. (A small scripted check is sketched below.)
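To make the continuous-monitoring step concrete, here is a minimal sketch (not from the original article) that uses boto3 to pull the current compliance state of your AWS Config rules. The region and the flat-summary output are assumptions; in practice you would feed results like these into your evidence-collection or alerting pipeline.

Python

import boto3

# Minimal sketch: summarize AWS Config rule compliance as lightweight evidence
# for the continuous-monitoring step. Assumes AWS Config is already recording
# in this region and that credentials are configured (e.g., via `aws configure`).
config = boto3.client("config", region_name="us-east-1")

def compliance_summary():
    results = {}
    response = config.describe_compliance_by_config_rule()
    while True:
        for rule in response["ComplianceByConfigRules"]:
            results[rule["ConfigRuleName"]] = rule["Compliance"]["ComplianceType"]
        token = response.get("NextToken")
        if not token:
            break
        response = config.describe_compliance_by_config_rule(NextToken=token)
    return results

if __name__ == "__main__":
    for rule_name, state in sorted(compliance_summary().items()):
        marker = "" if state == "COMPLIANT" else "  <-- review"
        print(f"{rule_name}: {state}{marker}")

One common pattern is to run a check like this on a schedule (for example, from a Lambda function) and route non-compliant findings to Security Hub or a ticketing system so audit evidence stays current between formal reviews.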
Typical SOC 2 Compliance Process Timeline

The SOC 2 compliance process usually spans 6 to 12 months and consists of several phases, from preparation to achieving compliance:

- Preparation (1-2 months): Assess current security practices and identify gaps. Then develop a plan to address the identified gaps while configuring AWS services and updating policies.
- Implementation (3-6 months): Execute the AWS configurations outlined in the preparation phase. Implement the security controls and measures needed to align with SOC 2 standards.
- Documentation (1-2 months): Gather documentation of the AWS environment, cataloging policies, procedures, and operational practices. Conduct an internal review to ensure the documentation is complete and aligned with SOC 2 requirements.
- Auditing (1-2 months): Engage a qualified auditor with expertise in evaluating AWS environments for SOC 2 compliance and collaborate with them through the audit process. After the audit, the auditor provides a detailed SOC 2 report.

Conclusion

Achieving SOC 2 compliance in AWS requires planning, rigorous implementation, and an ongoing commitment to security best practices. Organizations can navigate SOC 2 compliance by following the shared responsibility model, using AWS tools, and maintaining continuous vigilance. As cloud-hosted applications take over the digital space, prioritizing security and compliance becomes crucial. With the right approach and dedication, organizations can attain SOC 2 compliance and strengthen their position as a trusted partner.
Your Kubernetes Survival Kit: Master Observability, Security, and Automation

By Prabhu Chinnasamy
Kubernetes has become the de facto standard for orchestrating containerized applications. As organizations increasingly embrace cloud-native architectures, ensuring observability, security, policy enforcement, progressive delivery, and autoscaling is like ensuring your spaceship has enough fuel, oxygen, and a backup plan before launching into the vastness of production.

With the rise of multi-cloud and hybrid cloud environments, Kubernetes observability and control mechanisms must be as adaptable as a chameleon, scalable like your favorite meme stock, and technology-agnostic like a true DevOps pro. Whether you're managing workloads on AWS, Azure, GCP, or an on-premises Kubernetes cluster, having a robust ecosystem of tools is not a luxury — it's a survival kit for monitoring applications, enforcing security policies, automating deployments, and optimizing performance.

In this article, we dive into some of the most powerful Kubernetes-native tools that transform observability, security, and automation from overwhelming challenges into powerful enablers. We will explore tools for:

- Tracing and observability: Jaeger, Prometheus, Thanos, Grafana Loki
- Policy enforcement: OPA, Kyverno
- Progressive delivery: Flagger, Argo Rollouts
- Security and monitoring: Falco, Tetragon, Datadog Kubernetes Agent
- Autoscaling: Keda
- Networking and service mesh: Istio, Linkerd
- Deployment validation and SLO monitoring: Keptn

So, grab your Kubernetes control panel, adjust your monitoring dashboards, and let's navigate the wild, wonderful, and sometimes wacky world of Kubernetes observability and reliability!

This diagram illustrates key Kubernetes tools for observability, security, deployment, and scaling. Each category highlights tools like Prometheus, OPA, Flagger, and Keda to enhance reliability and performance.

Why These Tools Matter in a Multi-Cloud Kubernetes World

Kubernetes is a highly dynamic system, managing thousands of microservices, scaling resources based on demand, and orchestrating deployments across different cloud providers. The complexity of Kubernetes requires a comprehensive observability and control strategy to ensure application health, security, and compliance.

Observability: Understanding System Behavior

Without proper monitoring and tracing, identifying bottlenecks, debugging issues, and optimizing performance becomes a challenge. Tools like Jaeger, Prometheus, Thanos, and Grafana Loki provide full visibility into distributed applications, ensuring that every microservice interaction is tracked, logged, and analyzed.

Policy Enforcement: Strengthening Security and Compliance

As Kubernetes clusters grow, managing security policies and governance becomes critical. Tools like OPA and Kyverno allow organizations to enforce fine-grained policies, ensuring that only compliant configurations and access controls are deployed across clusters.

Progressive Delivery: Reducing Deployment Risks

Modern DevOps and GitOps practices rely on safe, incremental releases. Flagger and Argo Rollouts automate canary deployments, blue-green rollouts, and A/B testing, ensuring that new versions of applications are introduced without downtime or major disruptions.

Security and Monitoring: Detecting Threats in Real Time

Kubernetes workloads are dynamic, making security a continuous process. Falco, Tetragon, and the Datadog Kubernetes Agent monitor runtime behavior, detect anomalies, and prevent security breaches by providing deep visibility into container- and node-level activities.
Autoscaling: Optimizing Resource Utilization

Kubernetes offers built-in Horizontal Pod Autoscaling (HPA), but many workloads require event-driven scaling beyond CPU and memory thresholds. Keda enables scaling based on real-time events, such as queue length, message brokers, and custom business metrics.

Networking and Service Mesh: Managing Microservice Communication

In large-scale microservice architectures, network traffic management is essential. Istio and Linkerd provide service mesh capabilities, ensuring secure, reliable, and observable communication between microservices while optimizing network performance.

Deployment Validation and SLO Monitoring: Ensuring Reliable Releases

Keptn automates deployment validation, ensuring that applications meet service-level objectives (SLOs) before rolling out to production. This helps maintain stability and improve reliability in cloud-native environments.

Comparison of Key Tools

While each tool serves a distinct purpose, some overlap in functionality. Below is a comparison of key tools that offer similar capabilities:

Category | Tool 1 | Tool 2 | Key Difference
Tracing and observability | Jaeger | Tracestore | Jaeger is widely adopted for tracing, whereas Tracestore is an emerging alternative.
Policy enforcement | OPA | Kyverno | OPA uses Rego, while Kyverno offers Kubernetes-native CRD-based policies.
Progressive delivery | Flagger | Argo Rollouts | Flagger integrates well with service meshes; Argo Rollouts is optimized for GitOps workflows.
Security monitoring | Falco | Tetragon | Falco focuses on runtime security alerts, while Tetragon extends eBPF-based monitoring.
Networking and service mesh | Istio | Linkerd | Istio offers more advanced features but is complex; Linkerd is simpler and lightweight.

1. Tracing and Observability With Jaeger

What is Jaeger?

Jaeger is an open-source distributed tracing system designed to help Kubernetes users monitor and troubleshoot transactions in microservice architectures. Originally developed by Uber, it has become a widely adopted solution for end-to-end request tracing.

Why Use Jaeger in Kubernetes?

- Distributed tracing: Provides visibility into request flows across multiple microservices.
- Performance bottleneck detection: Helps identify slow service interactions and dependencies.
- Root cause analysis: Enables debugging of latency issues and failures.
- Seamless integration: Works well with Prometheus, OpenTelemetry, and Grafana.
- Multi-cloud ready: Deployable across AWS, Azure, and GCP Kubernetes clusters for global observability.

Comparison: Jaeger vs. Tracestore

Feature | Jaeger | Tracestore
Adoption | Widely adopted in Kubernetes environments | Emerging solution
Open source | Yes | Limited information available
Integration | Works with OpenTelemetry, Prometheus, and Grafana | Less integration support
Use case | Distributed tracing, root cause analysis | Similar use case but less proven

Jaeger is the preferred choice for most Kubernetes users due to its mature ecosystem, active community, and strong integration capabilities.

How Jaeger Is Used in Multi-Cloud Environments

Jaeger can be deployed in multi-cluster and multi-cloud environments by:

- Deploying Jaeger as a Kubernetes service to trace transactions across microservices.
- Using OpenTelemetry for tracing and sending trace data to Jaeger for analysis.
- Storing trace data in distributed storage solutions like Elasticsearch or Cassandra for scalability.
- Integrating with Grafana to visualize trace data alongside Kubernetes metrics.

In short, Jaeger is an essential tool for observability and debugging in modern cloud-native architectures. Whether running Kubernetes workloads on-premises or across multiple cloud providers, it provides a robust solution for distributed tracing and performance monitoring.

This diagram depicts Jaeger tracing the flow of requests across multiple services (e.g., Service A → Service B → Service C). Jaeger UI visualizes the traces, helping developers analyze latency issues, bottlenecks, and request paths in microservices architectures.
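As a concrete illustration of the instrumentation side (not part of the original article), here is a minimal Python sketch that emits spans with OpenTelemetry over OTLP to a collector that forwards to Jaeger; the service name, endpoint, operation names, and attribute are placeholder assumptions for your environment. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed.

Python

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Assumed endpoint: an OTLP/gRPC receiver (a recent Jaeger collector, or an
# OpenTelemetry Collector in front of Jaeger).
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="jaeger-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def place_order(order_id: str) -> None:
    # Each unit of work becomes a span; nested calls become child spans automatically.
    with tracer.start_as_current_span("place-order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge-card"):
            pass  # call the payment service here

place_order("12345")

Once every service is instrumented like this and propagates trace context on its outbound calls, the Jaeger UI can stitch the spans into the end-to-end request paths described above.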
Observability With Prometheus

What is Prometheus?

Prometheus is an open-source monitoring and alerting toolkit designed specifically for cloud-native environments. As part of the Cloud Native Computing Foundation (CNCF), it has become the default monitoring solution for Kubernetes due to its reliability, scalability, and deep integration with containerized applications.

Why Use Prometheus in Kubernetes?

- Time-series monitoring: Captures metrics in a time-series format, enabling historical analysis.
- Powerful query language (PromQL): Allows users to filter, aggregate, and analyze metrics efficiently.
- Scalability: Handles massive workloads across large Kubernetes clusters.
- Multi-cloud deployment: Can be deployed across AWS, Azure, and GCP Kubernetes clusters for unified observability.
- Integration with Grafana: Provides real-time dashboards and visualizations.
- Alerting mechanism: Works with Alertmanager to notify teams about critical issues.

How Prometheus Works in Kubernetes

Prometheus scrapes metrics from various sources within the Kubernetes cluster, including:

- The Kubernetes API server for node and pod metrics.
- Application endpoints exposing Prometheus-formatted metrics.
- Node exporters for host-level system metrics.
- Custom metrics exporters for application-specific insights.

How Prometheus Is Used in Multi-Cloud Environments

Prometheus supports multi-cloud observability by:

- Deploying Prometheus instances per cluster to collect and store local metrics.
- Using Thanos or Cortex for long-term storage, enabling centralized querying across multiple clusters.
- Integrating with Grafana to visualize data from different cloud providers in a single dashboard.
- Leveraging Alertmanager to route alerts dynamically based on cloud-specific policies.

In short, Prometheus is the go-to monitoring solution for Kubernetes, providing powerful observability into containerized workloads. When combined with Grafana, Thanos, and Alertmanager, it forms a comprehensive monitoring stack suitable for both single-cluster and multi-cloud environments.

This diagram shows how Prometheus scrapes metrics from multiple services (e.g., Service 1 and Service 2) and sends the collected data to Grafana for visualization. Grafana serves as the user interface where metrics are displayed in dashboards for real-time monitoring and alerting.
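To show what an "application endpoint exposing Prometheus-formatted metrics" can look like, here is a small sketch (not from the original article) using the prometheus_client Python library; the metric names, labels, and port are illustrative assumptions rather than a prescribed schema.

Python

import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metrics; names and labels are assumptions, not a required schema.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["path"])

def handle_request(path: str) -> None:
    with LATENCY.labels(path=path).time():      # records the duration into the histogram
        time.sleep(random.uniform(0.01, 0.1))   # simulated work
    REQUESTS.labels(path=path, status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)   # serves /metrics for Prometheus to scrape
    while True:
        handle_request("/checkout")

In a cluster, a Service plus a scrape annotation (or a ServiceMonitor, if you run the Prometheus Operator) is typically enough for Prometheus to discover and scrape an endpoint like this.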
Long-Term Metrics Storage With Thanos

What is Thanos?

Thanos is an open-source system designed to extend Prometheus' capabilities by providing long-term metrics storage, high availability, and federated querying across multiple clusters. It ensures that monitoring data is retained for extended periods while allowing centralized querying of distributed Prometheus instances.

Why Use Thanos in Kubernetes?

- Long-term storage: Retains Prometheus metrics indefinitely, overcoming local retention limits.
- High availability: Ensures continued access to metrics even if a Prometheus instance fails.
- Multi-cloud and multi-cluster support: Enables federated monitoring across Kubernetes clusters on AWS, Azure, and GCP.
- Query federation: Aggregates data from multiple Prometheus instances into a single view.
- Cost-effective storage: Supports object storage backends like Amazon S3, Google Cloud Storage, and Azure Blob Storage.

How Thanos Works With Prometheus

Thanos extends Prometheus by introducing the following components:

- Sidecar: Attaches to Prometheus instances and uploads data to object storage.
- Store Gateway: Allows querying of stored metrics across clusters.
- Querier: Provides a unified API for running queries across multiple Prometheus deployments.
- Compactor: Optimizes and deduplicates historical data.

Comparison: Prometheus vs. Thanos

Feature | Prometheus | Thanos
Data retention | Limited (based on local storage) | Long-term storage in object stores
High availability | No built-in redundancy | HA setup with global querying
Multi-cluster support | Single-cluster focus | Multi-cluster observability
Query federation | Not supported | Supported across clusters

In short, Thanos is a must-have addition to Prometheus for organizations running multi-cluster and multi-cloud Kubernetes environments. It provides scalability, availability, and long-term storage, ensuring that monitoring data is never lost and remains accessible across distributed systems.

Log Aggregation and Observability With Grafana Loki

What is Grafana Loki?

Grafana Loki is a log aggregation system designed specifically for Kubernetes environments. Unlike traditional log management solutions, Loki does not index log content, making it highly scalable and cost-effective. It integrates seamlessly with Prometheus and Grafana, allowing users to correlate logs with metrics for better troubleshooting.

Why Use Grafana Loki in Kubernetes?

- Lightweight and efficient: Does not require full-text indexing, reducing storage and processing costs.
- Scalability: Handles high log volumes across multiple Kubernetes clusters.
- Multi-cloud ready: Can be deployed on AWS, Azure, and GCP, supporting centralized log aggregation.
- Seamless Prometheus integration: Allows correlation of logs with Prometheus metrics.
- Powerful query language (LogQL): Enables efficient filtering and analysis of logs.

How Grafana Loki Works in Kubernetes

Loki ingests logs from multiple sources, including:

- Promtail: A lightweight log agent that collects logs from Kubernetes pods.
- Fluentd/Fluent Bit: Alternative log collectors for forwarding logs to Loki.
- Grafana dashboards: Visualize logs alongside Prometheus metrics for deep observability.

Comparison: Grafana Loki vs. Traditional Log Management

Feature | Grafana Loki | Traditional log systems (ELK, Splunk)
Indexing | Indexes only labels (lightweight) | Full-text indexing (resource-intensive)
Scalability | Optimized for large-scale clusters | Requires significant storage and CPU
Cost | Lower cost due to minimal indexing | Expensive due to indexing overhead
Integration | Works natively with Prometheus and Grafana | Requires additional integrations
Querying | Uses LogQL for efficient filtering | Uses full-text search and queries

In short, Grafana Loki is a powerful yet lightweight log aggregation tool that provides scalable and cost-effective log management for Kubernetes environments. By integrating with Grafana and Prometheus, it enables full-stack observability, allowing teams to quickly diagnose issues and improve system reliability.

This diagram shows Grafana Loki collecting logs from multiple services (e.g., Service 1 and Service 2) and forwarding them to Grafana for visualization. Loki efficiently stores logs, while Grafana provides an intuitive interface for analyzing and troubleshooting logs.
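To give a feel for LogQL outside of the Grafana UI, here is a small sketch (not from the original article) that queries Loki's HTTP API directly from Python; the Loki address, namespace label, and filter string are placeholder assumptions (locally you might reach Loki via kubectl port-forward).

Python

import time

import requests

LOKI_URL = "http://localhost:3100/loki/api/v1/query_range"  # assumed address

params = {
    "query": '{namespace="payments"} |= "error"',  # LogQL: error lines from one namespace
    "start": int((time.time() - 3600) * 1e9),      # last hour, in nanoseconds
    "end": int(time.time() * 1e9),
    "limit": 50,
}

response = requests.get(LOKI_URL, params=params, timeout=10)
response.raise_for_status()

for stream in response.json().get("data", {}).get("result", []):
    labels = stream.get("stream", {})
    for timestamp, line in stream.get("values", []):
        print(labels.get("pod", "?"), timestamp, line)

The same label selectors work in Grafana's Explore view, which is where most teams would run queries like this day to day.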
2. Policy Enforcement With OPA and Kyverno

What is OPA?

Open Policy Agent (OPA) is an open-source policy engine that provides fine-grained access control and governance for Kubernetes workloads. OPA allows users to define policies using Rego, a declarative query language, to enforce rules across Kubernetes resources.

Why Use OPA in Kubernetes?

- Fine-grained policy enforcement: Enables strict access control at all levels of the cluster.
- Dynamic admission control: Evaluates and enforces policies before resources are deployed.
- Auditability and compliance: Ensures Kubernetes configurations follow compliance frameworks.
- Integration with CI/CD pipelines: Validates Kubernetes manifests before deployment.

This diagram illustrates how OPA handles incoming user requests by evaluating security policies. Requests are either allowed or denied based on these policies. Allowed requests proceed to the Kubernetes service, ensuring policy enforcement for secure access control.

What is Kyverno?

Kyverno is a Kubernetes-native policy management tool that enforces security and governance rules using Kubernetes Custom Resource Definitions (CRDs). Unlike OPA, which requires learning Rego, Kyverno enables users to define policies using familiar Kubernetes YAML.

Why Use Kyverno in Kubernetes?

- Kubernetes-native: Uses CRDs instead of a separate policy language.
- Easy policy definition: Allows administrators to write policies using standard Kubernetes configurations.
- Mutation and validation: Can modify resource configurations dynamically.
- Simplified governance: Enforces best practices for security and compliance.

Comparison: OPA vs. Kyverno

Feature | OPA | Kyverno
Policy language | Uses Rego (custom query language) | Uses native Kubernetes YAML
Integration | Works with Kubernetes and external apps | Primarily for Kubernetes workloads
Mutation | No built-in mutation support | Supports modifying configurations
Ease of use | Requires learning Rego | Simple for Kubernetes admins

How OPA and Kyverno Work in Multi-Cloud Environments

Both OPA and Kyverno help maintain consistent policies across Kubernetes clusters deployed on different cloud platforms.

- OPA: Used in multi-cloud scenarios where policy enforcement extends beyond Kubernetes (e.g., APIs, CI/CD pipelines).
- Kyverno: Ideal for Kubernetes-only policy management across AWS, Azure, and GCP clusters.
- Global policy synchronization: Ensures that all clusters follow the same security and governance policies.

In short, both OPA and Kyverno offer robust policy enforcement for Kubernetes environments, but the right choice depends on the complexity of your governance needs. OPA is powerful for enterprise-scale policies across various systems, while Kyverno simplifies Kubernetes-native policy enforcement.
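For a sense of how a policy decision looks from the client side, here is a minimal sketch (not from the original article) that asks OPA's REST Data API to evaluate a manifest before it is applied, for example from a CI pipeline. The policy path kubernetes/admission/deny, the label requirement, and the localhost address are assumptions; in a real cluster, OPA Gatekeeper or Kyverno would intercept the request at admission time instead.

Python

import requests

# Hypothetical manifest missing a required label, evaluated against an assumed
# `kubernetes.admission` Rego package with a `deny` rule loaded into an OPA
# server listening on localhost:8181.
manifest = {
    "kind": "Deployment",
    "metadata": {"name": "payments", "labels": {}},
}

response = requests.post(
    "http://localhost:8181/v1/data/kubernetes/admission/deny",
    json={"input": {"request": {"object": manifest}}},
    timeout=5,
)
violations = response.json().get("result", [])

if violations:
    print("Policy violations:", violations)
else:
    print("Manifest allowed")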
3. Progressive Delivery With Flagger and Argo Rollouts

What is Flagger?

Flagger is a progressive delivery tool designed for automated canary deployments, blue-green deployments, and A/B testing in Kubernetes. It integrates with service meshes like Istio, Linkerd, and Consul to shift traffic between different application versions based on real-time metrics.

Why Use Flagger in Kubernetes?

- Automated canary deployments: Gradually shifts traffic to a new version based on performance.
- Traffic management: Works with service meshes to control routing dynamically.
- Automated rollbacks: Detects failures and reverts to a stable version if issues arise.
- Metrics-based decision making: Uses Prometheus, Datadog, or other observability tools to determine release stability.
- Multi-cloud ready: Can be deployed across Kubernetes clusters in AWS, Azure, and GCP.

What Are Argo Rollouts?

Argo Rollouts is a Kubernetes controller for progressive delivery strategies, including blue-green deployments, canary releases, and experimentation. It is part of the Argo ecosystem, making it a great choice for GitOps-based workflows.

Why Use Argo Rollouts in Kubernetes?

- GitOps-friendly: Integrates seamlessly with Argo CD for declarative deployments.
- Advanced traffic control: Works with ingress controllers and service meshes to shift traffic dynamically.
- Feature-rich canary deployments: Supports progressive rollouts with fine-grained control over traffic shifting.
- Automated analysis and promotion: Evaluates new versions against key performance indicators (KPIs) before full rollout.
- Multi-cloud deployment: Works across different cloud providers for global application releases.

Comparison: Flagger vs. Argo Rollouts

Feature | Flagger | Argo Rollouts
Integration | Works with service meshes (Istio, Linkerd) | Works with ingress controllers, Argo CD
Deployment strategies | Canary, blue-green, A/B testing | Canary, blue-green, experimentation
Traffic control | Uses the service mesh for traffic shifting | Uses ingress controllers and service meshes
Rollbacks | Automated rollback based on metrics | Automated rollback based on analysis
Best for | Service mesh-based progressive delivery | GitOps workflows and feature flagging

How Flagger and Argo Rollouts Work in Multi-Cloud Environments

Both tools enhance multi-cloud deployments by ensuring safe, gradual releases across Kubernetes clusters.

- Flagger: Works best in service mesh environments, allowing traffic-based gradual deployments across cloud providers.
- Argo Rollouts: Ideal for GitOps-driven pipelines, making declarative, policy-driven rollouts across multiple cloud clusters seamless.

In short, both Flagger and Argo Rollouts provide progressive delivery mechanisms to ensure safe, automated, and data-driven deployments in Kubernetes. Choosing between them depends on your infrastructure setup (service mesh vs. ingress controllers) and workflow preference (standard Kubernetes vs. GitOps).

4. Security and Monitoring With Falco, Tetragon, and Datadog Kubernetes Agent

What is Falco?

Falco is an open-source runtime security tool that detects anomalous activity in Kubernetes clusters. It leverages Linux kernel system calls to identify suspicious behaviors in real time.

Why Use Falco in Kubernetes?

- Runtime threat detection: Identifies security threats based on kernel-level events.
- Compliance enforcement: Ensures best practices by monitoring for unexpected system activity.
- Flexible rule engine: Allows users to define custom security policies.
- Multi-cloud ready: Works across Kubernetes clusters in AWS, Azure, and GCP.

This diagram demonstrates Falco's role in monitoring Kubernetes nodes for suspicious activities. When Falco detects unexpected behavior, it generates alerts for immediate action, helping ensure runtime security in Kubernetes environments.

What is Tetragon?
Tetragon is an eBPF-based security observability tool that provides deep visibility into process execution, network activity, and privilege escalations in Kubernetes.

Why Use Tetragon in Kubernetes?

- High-performance security monitoring: Uses eBPF for minimal overhead.
- Process-level observability: Tracks container execution and system interactions.
- Real-time policy enforcement: Blocks malicious activities dynamically.
- Ideal for zero-trust environments: Strengthens security posture with deep runtime insights.

What is the Datadog Kubernetes Agent?

The Datadog Kubernetes Agent is a full-stack monitoring solution that provides real-time observability across metrics, logs, and traces, integrating seamlessly with Kubernetes environments.

Why Use the Datadog Kubernetes Agent?

- Unified observability: Combines metrics, logs, and traces in a single platform.
- Security monitoring: Detects security events and integrates with compliance frameworks.
- Multi-cloud deployment: Works across AWS, Azure, and GCP clusters.
- AI-powered alerts: Uses machine learning to identify anomalies and prevent incidents.

Comparison: Falco vs. Tetragon vs. Datadog Kubernetes Agent

Feature | Falco | Tetragon | Datadog Kubernetes Agent
Monitoring focus | Runtime security alerts | Deep process-level security insights | Full-stack observability and security
Technology | Uses kernel system calls | Uses eBPF for real-time insights | Uses agent-based monitoring
Anomaly detection | Detects rule-based security events | Detects system behavior anomalies | AI-driven anomaly detection
Best for | Runtime security and compliance | Deep forensic security analysis | Comprehensive monitoring and security

How These Tools Work in Multi-Cloud Environments

- Falco: Monitors Kubernetes workloads in real time across cloud environments.
- Tetragon: Provides low-latency security insights, ideal for large-scale, multi-cloud Kubernetes deployments.
- Datadog Kubernetes Agent: Unifies security and observability for Kubernetes clusters running across AWS, Azure, and GCP.

In short, each of these tools serves a unique purpose in securing and monitoring Kubernetes workloads. Falco is great for real-time anomaly detection, Tetragon provides deep security observability, and the Datadog Kubernetes Agent offers a comprehensive monitoring solution.

5. Autoscaling With Keda

What is Keda?

Kubernetes Event-Driven Autoscaling (Keda) is an open-source autoscaler that enables Kubernetes workloads to scale based on event-driven metrics. Unlike traditional Horizontal Pod Autoscaling (HPA), which primarily relies on CPU and memory usage, Keda can scale applications based on custom metrics such as queue length, database connections, and external event sources.

Why Use Keda in Kubernetes?

- Event-driven scaling: Supports scaling based on external event sources (Kafka, RabbitMQ, Prometheus, etc.).
- Efficient resource utilization: Reduces the number of running pods when demand is low, cutting costs.
- Multi-cloud support: Works across Kubernetes clusters in AWS, Azure, and GCP.
- Works with existing HPA: Extends Kubernetes' built-in Horizontal Pod Autoscaler.
- Flexible metric sources: Can scale applications based on logs, messages, or database triggers.

How Keda Works in Kubernetes

Keda consists of two main components:

- Scaler: Monitors external event sources (e.g., Azure Service Bus, Kafka, AWS SQS) and determines when scaling is needed.
- Metrics Adapter: Passes event-based metrics to Kubernetes' HPA to trigger pod scaling.
Comparison: Keda vs. Traditional HPA

Feature | Traditional HPA | Keda
Scaling trigger | CPU and memory usage | External events (queues, messages, databases, etc.)
Event-driven | No | Yes
Custom metrics | Limited support | Extensive support via external scalers
Best for | CPU/memory-bound workloads | Event-driven applications

How Keda Works in Multi-Cloud Environments

- AWS: Scales applications based on SQS queue depth or DynamoDB load.
- Azure: Supports Azure Event Hub, Service Bus, and Functions.
- GCP: Integrates with Pub/Sub for event-driven scaling.
- Hybrid/multi-cloud: Works across cloud providers by integrating with Prometheus, RabbitMQ, and Redis.

In short, Keda is a powerful autoscaling solution that extends Kubernetes' capabilities beyond CPU- and memory-based scaling. It is particularly useful for microservices and event-driven applications, making it a key tool for optimizing workloads across multi-cloud Kubernetes environments.

This diagram represents how Keda scales Kubernetes pods dynamically based on external event sources like Kafka, RabbitMQ, or Prometheus. When an event trigger is detected, Keda scales pods in the Kubernetes cluster accordingly to handle increased demand.

6. Networking and Service Mesh With Istio and Linkerd

What is Istio?

Istio is a powerful service mesh that provides traffic management, security, and observability for microservices running in Kubernetes. It abstracts network communication between services and enhances reliability through load balancing, security policies, and tracing.

Why Use Istio in Kubernetes?

- Traffic management: Implements fine-grained control over traffic routing, including canary deployments and retries.
- Security and authentication: Enforces zero-trust security with mutual TLS (mTLS) encryption.
- Observability: Integrates with tools like Prometheus, Jaeger, and Grafana for deep monitoring.
- Multi-cloud and hybrid support: Works across Kubernetes clusters in AWS, Azure, and GCP.
- Service discovery and load balancing: Automatically discovers services and balances traffic efficiently.

This diagram illustrates how Istio controls traffic flow between services (e.g., Service A and Service B). Istio enables mTLS encryption for secure communication and offers traffic control capabilities to manage service-to-service interactions within the Kubernetes cluster.

What is Linkerd?

Linkerd is a lightweight service mesh designed to be simpler and faster than Istio while providing essential networking capabilities. It offers automatic encryption, service discovery, and observability for microservices.

Why Use Linkerd in Kubernetes?

- Lightweight and simple: Easier to deploy and maintain than Istio.
- Automatic mTLS: Provides encrypted communication between services by default.
- Low resource consumption: Requires fewer system resources than Istio.
- Native Kubernetes integration: Uses Kubernetes constructs for streamlined management.
- Reliable and fast: Optimized for performance with minimal overhead.
Comparison: Istio vs. Linkerd

Feature | Istio | Linkerd
Complexity | Higher complexity, more features | Simpler, easier to deploy
Security | Advanced security (mTLS, RBAC) | Lightweight mTLS encryption
Observability | Deep integration with tracing and monitoring tools | Basic logging and metrics support
Performance | More resource-intensive | Lightweight, optimized for speed
Best for | Large-scale enterprise deployments | Teams needing a simple service mesh

How Istio and Linkerd Work in Multi-Cloud Environments

- Istio: Ideal for enterprises running multi-cloud Kubernetes clusters with advanced security, routing, and observability needs.
- Linkerd: Suitable for lightweight service mesh deployments across hybrid cloud environments where simplicity and performance are key.

In short, both Istio and Linkerd are excellent service mesh solutions, but the choice depends on your organization's needs. Istio is best for feature-rich, enterprise-scale networking, while Linkerd is ideal for those who need a simpler, lightweight solution with strong security and observability.

7. Deployment Validation and SLO Monitoring With Keptn

What is Keptn?

Keptn is an open-source control plane that automates deployment validation, service-level objective (SLO) monitoring, and incident remediation in Kubernetes. It helps organizations ensure that applications meet predefined reliability standards before and after deployment.

Why Use Keptn in Kubernetes?

- Automated quality gates: Validates deployments against SLOs before full release.
- Continuous observability: Monitors application health using Prometheus, Dynatrace, and other tools.
- Self-healing capabilities: Detects performance degradation and triggers remediation workflows.
- Multi-cloud ready: Works across Kubernetes clusters on AWS, Azure, and GCP.
- Event-driven workflow: Uses cloud-native events to trigger automated responses.

How Keptn Works in Kubernetes

Keptn integrates with Kubernetes to provide automated deployment verification and continuous performance monitoring:

- Quality gates: Ensure that applications meet reliability thresholds before deployment.
- Service-level indicators (SLIs): Monitor key performance metrics (latency, error rate, throughput).
- SLO evaluation: Compares SLIs against predefined objectives to determine deployment success.
- Remediation actions: Trigger rollback or scaling actions if service quality degrades.

Comparison: Keptn vs. Traditional Monitoring Tools

Feature | Keptn | Traditional monitoring (e.g., Prometheus)
SLO-based validation | Yes | No
Automated rollbacks | Yes | Manual intervention required
Event-driven actions | Yes | No
Remediation workflows | Yes | No
Multi-cloud support | Yes | Yes

How Keptn Works in Multi-Cloud Environments

- AWS: Works with AWS Lambda, EKS, and CloudWatch for automated remediation.
- Azure: Integrates with Azure Monitor and AKS for SLO-driven validation.
- GCP: Supports GKE and Stackdriver for continuous monitoring.
- Hybrid cloud: Works across multiple Kubernetes clusters for unified service validation.

In short, Keptn is a game-changer for Kubernetes deployments, enabling SLO-based validation, self-healing, and continuous reliability monitoring. By automating deployment verification and incident response, Keptn ensures that applications meet performance and availability standards across multi-cloud Kubernetes environments.

Conclusion

Kubernetes observability and reliability are essential for ensuring seamless application performance across multi-cloud and hybrid cloud environments.
The tools discussed in this guide — Jaeger, Prometheus, Thanos, Grafana Loki, OPA, Kyverno, Flagger, Argo Rollouts, Falco, Tetragon, Datadog Kubernetes Agent, Keda, Istio, Linkerd, and Keptn — help organizations optimize monitoring, security, deployment automation, and autoscaling. By integrating these tools into your Kubernetes strategy, you can achieve enhanced visibility, automated policy enforcement, secure deployments, and efficient scalability, ensuring smooth operations in any cloud environment.
Mastering Kubernetes Observability: Boost Performance, Security, and Stability With Tracestore, OPA, Flagger, and Custom Metrics
By Prabhu Chinnasamy
Exploring Reactive and Proactive Observability in the Modern Monitoring Landscape
By Abeetha Bala
Secure Your Oracle Database Passwords in AWS RDS With a Password Verification Function
By Arvind Toorpu
Building Generative AI Services: An Introductory and Practical Guide

Amazon Web Services (AWS) offers a vast range of generative artificial intelligence solutions that allow developers to add advanced AI capabilities to their applications without having to worry about the underlying infrastructure. This guide highlights the creation of functional applications using Amazon Bedrock, a serverless, API-based offering that provides access to foundation models from leading suppliers, including Anthropic, Stability AI, and Amazon.

As the demand for AI-powered applications grows, developers seek easy and scalable ways to integrate generative AI into their applications. AWS provides this capability through its generative AI services, and the standout among these is Amazon Bedrock. Amazon Bedrock lets you access foundation models via API without worrying about underlying infrastructure, scaling, and model training. Through this practical guide, you will learn how to use Bedrock for a variety of generation tasks, including Q&A, summarization, image generation, conversational AI, and semantic search.

Local Environment Setup

Let's get started by setting up the AWS SDK for Python and configuring our AWS credentials.

Shell

pip install boto3
aws configure

Confirm that your account has access to the Bedrock service and the underlying foundation models via the AWS console. Once done, we can experiment with some generative AI use cases!

Intelligent Q&A With Claude v2

This application demonstrates how to create a question-and-answer assistant using Anthropic's Claude v2 model. Framing the input as a conversation allows you to instruct the assistant to give concise, on-topic answers to user questions. Such an application is especially well suited to customer service, knowledge bases, or virtual helpdesk agents. Let's take a look at a practical example of talking with Claude:

Python

import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "prompt": "Human: How can I reset my password?\n\nAssistant:",
    "max_tokens_to_sample": 200,
    "temperature": 0.7,
    "stop_sequences": ["\nHuman:"]
}

response = client.invoke_model(
    modelId="anthropic.claude-v2",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body)
)

print(response['body'].read().decode())

This prompt simulates a human question to which a knowledgeable assistant gives structured and coherent answers. A variation of this method can be used to create custom assistants that provide logically correct responses to user queries.

Summarization Using Amazon Titan

The Amazon Titan text model enables easy summarization of long texts into concise, meaningful abstractions. This greatly improves the reading experience, enhances user engagement, and reduces cognitive load for applications such as news reporting, legal documents, and research papers.

Python

body = {
    "inputText": "Cloud computing provides scalable IT resources via the internet...",
    "taskType": "summarize"
}

response = client.invoke_model(
    modelId="amazon.titan-text-lite-v1",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body)
)

print(response['body'].read().decode())

By altering the nature of the task and the source text, the same strategy can be applied to content simplification, keyword extraction, and paraphrasing.

Text-to-Image Generation Using Stability AI

Visual content is crucial to marketing, social media, and product design.
Using Stability AI's Stable Diffusion model in Bedrock, a user can generate images from text prompts, simplifying creative workflows and enabling real-time content generation features.

Python

import base64
from PIL import Image
from io import BytesIO

body = {
    "prompt": "A futuristic smart ring with a holographic display on a table",
    "cfg_scale": 10,
    "steps": 50
}

response = client.invoke_model(
    modelId="stability.stable-diffusion-xl-v0",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body)
)

image_data = json.loads(response['body'].read())
img_bytes = base64.b64decode(image_data['artifacts'][0]['base64'])
Image.open(BytesIO(img_bytes)).show()

This technique is especially well suited to user interface mockups, game asset production, or real-time visualization tools in design software.

Conversation With Claude v2

Let's expand on the Q&A example. This use case demonstrates a multi-turn conversation experience with Claude v2. The assistant maintains context and answers appropriately across conversational turns:

Python

conversation = """
Human: Help me plan a trip to Seattle.
Assistant: Sure! Business or leisure?
Human: Leisure.
Assistant: """

body = {
    "prompt": conversation,
    "max_tokens_to_sample": 200,
    "temperature": 0.5,
    "stop_sequences": ["\nHuman:"]
}

response = client.invoke_model(
    modelId="anthropic.claude-v2",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body)
)

print(response['body'].read().decode())

Multi-turn conversation is crucial for building booking agents, chatbots, or any agent that gathers information from users step by step.

Using Embeddings for Retrieval

Text embeddings are numerical representations that capture semantic meaning. Amazon Titan generates embeddings that can be stored in vector databases and used for semantic search, recommendation systems, or similarity measurement.

Python

body = {
    "inputText": "Explain zero trust architecture."
}

response = client.invoke_model(
    modelId="amazon.titan-embed-text-v1",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body)
)

embedding_vector = json.loads(response["body"].read())['embedding']
print(len(embedding_vector))

You can retrieve documents by meaning using embeddings, which greatly improves retrieval efficiency for consumer and enterprise applications.

Additional Day-to-Day Applications

By combining these usage scenarios, developers can build well-architected, production-grade applications. For example:

- A customer service system can use Claude for question-and-answer conversations, Titan to summarize content, and embeddings to search for documents.
- A design application can use Stable Diffusion to generate images based on user-defined parameters.
- A bot driven by Claude can escalate requests to a human through AWS Lambda functions.

AWS Bedrock provides out-of-the-box integration with services including Amazon Kendra (enterprise search across documents), AWS Lambda (serverless backend functionality), and Amazon API Gateway (scalable APIs) to enable full-stack generative applications.
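As a small extension (not from the original article), the sketch below shows one way to use the Titan embeddings call above for a toy semantic search: embed a handful of documents, embed the query, and rank by cosine similarity. The documents and query are made-up examples, and it reuses the Bedrock client defined earlier; in production you would typically store the vectors in a vector database rather than a Python list.

Python

import json
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed(text):
    # Reuses the Bedrock client and Titan embeddings model shown above.
    response = client.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text})
    )
    return json.loads(response["body"].read())["embedding"]

documents = [
    "Reset your password from the account settings page.",
    "Invoices are emailed on the first day of each month.",
    "Enable multi-factor authentication to secure your login.",
]
indexed = [(doc, embed(doc)) for doc in documents]

query_vector = embed("How do I change my password?")
best_doc, _ = max(indexed, key=lambda item: cosine_similarity(query_vector, item[1]))
print(best_doc)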
Conclusion

Generative AI services from AWS, especially Amazon Bedrock, provide developers with versatile, scalable tools to implement advanced AI use cases with ease. By using serverless APIs to invoke text, image, and embedding models, you can accelerate product development without managing model infrastructure. Whether building assistants, summarizers, generators, or search engines, Bedrock delivers enterprise-grade performance and simplicity.

By Srinivas Chippagiri
From Code to Customer: Building Fault-Tolerant Microservices With Observability in Mind

Microservices have become the go-to approach for building systems that need to scale efficiently and stay resilient under pressure. However, a microservices architecture comes with many potential points of failure—dozens or even hundreds of distributed components communicating over a network. To ensure your code makes it all the way to the customer without hiccups, you need to design for failure from the start. This is where fault tolerance and observability come in. By embracing Site Reliability Engineering (SRE) practices, developers can build microservices that not only survive failures but automatically detect and recover from them.

In this article, we'll explore how to build fault-tolerant backend microservices on Kubernetes, integrating resilience patterns (retries, timeouts, circuit breakers, bulkheads, rate limiting, etc.) with robust observability, monitoring, and alerting. We'll also compare these resilience strategies and provide practical examples—from Kubernetes health probes to alerting rules—to illustrate how to keep services reliable from code to customer.

Microservices, Kubernetes, and the Need for Resilience

Microservices offer benefits like scalability, elasticity, and agility, but their distributed nature means more things can go wrong. A single user request might traverse multiple services; if any one of them fails or slows down, the whole user experience suffers. In a monolithic system, failures are usually contained within a single process. But in a microservices architecture, if failures aren't handled carefully, they can ripple across services and cause broader system issues. Running microservices on Kubernetes adds another layer: while Kubernetes provides self-healing (it can restart crashed containers) and horizontal scaling, it's still up to us to ensure each service is resilient to partial failures.

Fault tolerance is a system's ability to keep running—even if something goes wrong. It might not work perfectly, but it can still function, often in a limited or degraded mode, instead of crashing completely. This is different from just high availability. Fault tolerance expects that components will fail and designs the system to handle it gracefully (ideally without downtime). In practice, building fault-tolerant microservices means anticipating failures—timeouts, errors, crashes, network issues—and coding defensively against them.

Site Reliability Engineering (SRE), introduced by Google, bridges the gap between development and operations, with reliability at its core. An SRE approach encourages designing systems with reliability as a feature. Key SRE concepts include redundancy, automation of recovery, and setting Service Level Objectives (SLOs) (e.g., "99.9% of requests succeed within 200ms"). In a microservices context, SRE practices translate to using robust resilience patterns and strong observability so that we can meet our SLOs and catch issues early. The SRE mindset of "embracing failure" means designing systems with the expectation that things will go wrong. So instead of hoping everything always works, we build in ways to handle failures gracefully when they happen.

Observability is critical here—it's the property of the system that allows us to understand its internal state from the outside. As the saying goes, "you can't fix what you can't see." We need deep visibility into our microservices in production to detect failures and anomalous behavior.
Traditional monitoring might not be enough for highly distributed systems, so modern observability focuses on collecting rich telemetry (logs, metrics, traces) and correlating it to get a clear picture of system health. Observability, combined with alerting, ensures that when something does go wrong (and eventually it will), we know about it immediately and can pinpoint the cause.

In summary, microservices demand resilience. By leveraging Kubernetes features and SRE best practices—designing for failure, implementing resilience patterns, and instrumenting everything for observability—we can build services that keep running smoothly from code to customer even when the unexpected happens.

Resilience Patterns for Fault-Tolerant Microservices

One of the best ways to build fault tolerance is to implement proven resilience design patterns in your microservices. These patterns, often used in combination, help the system handle errors gracefully and prevent failures from snowballing. Below we describe key resilience techniques—retries, circuit breakers, timeouts, rate limiting, bulkheads, and graceful degradation—and how they improve reliability.

Retries and Exponential Backoff

Retrying a failed operation is a simple but effective way to handle transient faults. If a request to a service times out or returns an error due to a temporary issue (like a brief network glitch or the service being momentarily overwhelmed), the calling service can wait a bit and try the request again. Often the second attempt will succeed if the issue was transient. This improves user experience by avoiding unnecessary errors for one-off hiccups.

However, retries must be used carefully. It's crucial to limit the number of retries and use exponential backoff (increasing the wait time between attempts) to avoid a retry storm that might flood the network or the struggling service. For example, a service might retry after 1 second, then 2 seconds, then 4 seconds, and give up after 3 attempts. Backoff (plus a bit of random jitter) prevents all clients from retrying in sync and knocking a service down harder. It's also important that the operation being retried is idempotent, meaning it can be repeated without side effects. If not, a retry might cause duplicate actions (like charging a customer twice). In practice, many frameworks support configurable retry policies. For instance, in Java you might use Resilience4j or Spring Retry to automatically retry calls, and in .NET the Polly library does similarly.

Pros: Retries are straightforward to implement and great for transient failures that are likely to succeed on a second try. They improve resiliency without much complex logic.

Cons: If not bounded, retries can pile extra load onto a service that's already having a hard time — which can just make the problem worse. They also can increase latency for the end user (while the caller keeps retrying). That's why combining retries with the next pattern (timeouts) and with circuit breakers is critical to know when to stop retrying.
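As a minimal illustration (not from the original article), here is a Python sketch that combines a per-call timeout with bounded retries and exponential backoff plus jitter, assuming an idempotent GET against a hypothetical downstream URL.

Python

import random
import time

import requests

def get_with_retries(url, attempts=3, base_delay=1.0, timeout=0.3):
    """Call a downstream service with a per-request timeout and exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=timeout)  # fail fast on slow calls
            response.raise_for_status()
            return response.json()
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == attempts:
                raise                                      # give up; let the caller fall back
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)                              # 1s, 2s, 4s ... plus jitter

# Example call against a hypothetical service:
# inventory = get_with_retries("http://inventory-service/items/42")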
Setting the right timeout value requires balancing: too long a timeout and the user waits unnecessarily; too short and you might cut off a successful response. A common approach is to base timeouts on expected SLA of the service – e.g. if Service B usually responds in 100ms, maybe set a 300ms timeout for some buffer. If the timeout is exceeded, you can treat it as an error (often triggering a retry or a fallback). Timeouts work hand-in-hand with retries: for example, timeout after 300ms, then retry (with backoff). They also feed into circuit breakers (too many timeouts might trip the breaker, as repeated slow responses indicate a problem). Pros: Timeouts prevent slow services from stalling the entire system . They release resources promptly and trigger your error-handling logic (retry or fallback) so that the user isn’t stuck waiting. This helps the system keep running smoothly. Cons: Choosing timeout values can be tricky. If a timeout is set too low, you might cancel requests that would have succeeded (causing a failure that wouldn’t have happened). On the other hand, a too-high timeout just delays the inevitable. Also, a timeout by itself doesn’t solve the problem – you need a plan for what to do next (retry, return an error, etc.). It’s one piece of the puzzle, but an essential one. Circuit Breakers When a downstream service is consistently failing or unreachable, circuit breakers step in to protect your system. A circuit breaker is analogous to an electrical circuit breaker in your house: when too many faults happen, it “trips” and opens the circuit, preventing further attempts that are likely to fail. In microservices, a circuit breaker watches the interactions between services and stops calls to a service that is either down or severely struggling . This stops the system from wasting effort on calls that are unlikely to succeed and gives the downstream service some breathing room to recover. Circuit Breaker States: A circuit breaker usually cycles through three states—Closed, Open, and Half-Open. When it’s Closed, everything’s running smoothly, and requests pass through as normal. But if failures start piling up (like hitting 50% errors or too many timeouts in a short time), the breaker flips to Open. In the Open state, calls get blocked right away—instead of repeatedly trying a service that’s struggling, you immediately return an error or a fallback, saving time and resources. After some cooldown period, the breaker enters Half-Open state, where it lets a small number of test requests through. If those calls go through successfully, it means the service has recovered, so the breaker closes and things go back to normal. If the test requests fail, the breaker goes back to Open for another wait period. This cycle prevents endless failures and tries to only resume calls when the service seems likely to be OK. Circuit Breaker Pattern States – the breaker opens when failures exceed a threshold, then periodically allows test requests in half-open state to probe if the service has recovered. Using circuit breakers, you can also provide fallback logic when the breaker is open – e.g., return cached data or a default response, so the user gets something useful (this is part of graceful degradation, discussed later). Libraries like Netflix Hystrix (now in maintenance) or newer ones like Resilience4j implement the circuit breaker pattern for you, tracking error rates and managing these state transitions automatically. 
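For illustration, here is a minimal sketch of how the thresholds and states described above might be configured with Resilience4j; the service name, the numbers, and the fallback value are hypothetical and would need tuning for real traffic.

Java

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class CircuitBreakerExample {

    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // trip when at least 50% of recent calls fail
                .slidingWindowSize(20)                           // evaluate failures over the last 20 calls
                .waitDurationInOpenState(Duration.ofSeconds(30)) // stay open for 30s before probing
                .permittedNumberOfCallsInHalfOpenState(3)        // allow 3 trial calls in half-open state
                .build();
        CircuitBreaker breaker = CircuitBreaker.of("recommendation-service", config);

        Supplier<String> guarded =
                CircuitBreaker.decorateSupplier(breaker, CircuitBreakerExample::callRecommendations);

        String result;
        try {
            result = guarded.get();
        } catch (Exception e) {
            // When the breaker is open (or the call fails), fall back to a generic response.
            result = "best-sellers";
        }
        System.out.println(result);
    }

    static String callRecommendations() {
        // Placeholder for a call to a hypothetical downstream recommendations service.
        return "personalized-recommendations";
    }
}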
Service meshes (like Istio) can even do circuit breaking at the network level for any service. Pros: Circuit breakers help stop problems from spreading by isolating the service that’s failing. They also fail fast—a user gets an error or fallback immediately instead of waiting through retries to a down service. This improves overall system responsiveness and stability. Additionally, by shedding load to an unhealthy service, that service gets breathing room to recover rather than being hammered continuously. Cons: Circuit breakers add complexity – there’s an extra component to configure and monitor. You need to tune the thresholds (how many failures trip it? how long to stay open? how many trial requests in half-open?) for your specific traffic patterns and failure modes. Misconfiguration can either make the breaker too sensitive (tripping too often) or not sensitive enough. Also, during the time a breaker is open, some functionality is unavailable (unless you have a good fallback), which might degrade the user experience – but arguably better a degraded experience than a total outage. Rate Limiting and Throttling Sometimes the “failure” you need to handle is simply too much traffic. Services can become overwhelmed not just by internal failures, but by clients suddenly sending far more requests than the system can handle (whether due to a spike or a bug or malicious usage). Rate limiting is a strategy to curb the rate of incoming requests to a service, so that it doesn’t overload and crash. Essentially, you allow only a certain number of requests per second (or minute) and either queue or reject excess requests beyond that rate. In practice, rate limiting often uses algorithms like token buckets or leaky buckets to smooth out bursts. You might implement it at an API gateway or load balancer in front of your service, or within the service code. For example, an API might enforce that each user can only make 100 requests per minute – ensuring fairness and protecting the backend from abuse. In a microservices context, you might also limit how quickly one service can bombard another. By shedding excess load, the service can focus on handling a manageable volume of requests, thus maintaining overall system stability . This is sometimes called load shedding – under extreme load, prefer to serve fewer requests quickly and correctly, rather than trying to serve everything and failing across the board. It’s better for some users to get a “please try again later” than for all users to experience a crash. Pros: Rate limiting helps keep important services from getting overloaded. It ensures graceful degradation under high load by dropping or deferring less important traffic . It also improves fairness (one noisy client can’t starve others) and can be used to enforce quotas or SLAs. Cons: If not tuned, rate limiting can inadvertently reject legitimate traffic, hurting user experience. It also requires a strategy for what to do with excess requests (do you buffer them in a queue? Simply drop them?). In distributed systems, implementing a global rate limit is hard – you might have to coordinate counters across multiple instances. There’s also the question of which requests to shed first (some systems prioritize based on importance – e.g. drop reporting or analytics calls before user-facing ones). Bulkhead Isolation The term “bulkhead” comes from ship design – compartments in a ship ensure that if one section floods, it doesn’t sink the entire vessel. 
In microservices, the Bulkhead Pattern means isolating resources for each component or functionality so that a failure in one does not incapacitate others . Practically, this could mean allocating separate thread pools or connection pools for different downstream calls, or running certain services on dedicated hardware or Kubernetes nodes. For example, imagine a service that calls two external APIs: one for payments and one for notifications. Without bulkheads, if the payment API hangs and all your service threads get stuck waiting on it, your service might exhaust its thread pool and be unable to even serve the (independent) notifications functionality. If you apply bulkheading, you could have one thread pool (or goroutine pool) for payment calls and a separate one for notification calls. Then a slowdown in payments will only exhaust the payment pool, while the notification pool is still free to process its tasks . In essence, you partition your system’s resources by concern, so that one failing part can’t exhaust everything. Another example at the infrastructure level is running multiple instances of a microservice across different nodes or availability zones. That way, if one node has an issue, not all instances of the service are wiped out. Kubernetes can help here with anti-affinity rules (to not co-locate all replicas on one node) and PodDisruptionBudgets (to ensure a minimum number of pods remain up during maintenance). Bulkhead pattern can also mean having dedicated caches or databases per service, so one service’s DB overload doesn’t slow down another. Pros: Bulkheads provide strong fault isolation . Each service or component gets its own sandbox of resources – if it fails or overloads, others continue running unaffected. This prevents cascade failures and increases overall system robustness. It’s especially useful to protect critical services from less critical ones (or from noisy neighbors in multi-tenant scenarios). Cons: The downside is resource fragmentation. If you dedicate, say, 5 threads exclusively to payments and 5 to notifications, what if payments are idle but notifications have a huge backlog? You might be under-utilizing resources that could have been used elsewhere. Bulkheading requires you to forecast and allocate capacity per segment of your system, which isn’t always straightforward. It also adds configuration complexity – separate pools, separate deployments, etc. Still, in systems where reliability is paramount, accepting some inefficiency in exchange for isolation is usually worth it. Graceful Degradation and Fallbacks Despite all these protective measures, sometimes a part of your system will fail outright or become unavailable. Graceful degradation means designing your application to degrade gracefully when that happens, rather than catastrophically crashing or returning cryptic errors. In practice, this involves implementing fallbacks – alternative code paths or default behaviors that activate when a dependency is unavailable . For example, if your personalized recommendations service is down, your e-commerce site could fall back to showing best-selling products (a generic list) instead of personal recommendations. Users still see product suggestions – not as tailored, but better than an error message or an empty section. 
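To make the thread-pool isolation described above concrete, here is a minimal sketch using plain Java executors; the pool sizes and the payment/notification calls are hypothetical placeholders.

Java

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class BulkheadExample {

    // Separate, bounded thread pools so a hang in the payment API cannot starve notifications.
    private static final ExecutorService paymentPool = Executors.newFixedThreadPool(5);
    private static final ExecutorService notificationPool = Executors.newFixedThreadPool(5);

    public static void main(String[] args) throws Exception {
        Future<String> payment = paymentPool.submit(BulkheadExample::callPaymentApi);
        Future<String> notification = notificationPool.submit(BulkheadExample::callNotificationApi);

        // Even if the payment pool is saturated by slow calls, notifications keep flowing.
        System.out.println(notification.get(2, TimeUnit.SECONDS));
        System.out.println(payment.get(2, TimeUnit.SECONDS));

        paymentPool.shutdown();
        notificationPool.shutdown();
    }

    static String callPaymentApi() {        // placeholder for a hypothetical downstream call
        return "payment-ok";
    }

    static String callNotificationApi() {   // placeholder for a hypothetical downstream call
        return "notification-ok";
    }
}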
Similarly, if an optional microservice like a user review service is slow, you might time out and simply display the page without reviews, perhaps with a note like “Reviews are unavailable at the moment.” The key is that the core functionality continues to work, and users are informed subtly that a feature is degraded rather than the entire app failing. Graceful degradation often goes hand-in-hand with the other patterns: circuit breakers trigger the fallback logic when tripped, and bulkheads ensure other parts can still function. Feature toggles can also help – you can automatically disable certain non-critical features in an outage to reduce load. Caching can also help with graceful degradation—by serving stored data when the live source isn’t available.

Pros: This pattern maintains partial functionality for the user. Rather than an all-or-nothing system, you prioritize what’s most important and ensure that remains available. It improves user experience by at least providing a meaningful response or page, even if some features are absent. It also buys you time – if users can still do the primary tasks, your team can work on the outage for the secondary service without fire-fighting a total production outage.

Cons: Implementing fallbacks increases development effort – you basically need to handle double logic (the primary path and the secondary path). Not everything has an obvious fallback; some features might just have to be offline. And if you rely too much on degraded mode, you might miss that a dependency is down (so make sure to still alert on it!). But overall, the cons are minimal compared to the reliability gained by thoughtful graceful degradation.

To summarize these patterns: each one addresses a specific failure mode, and they are even more powerful when combined. For instance, you might set a timeout on a request, retry a couple of times with backoff, then if it is still failing trigger the circuit breaker, which cuts off calls and uses a fallback – all while other parts of the system remain safe thanks to bulkheads and rate limits. Next, we’ll compare these strategies side-by-side and then dive into observability and how to monitor all this in Kubernetes.

Comparing Resilience Strategies

The following comparison covers the key resilience strategies – retries, circuit breakers, timeouts, rate limiting, and bulkheads – highlighting their benefits, drawbacks, and best-fit scenarios:

Retries (with backoff)
Pros: Simple to implement; addresses transient faults effectively. Increases success rate for momentary glitches (network blips, etc.).
Cons: Can increase load if overused (risk of retry storms). Adds latency if many retries; requires idempotent operations.
Best use cases: Handling intermittent failures (e.g., brief network timeouts). Use in combination with timeouts and only for operations safe to repeat.

Circuit Breakers
Pros: Prevent cascade failures by halting calls to bad components. Fail-fast improves user experience during outages (no long hangs). Allows the downstream service time to recover.
Cons: Introduces complexity (must manage states and thresholds). Misconfiguration can trigger too often or not enough. While open, some functionality is unavailable (need fallbacks).
Best use cases: Calling external or unstable services where failures can spike. Protecting critical pathways from repeated failures in dependencies. Use when you can provide a reasonable default response on failure.

Timeouts
Pros: Prevent the system from hanging on slow responses. Free resources promptly, enabling quicker recovery or retries. Essential for keeping threads from blocking indefinitely.
Cons: Choosing the right timeout is hard (too low = false fail, too high = slow). A timeout still results in an error unless paired with a retry or fallback. If misused, can cut off genuinely slow but needed responses.
Best use cases: All external calls should have a timeout (baseline best practice). Use shorter timeouts on interactive user requests, slightly longer for background tasks. Tune based on observed response times and SLOs (e.g., 95th percentile latency).

Rate Limiting
Pros: Shields services from overload by shedding excess load. Ensures fair usage (no single client can hog resources). Can maintain overall system responsiveness under high load.
Cons: Legitimate traffic might be dropped if limits are too strict. Needs coordination in distributed systems (to enforce global limits). Users may see errors or throttling responses if hitting the limit.
Best use cases: Public APIs or multi-tenant systems to enforce quotas. Protecting downstream services that have fixed capacity or costly operations. As a form of “load shedding” during traffic spikes (serve most users normally, shed the rest).

Bulkheads
Pros: Strong fault isolation: one service’s failure can’t take down others. Preserves resources for critical functions even if others fail. Improves stability in complex systems by containing failures.
Cons: Can lead to under-utilization (static partitioning of resources). More configuration (managing separate pools, deployments, etc.). Hard to predict the optimal resource split among components.
Best use cases: Separating critical vs. non-critical workloads (e.g., dedicating capacity for core features). Any scenario where one bad actor or heavy load could starve others (thread pools per client or function). Deploying redundant instances across different nodes/zones to avoid single points of failure.

As this comparison suggests, each strategy has a role to play. Often, you will combine several of them to cover different failure scenarios. For example, a well-designed microservice might include timeouts for all outbound calls, retries to handle temporary failures, a circuit breaker to prevent cascading issues during persistent outages, bulkhead isolation to keep one part of the system from overwhelming the rest, and rate limiting on incoming requests to protect against overload. These patterns complement each other. The end goal is a service that remains responsive and correct, delivering at least a baseline service to the customer even when dependencies misbehave or the system is under stress. Now that we’ve covered how to build fault tolerance into the code and architecture, let’s turn to observability – how do we ensure we can monitor and alert on all these moving parts?
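First, though, here is a compact sketch of how several of these patterns can be composed around a single outbound call using Resilience4j's Decorators helper. The service name and the callReviewService() method are hypothetical, default configurations are used purely for brevity, and the exact composition API may differ between library versions.

Java

import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.retry.Retry;

import java.util.function.Supplier;

public class ResilientCallExample {

    public static void main(String[] args) {
        Retry retry = Retry.ofDefaults("reviews");                      // bounded retries with backoff
        CircuitBreaker breaker = CircuitBreaker.ofDefaults("reviews");  // stop calling a failing dependency
        RateLimiter limiter = RateLimiter.ofDefaults("reviews");        // cap the outbound call rate
        Bulkhead bulkhead = Bulkhead.ofDefaults("reviews");             // bound concurrent calls to this dependency

        Supplier<String> decorated = Decorators.ofSupplier(ResilientCallExample::callReviewService)
                .withRetry(retry)
                .withCircuitBreaker(breaker)
                .withRateLimiter(limiter)
                .withBulkhead(bulkhead)
                .decorate();

        String reviews;
        try {
            reviews = decorated.get();
        } catch (Exception e) {
            reviews = "Reviews are unavailable at the moment."; // graceful degradation fallback
        }
        System.out.println(reviews);
    }

    static String callReviewService() {
        // Placeholder for an HTTP call (with its own request timeout) to a hypothetical review service.
        return "5 reviews";
    }
}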
Observability and Monitoring in Practice

Building a fault-tolerant system is not just about writing resilient code; it’s also about having excellent visibility into the system’s behavior at runtime. Observability is about instrumenting the system so that you can answer the question “what’s going on inside?” by examining its outputs (logs, metrics, traces). In the context of SRE practices, observability and monitoring are what enable us to enforce reliability targets and catch issues before they impact customers. Let’s discuss how to integrate observability into your microservices and Kubernetes deployments.

Health Checks and Kubernetes Probes

One of the simplest but most effective observability mechanisms is the health check. A health check is an endpoint or function that reports whether a service is healthy (e.g., can connect to its database, has its necessary conditions met, etc.). Health checks are used by Kubernetes through liveness and readiness probes to automate self-healing and rolling updates:

Liveness Probe: Tells Kubernetes if the container is still alive and working. If the liveness check fails (for example, your app’s health endpoint is unresponsive or returns an unhealthy status), Kubernetes will assume the container is deadlocked or crashed and will restart it. This is great for automatically recovering from certain failures – your app will be restarted without human intervention, improving uptime.

Readiness Probe: This tells Kubernetes whether your application is ready to serve requests. If the probe fails, the pod isn’t shut down—instead, Kubernetes simply stops routing traffic to it by removing it from the service endpoints. The container keeps running in the background, giving it time to recover and become ready again without being restarted. This allows an app to signal “I’m not OK to serve right now” (e.g., during startup, or if it lost connection to a dependency) so that no requests are routed to it until it’s ready again. This prevents sending traffic to a broken instance.

You can implement probes as simple HTTP endpoints (like /healthz for liveness, /ready for readiness) or even just a TCP socket check. Here’s an example of how to configure liveness and readiness probes in a Kubernetes Deployment YAML:

YAML

containers:
  - name: my-service
    image: my-service:latest
    ports:
      - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz   # health endpoint for liveness
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 15
      timeoutSeconds: 5
    readinessProbe:
      httpGet:
        path: /ready     # ready endpoint for readiness
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 3

In this snippet, Kubernetes will start checking the liveness of the pod 30 seconds after it starts, hitting /healthz. If that endpoint fails too many times, Kubernetes will kill and restart the container. The readiness check begins 5 seconds after the pod starts. Until the /ready endpoint responds successfully, the pod won’t receive any traffic from the service. These probes implement automatic self-healing and load balancing control—a cornerstone of SRE’s autonomous recovery principle (systems that fix themselves).

Beyond Kubernetes-level health checks, you should also consider application-level health. Many microservices include a health check that also verifies connections to crucial dependencies (database, message queue, etc.) and perhaps the status of recent operations. This can feed into readiness status – e.g., if DB is down, mark the app as not ready.
Fail fast and stop taking requests if you know you can’t handle them properly. Metrics, Logging, and Tracing (The Three Pillars) A robust observability setup typically stands on three pillars: metrics, logs, and traces. Each provides a different view: Metrics are numeric measurements over time – things like request rates, error counts, latency percentiles, CPU usage, memory, etc. Metrics are key for monitoring trends and triggering alerts. For microservices, important metrics include HTTP request throughput, response times, error rates (e.g. number of 5xx responses), database query durations, queue lengths, and so on. SRE often focuses on the “four golden signals” of latency, traffic, errors, and saturation—for each service, you’d want to measure how long requests take, how many are coming through, how many are failing, and resource usage like CPU/memory or backlog. By instrumenting your code to record metrics (for instance, using Prometheus client libraries or OpenTelemetry metrics), you gain the ability to see how the system is performing internally. Logs are the sequence of events and messages that your services produce. They are indispensable for debugging. When something goes wrong (e.g., a circuit breaker trips or a request fails after all retries), the logs will contain the details of what happened. Aggregating logs from all microservices (using a centralized logging system like the ELK stack – Elasticsearch/Kibana – or cloud log solutions) is important so you can search across services. With microservices, a single user transaction might produce logs in multiple services – having them indexed with timestamps and maybe a trace ID (more on that next) will help reconstruct the story when investigating incidents. Distributed Tracing ties everything together by tracking the path of a request as it flows through multiple services. Each request gets a unique trace ID, and each service records spans (with timing) for its part of the work. This allows you to see, for example, that a user request to endpoint /checkout touched Service A (200ms), then called Service B (150ms of which 50ms was waiting on Service C), etc., and ultimately one component had a 500ms delay that slowed down the whole request. Tracing is extremely useful for pinpointing where in a complex chain a slowdown or error occurred. Tools like Zipkin, Jaeger, or OpenTelemetry’s tracing can be integrated into your microservices. Many frameworks will automatically propagate trace IDs (often via headers like X-Trace-ID) so that all logs and metrics can also be tagged with the trace, giving you a correlated view. Distributed tracing brings insight into cross-service interactions that metrics and logs alone might miss . Implementing these in Kubernetes: you might deploy a Prometheus server to scrape metrics from all services (each service exposes an HTTP /metrics endpoint). You’d set up a Grafana dashboard to visualize metrics and look at trends (e.g., request rate vs error rate over time). For logs, you could use Fluentd/Fluent Bit as a DaemonSet to collect container logs and send to Elasticsearch or a hosted logging service. For tracing, running a Jaeger agent or OpenTelemetry Collector on the cluster can gather spans from services and store them for analysis. Modern service meshes also often collect metrics and traces automatically for all traffic, which can be a shortcut to instrumenting every service. 
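To close the loop from telemetry to action, alerting rules turn these metrics into notifications. The following is a minimal Prometheus alerting rule sketch; the http_requests_total metric name, its status label, and the 5% threshold are assumptions that depend on how your services are instrumented.

YAML

groups:
  - name: service-slos
    rules:
      - alert: HighErrorRate
        # Assumes services expose a counter such as http_requests_total with a status label.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for 5 minutes"
          description: "More than 5% of requests are failing; check recent deploys and downstream dependencies."

With a rule like this loaded into Prometheus (and routed through Alertmanager), a sustained error-rate breach pages someone instead of waiting for a customer complaint.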
Conclusion Delivering reliable microservices from code to customer requires a blend of smart design, engineering discipline, and operational awareness. By implementing resilience patterns such as retries, circuit breakers, timeouts, rate limiting, and bulkheads, we can make our microservices robust against many types of failures. Techniques like graceful degradation ensure that even when something does break, users experience a degraded service rather than a total outage. Running these services on Kubernetes, we leverage features like liveness/readiness probes and auto-scaling for self-healing and responsiveness. At the same time, observability and SRE practices tie it all together—with thorough monitoring, logging, tracing, and alerting, we catch issues early and meet our reliability targets. As one author put it, building fault-tolerant microservices requires a combination of “redundancy, isolation, graceful degradation, monitoring, automated recovery, and rigorous testing,” all working in unison . By investing in these patterns and practices, you ensure that your microservices can withstand failures and still serve your customers. The road from code to customer is fraught with unexpected bumps—but with a fault-tolerant architecture and observability-driven operations, you can navigate it, delivering a smooth and reliable experience to users. In the end, resilience is not just a technical feature, but a customer feature: it keeps your service trustworthy and responsive, which is what ultimately keeps your users happy.

By Ravi Teja Thutari
Finding Needles in Digital Haystacks: The Distributed Tracing Revolution
Finding Needles in Digital Haystacks: The Distributed Tracing Revolution

It's 3 AM. Your phone buzzes with an alert. A critical API is responding slowly, with angry customer tweets already appearing. Your architecture spans dozens of microservices across multiple cloud providers. Where do you even begin?

Without distributed tracing, you're reduced to:

Checking individual service metrics, trying to guess which might be the culprit
Digging through thousands of log lines across multiple services
Manually correlating timestamps to guess at request paths
Hoping someone on your team remembers how everything connects

But with distributed tracing in place, you can:

See the entire request flow from frontend to database and back
Immediately identify which specific service is introducing latency
Pinpoint exact database queries, API calls, or code blocks causing the problem
Deploy a targeted fix within minutes instead of hours

As Ben Sigelman, co-creator of OpenTelemetry, puts it: "Distributed systems have become the norm, not the exception, and with that transition comes a new class of observability challenges." When your microservices architecture resembles a complex spider web, how do you track down that one frustrating bottleneck causing your customers pain?

The Three Pillars of Observability

Logs: Detailed records of discrete events
Metrics: Aggregated numerical measurements over time
Traces: End-to-end request flows across distributed systems

Charity Majors, CTO at Honeycomb, explains their relationship: "Metrics tell you something's wrong. Logs might tell you what's wrong. Traces tell you why and where it's wrong."

What Is Distributed Tracing?

Distributed tracing tracks requests as they propagate through distributed systems, creating a comprehensive picture showing:

The path taken through various services
Time spent in each component
Dependency relationships
Failure points and error propagation

Each "span" in a trace represents a unit of work in a specific service, capturing timing information, metadata, and contextual logs.

Real-World Impact: When Tracing Saves the Day

Shopify's Black Friday Victory

During Black Friday 2020, Shopify processed $2.9 billion in sales across their architecture of thousands of microservices. Jean-Michel Lemieux, former CTO, shared how distributed tracing helped them identify a database contention issue invisible in logs and metrics. The fix was deployed within minutes, avoiding potential millions in lost revenue.

Uber's Mysterious Timeouts

Uber encountered riders experiencing timeouts only in certain regions and times of day. Their traces revealed these issues occurred when requests routed through a specific API gateway with an authentication middleware component that became CPU-bound under specific conditions—a needle that would have remained hidden in their haystack without tracing.

How Tracing Fits with Metrics and Logs

The three pillars work best together in a complementary workflow: Metrics serve as your front-line defense, signaling when something's wrong. Logs provide detailed context about specific events. Traces connect the dots between services, revealing the "why" and "where."

As Frederic Branczyk, Principal Engineer at Polar Signals, explains: "Metrics tell you something is wrong. Logs help you understand what's wrong. But traces help you understand why it's wrong."
Getting Started with Distributed Tracing

Step 1: Choose Your Framework

OpenTelemetry (opentelemetry.io): The CNCF's vendor-neutral standard that's becoming the industry default
Jaeger (jaegertracing.io): A mature CNCF graduated project for end-to-end tracing

Step 2: Instrument Your Code

Modern tracing libraries provide automatic instrumentation for popular frameworks and libraries. Here's a simple example using OpenTelemetry in JavaScript:

// Initialize OpenTelemetry
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-service');

// Create a span for a critical operation
async function processOrder(orderId) {
  const span = tracer.startSpan('process-order');
  span.setAttribute('order.id', orderId);

  try {
    // Your business logic here
    await validateOrder(orderId);
    await processPayment(orderId);
    await shipOrder(orderId);

    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    span.recordException(error);
    throw error;
  } finally {
    span.end(); // Always remember to end the span!
  }
}

Step 3: Set Up Collection and Storage

Several excellent options exist to collect and visualize your traces:

Open-source: Jaeger, Zipkin, SigNoz
Commercial: Honeycomb, Datadog, New Relic
Cloud-native: AWS X-Ray, Google Cloud Trace, Azure Application Insights

Step 4: Focus on Meaningful Data

Start with critical paths and high-value transactions. Add business context through tags like customer IDs and transaction types. The OpenTelemetry Semantic Conventions provide excellent guidance on what to instrument.

Step 5: Start Small, Then Expand

Begin with a pilot project before scaling across your architecture. Many teams start by instrumenting their API gateway and one critical downstream service to demonstrate value.

Common Pitfalls to Avoid

Excessive Data Collection: Leading to high costs and noise
Poor Sampling: Missing critical issues
Inadequate Context: Not capturing enough business information
Incomplete Coverage: Missing key services or dependencies
Siloed Analysis: Failing to connect traces with metrics and logs

The Future of Distributed Tracing

Watch for these emerging trends:

AI-powered anomaly detection
Continuous profiling integration
Enhanced privacy controls
eBPF-based instrumentation
Business-centric observability

Conclusion: From Haystack to Clarity

In today's complex distributed systems, finding the root cause of performance issues can feel like searching for a needle in a haystack. Distributed tracing transforms this process by illuminating the entire request journey.

Tracing is not optional for serious distributed systems. While logs and metrics remain essential, they simply cannot provide the end-to-end visibility that modern architectures demand. Without distributed tracing, you're operating with a dangerous blind spot—seeing symptoms without understanding root causes, detecting failures without understanding their propagation paths.

End-to-end observability requires all three pillars working together:

Metrics to detect problems
Logs to understand details
Traces to connect everything and show the complete picture

As Cindy Sridharan, author of "Distributed Systems Observability," wrote: "The best time to implement tracing was when you built your first microservice. The second-best time is now."

Your future self—especially the one getting paged at 3 AM—will thank you. Don't wait for the next production crisis to start your tracing journey.

By Rishab Jolly
Secure IaC With a Shift-Left Approach
Secure IaC With a Shift-Left Approach

Imagine you're building a skyscraper—not just quickly, but with precision. You rely on blueprints to make sure every beam and every bolt is exactly where it should be. That’s what Infrastructure as Code (IaC) is for today’s cloud-native organizations—a blueprint for the cloud. As businesses race to innovate faster, IaC helps them automate and standardize how cloud resources are built. But here’s the catch: speed without security is like skipping the safety checks on that skyscraper. One misconfigured setting, an exposed secret, or a non-compliant resource can bring the whole thing down—or at least cause serious trouble in production. That’s why the shift-left approach to secure IaC matters more than ever. What Does “Shift-Left” Mean in IaC? Shifting left refers to moving security and compliance checks earlier in the development process. Rather than waiting until deployment or runtime to detect issues, teams validate security policies, compliance rules, and access controls as code is written—enabling faster feedback, reduced rework, and stronger cloud governance. For IaC, this means, Scanning Terraform templates and other configuration files for vulnerabilities and misconfigurations before they are deployed.Validating against cloud-specific best practices.Integrating policy-as-code and security tools into CI/CD pipelines. Why Secure IaC Matters? IaC has completely changed the game when it comes to managing cloud environments. It’s like having a fast-forward button for provisioning—making it quicker, more consistent, and easier to repeat across teams and projects. But while IaC helps solve a lot of the troubles around manual operations, it’s not without its own set of risks. The truth is, one small mistake—just a single misconfigured line in a Terraform script—can have massive consequences. It could unintentionally expose sensitive data, leave the door open for unauthorized access, or cause your setup to drift away from compliance standards. And because everything’s automated, those risks scale just as fast as your infrastructure. In cloud environments like IBM Cloud, where IaC tools like Terraform and Schematics automate the creation of virtual servers, networks, storage, and IAM policies, a security oversight can result in- Publicly exposed resources (e.g., Cloud Object Storage buckets or VPC subnets).Over-permissive IAM roles granting broader access than intended.Missing encryption for data at rest or in transit.Hard-coded secrets and keys within configuration files.Non-compliance with regulatory standards like GDPR, HIPAA, or ISO 27001. These risks can lead to data breaches, service disruptions, and audit failures—especially if they go unnoticed until after deployment. Secure IaC ensures that security and compliance are not afterthoughts but are baked into the development process. It enables: Early detection of mis-configurations and policy violations.Automated remediation before deployment.Audit-ready infrastructure, with traceable and versioned security policies.Shift-left security, empowering developers to code safely without slowing down innovation. When done right, Secure IaC acts as a first line of defense, helping teams deploy confidently while reducing the cost and impact of security fixes later in the lifecycle. Components of Secure IaC Framework The Secure IaC Framework is structured into layered components that guide organizations in embedding security throughout the IaC lifecycle. 
Building Blocks of IaC (Core foundation for all other layers)—These are the fundamental practices required to enable any Infrastructure as Code approach:

Use declarative configuration (e.g., Terraform, YAML, JSON).
Embrace version control (e.g., Git) for all infrastructure code.
Define idempotent and modular code for reusable infrastructure.
Enable automation pipelines (CI/CD) for repeatable deployments.
Follow consistent naming conventions, tagging policies, and code linting.

Build Secure Infrastructure - Focuses on embedding secure design and architectural patterns into the infrastructure baseline:

Use secure-by-default modules (e.g., encryption, private subnets).
Establish network segmentation, IAM boundaries, and resource isolation.
Configure monitoring, logging, and default denial policies.
Choose secure providers and verified module sources.

Automate Controls - Empowers shift-left security by embedding controls into the development and delivery pipelines (see the pipeline sketch at the end of this article):

Run static code analysis (e.g., Trivy, Checkov) pre-commit and in CI.
Enforce policy-as-code using OPA or Sentinel for approvals and denials.
Integrate configuration management and IaC test frameworks (e.g., Terratest).

Detect & Respond - Supports runtime security through visibility, alerting, and remediation:

Enable drift detection tools to track deviations from IaC definitions.
Use runtime compliance monitoring (e.g., IBM Cloud SCC).
Integrate with SOAR platforms or incident playbooks.
Generate security alerts for real-time remediation and Root Cause Analysis (RCA).

Design Governance—Establishes repeatable, scalable security practices across the enterprise:

Promote immutable infrastructure for consistent and tamper-proof environments.
Use golden modules or signed templates with organizational guardrails.
Implement change management via GitOps, PR workflows, and approval gates.
Align with compliance standards (e.g., CIS, NIST, ISO 27001) and produce audit reports.

Anatomy of Secure IaC

Creating a secure IaC environment involves incorporating several best practices and tools to ensure that the infrastructure is resilient, compliant, and protected against potential threats. These practices are implemented and tracked at various phases of the IaC environment lifecycle:

The design phase involves not just the IaC script design and tooling decisions, but also the design of how organizational policies are incorporated into the IaC scripts.
The development phase involves coding best practices, implementing the IaC scripts and associated policies, and the pre-commit checks that the developer can run before committing. These checks help ensure a clean code check-in and detect code smells upfront.
The build phase involves all the code security checks and policy verification. This is a quality gate in the pipeline that stops the deployment on any failures.
The deployment phase supports deployment to various environments along with their respective configurations.
The maintenance phase is also crucial, as threat detection, vulnerability detection, and monitoring play a key role.

Key Pillars of Secure IaC

Below is a list of key pillars of Secure IaC, incorporating all the essential tools and services.
These pillars align with cloud-native capabilities to enforce a secure-by-design, shift-left approach for Infrastructure as Code: Reference templates like Deployable Architectures or AWS Terraform Modules. Reusable, templatized infrastructure blueprints designed for security, compliance, and scalability.Promotes consistency across environments (dev/test/prod).Often include pre-approved Terraform templates.Managed IaC platformsllike IBM Cloud Schematics or AWS CloudFormation Enables secure execution of Terraform code in isolated workspaces.Supports: Role-Based Access Control (RBAC)Encrypted variablesApproval workflows (via GitOps or manual)Versioned infrastructure plansLifecycle resource management using IBM Cloud Projects or Azure Blueprints Logical grouping of cloud resources tied to governance and compliance requirements.Simplifies multi-environment deployments (e.g. dev, QA, prod).Integrates with IaC deployment and CI/CD for isolated, secure automation pipelines.Secrets Management Centralized secrets vault to manage: API keysCertificatesIAM credentialsProvides dynamic secrets, automatic rotation, access logging, and fine-grained access policies.Key Management Solutions (KMS/HSM) Protect sensitive data at rest or in transit Manages encryption keys with full customer control and auditability.KMS-backed encryption is critical for storage, databases, and secrets.Compliance Posture Management Provides posture management and continuous compliance monitoring.Enables: Policy-as-Code checks on IaC deploymentsCustom rules enforcementCompliance posture dashboards (CIS, NIST, GDPR)Introduce Continuous Compliance (CC) pipelines as part of the CI/CD pipelines for shift-left enforcement.CI/CD Pipelines (DevSecOps) Integrate security scans and controls into delivery pipelines using GitHub Actions, Tekton, Jenkins, or IBM Cloud Continuous DeliveryPipeline stages include: Terraform lintingStatic analysis (Checkov, tfsec)Secrets scanningCompliance policy validationChange approval gates before Schematics applyPolicy-as-Code Use tools like OPA (Open Policy Agent) policies to: Block insecure resource configurationsRequire tagging, encryption, and access policiesAutomate compliance enforcement during plan and applyIAM & Resource Access Governance Apply least privilege IAM roles for projects, and API keys.Use resource groups to scope access boundaries.Enforce fine-grained access to Secrets Manager, KMS, and Logs.Audit and Logging Integrate with Cloud Logs to: Monitor infrastructure changesAudit access to secrets, projects, and deploymentsDetect anomalies in provisioning behaviorMonitoring and Drift Detection Use monitoring tools like IBM Instana, Drift Detection, or custom Terraform state validation to: Continuously monitor deployed infrastructureCompare live state to defined IaCRemediate unauthorized changes Checklist: Secure IaC 1. Code Validation and Static Analysis Integrate static analysis tools (e.g., Checkov, TFSec) into your development workflow. Scan Terraform templates for misconfigurations and security vulnerabilities. Ensure compliance with best practices and CIS benchmarks. 2. Policy-as-Code Enforcement Define security policies using Open Policy Agent (OPA) or other equivalent tools. Enforce policies during the CI/CD pipeline to prevent non-compliant deployments. Regularly update and audit policies to adapt to evolving security requirements. 3. Secrets and Credential Management Store sensitive information in Secrets Manager. Avoid hardcoding secrets in IaC templates. 
Implement automated secret rotation and access controls. 4. Immutable Infrastructure and Version Control Maintain all IaC templates in a version-controlled repository (e.g., Git). Implement pull request workflows with mandatory code reviews. Tag and document releases for traceability and rollback capabilities. 5. CI/CD Integration with Security Gates Incorporate security scans and compliance checks into the CI/CD pipeline. Set up approval gates to halt deployments on policy violations. Automate testing and validation of IaC changes before deployment. 6. Secure Execution Environment Utilize IBM Cloud Schematics or AWS Cloud Formation or any equivalent tool for executing Terraform templates in isolated environments. Restrict access to execution environments using IAM roles and policies. Monitor and log all execution activities for auditing purposes. 7. Drift Detection and Continuous Monitoring Implement tools to detect configuration drift between deployed resources and IaC templates. Regularly scan deployed resources for compliance. Set up alerts for unauthorized changes or policy violations. Benefits of Shift-Left Secure IaC Here are the key benefits of adopting Shift-Left Secure IaC, tailored for cloud-native teams focused on automation, compliance, and developer enablement: Early Risk Detection and RemediationFaster, More Secure DeploymentsAutomated Compliance EnforcementReduced Human Error and Configuration DriftImproved Developer ExperienceEnhanced Auditability and TraceabilityReduced Cost of Security FixesStronger Governance with IAM and RBACContinuous Posture Assurance Conclusion Adopting a shift-left approach to secure IaC in cloud platforms isn’t just about preventing mis-configurations—it’s about building smarter from the start. When security is treated as a core part of the development process rather than an afterthought, teams can move faster with fewer surprises down the line. With cloud services like Schematics, Projects, Secrets Manager, Key Management, Cloud Formation, and Azure Blueprints, organizations have all the tools they need to catch issues early, stay compliant, and automate guardrails. However, the true benefit extends beyond security—it establishes the foundation for platform engineering. By baking secure, reusable infrastructure patterns into internal developer platforms, teams create a friction-less, self-service experience that helps developers ship faster without compromising governance.
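As a concrete illustration of the CI/CD security gate described in the checklist above, here is a minimal pipeline sketch. It assumes GitHub Actions, a terraform/ directory, and the Checkov CLI; the workflow name, paths, and tool choice are placeholders you would adapt to your own pipeline and scanners.

YAML

# .github/workflows/iac-scan.yml
name: iac-security-scan
on:
  pull_request:
    paths:
      - "terraform/**"
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform format check
        run: terraform fmt -check -recursive terraform/
      - name: Static analysis with Checkov
        run: |
          pip3 install checkov
          checkov -d terraform/ --quiet   # non-zero exit fails the job on policy violations

A failing scan blocks the pull request, which is exactly the kind of approval gate the checklist calls for before any Terraform plan is applied.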

By Josephine Eskaline Joyce DZone Core CORE
OTel Me Why: The Case for OpenTelemetry Beyond the Shine
OTel Me Why: The Case for OpenTelemetry Beyond the Shine

My blog on pricing from the other day caught the attention of the folks over at MetricFire, and we struck up a conversation about some of the ideas, ideals, and challenges swirling around monitoring, observability, and its place in the broader IT landscape. At one point, JJ, the lead engineer, asked, “You blogged about gearing up to get a certification in Open Telemetry. What is it about OTel that has you so excited?” I gave a quick answer, but JJ’s question got me thinking, and I wanted to put some of those ideas down here. OTel Is the Best Thing Since… Let me start by answering JJ’s question directly: I find Open Telemetry exciting because it’s the biggest change in the way monitoring and observability are done since Traces (which came out around 2000, but wasn’t widely used until 2010-ish). And Traces were the biggest change since… ever. Let me explain. See this picture? This was what it was like to use monitoring to understand your environment back when I started almost 30 years ago. What we wanted was to know what was happening in that boat. But that was never an option. We could scrape metrics together from network and OS commands, we could build some scripts and db queries that gave me a little bit more insight. We could collect and (with a lot of work) aggregate log messages together to spot trends across multiple systems. All of that would give us an idea of how the infrastructure was running, and infer the things that might be happening topside. But we never really knew. Tracing changed all that. All of a sudden we could get hard data (and get it in real time) about what users were doing, and what was happening in the application when they did it. It was a complete sea change (pun intended) for how we worked and what we monitored. Even so, tracing didn’t remove the need for metrics and logs. And famous (or infamous) “three pillars” of observability. Recently I started working through the book “Learning OpenTelemetry” and one of the comments that struck me was that these aren’t “three pillars” in the sense that they don’t combine to hold up a unified whole. Authors Ted Young and Austin Parker re-framed the combination of Metrics, Logs, and Traces as “The three browser tabs of observability” because many tools put the effort back on the user to to flip between screens and put it all together by sight. On the other hand, OTel outputs is able to present all three streams of data as a single “braid.” From Learning OpenTelemetry, by Ted Young and Austin Parker, Copyright © 2024. Published by O’Reilly Media, Inc. Used with permission. It should be noted that despite OTel’s ability to combine and correlate this information, the authors of the book point out later that many tools still lack the ability to present it that way. Despite it being a work in progress (but what, in the world of IT, isn’t?), I still feel that OTel has already proven its potential to change the face of monitoring and observability. OTel is the Esperanto of Monitoring Almost every vendor will jump at the chance to get you to send all your data to them. They insist that theirs is the One True Observability Tool. In fact, let’s get this out in the open: There simply isn’t a singular “best” monitoring tool out there any more than there’s one singular “best” programming language, or car model, or pizza style.* There isn’t a single tool which will cover 100% of your needs in every single use case. 
And for the larger tools, even the use cases that aren’t part of their absolute sweet spot are going to cost you (in terms of hours or dollars) to get right. So you’re going to have multiple tools. It goes without saying (or at least it should) that you’re not going to ship a full copy of all your data to multiple vendors. Therefore a big part of your work as a monitoring engineer (or team of engineers) is to map your telemetry to the use cases they support, and thus to the tools you need to employ in those use cases. That’s not actually a hard problem. Sure, it’s complex, but once you have the mapping, making it happen is relatively easy. But, as I like to say, it’s not the cost to buy the puppy that is the problem, it’s the cost to keep feeding it. Because the tools you have today are going to change down the road. That’s when things get CRAZY hard. You have to hope things are documented well enough to understand all those telemetry-to-use-case mappings. (Narrator: they will not, in fact, have it documented well enough) Then you have to also hope your instrumentation is documented and understood well enough to know how to de-couple tool x and instrument tool y such that you maintain the same capabilities. (Narrator: this is not how it will go down.) But OTel solves both the “buying the puppy” and the “feeding the puppy” problem. My friend Matt Macdonald-Wallace (Solutions Architect at Grafana) put it like this: OTEL does solve a lot of the problems around ‘Oh great! now we’re trapped with vendor x and it’s going to cost us millions to refactor all this code’ as opposed to ‘Oh, we’re switching vendors? Cool, let me just update my endpoint…’ Not only that, but OTel's ability to create pipelines (for those who are not up to speed on that concept, it’s the ability to identify, filter, sample, and transform a stream of data before sending it to a specific destination) means you can send the same data stream to multiple locations selectively. Meaning your security team can get their raw unfiltered syslog while it’s still on-premises. Some of the data – traces, logs, and/or metrics – can go to one or more vendors. Which is why I say: OTel is the esperanto of observability. OTel’s Secret Sauce Isn’t OTLP … it’s standardization. Before I explain why the real benefit to Otel is not OTLP, I should take a second to explain what OTLP is: If you look up “What is Open Telemetry Line Protocol?” you’ll probably find some variation of “... a set of standards, rules, and/or conventions that specify how OTel elements send data from the thing that created it to a destination.” This is technically true, but also not very helpful. Functionally, OTLP is the magic box that takes metrics, logs, or traces and sends them where they need to go. It’s not as low level as, say, TCP, but in terms of how it changes a monitoring engineer’s day, it may as well be. We don’t use OTLP so much as we indicate it should be used. Just to be clear, OTLP is amazingly cool and important. It’s just not (in my opinion) AS important as some other aspects. No, there are (at least) two things that, in my opinion, make OTel such an evolutionary shift in monitoring: Collectors First, it standardizes the model of having a 3-tier, collector (not agent) in the middle, architecture. For us old-timers in the monitoring space, the idea of a collector is nothing new. In the bygone era of everything-on-prem, you couldn’t get away with a thousand (or even a hundred) agents all talking to some remote destination. 
The shift to cloud architecture changed all that, but it’s still not the best idea. Having a single (or small number) of load-balanced systems that take all the data from multiple targets—with the added benefit of being able to then process that data, filtering, sampling, combining, etc—before sending it forward is not just A Good Idea™, it can have a direct impact on your bottom line by only sending the data you WANT (and in the form you want it) out the egress port that racks up such a big part of your monthly bill. Semantics Look, I’ll be the first to tell you that I’m not the world’s best developer. So the issue of semantic terminology doesn’t usually keep me up at night. What DOES keep me up is the inability to get at a piece of data that I know should be there, but isn’t. What I mean is that it’s fairly common that the same data point—say bandwidth—is referred to by a completely different name and location on devices from two different vendors. And maybe that doesn’t seem so weird. But how about the same data point being different on two different types of devices from the same vendor? Still not weird? Let’s talk about the same data point being different on the same device type from the same vendor, but two different models? Getting weird, right (not to mention annoying). But the real kicker is when the same data point is different on two different parts of the same DEVICE. Once you’ve run down that particular rabbit whole, you have a whole different appreciation for semantic naming. If I’m looking for CPU or bandwidth or latency or whatever, I would really REALLY like for it to be called the same thing and be found in the semantically same location. OTel does this, and does it as a core aspect of the platform. I’m not the only one to have noticed it, either. Several years ago, during a meeting between the maintainers of Prometheus and OpenTelemetry, an unnamed Prometheus maintainer quipped, “You know, I’m not sure about the rest of this, but these semantic conventions are the most valuable thing I’ve seen in a while.” It may sound a bit silly, but it’s also true. From Learning OpenTelemetry, by Ted Young and Austin Parker, Copyright © 2024. Published by O’Reilly Media, Inc. Used with permission. Summarizing the Data I’ll admit that OpenTelemetry is still very (VERY) shiny for me. But I’ll also admit that the more I dig into it, the more I find to like. Hopefully this blog has given you some reasons to check out OTel, too. * OK, I lied. 1) Perl 2) The 1967 Ford Mustang 390 GT/A and 3) deep dish from Tel Aviv Kosher Pizza in Chicago

By Leon Adato
It Costs That Much Because Observability Takes Hours
It Costs That Much Because Observability Takes Hours

Today’s blog title is inspired by this song, "It Costs That Much." My daughter started singing it regularly after she opened her bakery. Read on for details on that story, and how it relates to observability. I thought of it, and my daughter’s reasons for singing it, after a few responses to my recent blog post, "Observability Expenses: When ‘Pennies on the Dollar’ Add Up Quickly." It touched a nerve, which was nice to see. This is an important, nuanced, and complex conversation. I believe that getting folks involved in this conversation is better for everyone. Some of the thoughtful responses included this one and this one. But apparently for some folks, it touched the WRONG nerve. People misunderstood the point I was trying to make. Some people (and I’m not linking to those posts) took it as a reason to bash vendors for charging anything at all. To posit that all software (and especially all monitoring software) should be free. I wasn’t saying that monitoring tools should be free. I wasn’t even saying that vendors are charging too much. I was saying that if there’s no plan for the observability data we collect, ANY cost seems like too much. The problem is in our lack of clear goals, not the cost of services. If there’s no plan, you can’t manage (or even estimate) the costs. Let me explain by going back to the title of the blog, and the song that inspired it. My daughter tends to sing the catchy little ditty after one of the (unfortunately many) calls with a prospective customer, who tells her – with all the confidence of someone who’s baking expertise extends only as far as opening a box of Duncan Hines cake mix – that her cakes are way too expensive, because, and I quote: “anyone could make them.” To her credit, my daughter remains professional (and calm, which is more than I would have managed). “Free” Isn’t (Always) the Right Answer Monitoring and observability are ferociously difficult to get right, and the people who devote their time (and sometimes careers) to making it work – to say nothing about making it better – deserve all of our respect. They also deserve to be paid for their efforts. Like those folks who call my daughter’s bakery, you know who typically thinks they shouldn’t have to pay for observability? People who have only the most tenuous idea of what goes into it and how it works. For folks at the forefront of observability, like Liz, this is nothing new. There are those who will be quick to say, “But Leon, open source tools like Grafana are free! Why doesn’t everyone else do that?” OK, but hear me out. While the phrase “Linux is only free if your time has no value” is far less true than when Jamie Zawinski wrote it back in 1998, the spirit of it still holds. A thing is only as free as the real total cost of it—inclusive of time, effort, and money. Lest you think I’m working my way toward the old “Car Triangle” conundrum. …that’s NOT actually my point. It actually IS possible to have all three. My daughter can throw together a cake, or a loaf of challah, or a batch of cookies faster, better, and cheaper than she can buy them. But of course, that’s only if you discount all the time it took her to get to this point. I won’t waste your time sharing that complete (and oft-repeated) story of the itemized $10,000 repair bill which read: Breakdown of Charges: Turning the screw: $1 Knowing which screw to turn: $9,999 The point stands—the only cheat code for the car triangle dilemma is if you place yourself in the middle of it, bridging the gaps. 
As the late-90s commercial says, “For everything else, there’s MasterCard.”

Not All Observability Data Is Equal

So, back to my original point: my recent post was not trying to say observability should be free, or that vendors are charging too much. To reiterate the main point: if there isn’t a meaningful plan for the observability data you collect, any cost is too high. The corollary is that if you DO have a meaningful plan, you will also know what it’s worth to you to collect, process, and store that data.

But nothing in our lives as IT practitioners is ever as easy as coming up with a single plan for all our data and calling it a day. Different data types have different value propositions, and some of those value profiles are themselves variable, based on outside factors like the time of year (e.g., Black Friday) or what has just happened. Cribl understood this all the way back in 2022, when they framed the idea succinctly:

“As it happens, all of this high-fidelity data is completely worthless until it comes time for forensics, where it turns out to be invaluable. You have absolutely no need for such a high level of detail until the exact moment when you do need it – and you actually can’t do anything without it.”

To expand on that idea a little: I don’t think the nice folks at Cribl are saying that lo-fi data is fine for daily ops monitoring and observability, and that you only need high-fi data when you’re in the middle of a retro. Instead, I think it’s akin to something my friend Alec Isaacson, Solutions Architect at Grafana, once said: “Murphy’s law requires that you’re probably not collecting the data you need.” Unless, of course, you are.

Cognitive (Data) Dissonance

We don’t need all the data, until we do, and in that liminal moment it goes from worthless to desperately important. We don’t want to pay anything to collect, process, or store the data – until there’s a crisis, at which point we’re willing to spare no expense.

For the poor, beleaguered monitoring engineer (and yes, I will die on the hill that this is a real job, or should be), this may feel like our professional Kobayashi Maru scenario. But it’s not. I believe there’s a way to thread this needle, and in the next blog I’m going to explore exactly that, along with how OTel fits into all of this.

By Leon Adato
Managing Encrypted Aurora DAS Over Kinesis With AWS SDK

When it comes to auditing and monitoring database activity, Amazon Aurora's Database Activity Stream (DAS) provides a secure and near real-time stream of database activity. By default, DAS encrypts all data in transit using AWS Key Management Service (KMS) with a customer-managed key (CMK) and streams the encrypted data into Amazon Kinesis, a serverless streaming data service. While this is great for compliance and security, reading and interpreting the encrypted data stream requires additional effort — particularly if you're building custom analytics, alerting, or logging solutions. This article walks you through how to read the encrypted Aurora DAS records from Kinesis using the AWS Encryption SDK.

Security and compliance are top priorities when working with sensitive data in the cloud — especially in regulated industries such as finance, healthcare, and government. Amazon Aurora's DAS is designed to help customers monitor database activity in real time, providing deep visibility into queries, connections, and data access patterns. However, this stream of data is encrypted in transit by default using a customer-managed AWS KMS key and routed through Amazon Kinesis Data Streams for consumption. While this encryption model enhances data security, it introduces a technical challenge: how do you access and process the encrypted DAS data? The payload cannot be directly interpreted, as it's wrapped in envelope encryption and protected by your KMS CMK.

Understanding the Challenge

Before discussing the solution, it's important to understand how Aurora DAS encryption works:

- Envelope encryption model: Aurora DAS uses envelope encryption, where the data is encrypted with a data key, and that data key is itself encrypted using your KMS key.
- Two encrypted components: Each record in the Kinesis stream contains the database activity events (encrypted with a data key) and the data key (encrypted with your KMS CMK).
- Kinesis data stream format: The records follow this structure:

JSON

{
  "type": "DatabaseActivityMonitoringRecords",
  "version": "1.1",
  "databaseActivityEvents": "[encrypted audit records]",
  "key": "[encrypted data key]"
}

Solution Overview: AWS Encryption SDK Approach

Aurora DAS encrypts data in multiple layers, and the AWS Encryption SDK helps you unwrap all of that encryption so you can see what’s going on. Here's why this specific approach is required:

- Handles envelope encryption: The SDK is designed to work with the envelope encryption pattern used by Aurora DAS.
- Integrates with KMS: It seamlessly integrates with your KMS keys for the initial decryption of the data key.
- Manages cryptographic operations: The SDK handles the complex cryptographic operations required for secure decryption.

The decryption process follows these key steps:

1. First, decrypt the encrypted data key using your KMS CMK.
2. Then, use that decrypted key to decrypt the database activity events.
3. Finally, decompress the decrypted data to get the readable JSON output.

Implementation

Step 1: Set Up Aurora With Database Activity Streams

Before implementing the decryption solution, ensure you have:

- An Aurora PostgreSQL or MySQL cluster with sufficient permissions
- A customer-managed KMS key for encryption
- Database Activity Streams enabled on your Aurora cluster

When you turn on DAS, AWS sets up a Kinesis stream called aws-rds-das-[cluster-resource-id] that receives the encrypted data.
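If you haven't enabled the stream yet, here is a minimal sketch of turning it on with boto3; the cluster ARN, region, and KMS key ID below are placeholders, and the same operation is also available from the console or CLI:

Python

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Start a Database Activity Stream on an existing Aurora cluster.
# ResourceArn and KmsKeyId are placeholders -- substitute your own values.
response = rds.start_activity_stream(
    ResourceArn="arn:aws:rds:us-east-1:123456789012:cluster:my-aurora-cluster",
    Mode="async",                # 'sync' or 'async'
    KmsKeyId="your-cmk-key-id",  # the customer-managed KMS key
    ApplyImmediately=True,
)

# The response includes the name of the Kinesis stream DAS writes to,
# which should look like aws-rds-das-<cluster-resource-id>.
print(response["KinesisStreamName"])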
Step 2: Prepare the AWS Encryption SDK Environment

For decrypting DAS events, your processing application (typically a Lambda function) needs the AWS Encryption SDK. This SDK is not included in standard AWS runtimes and must be added separately.

Why this matters: The AWS Encryption SDK provides specialized cryptographic algorithms and protocols designed specifically for the envelope encryption patterns used by AWS services like DAS.

The most efficient approach is to create a Lambda Layer containing:

- aws_encryption_sdk: Required for the envelope decryption process
- boto3: Needed for AWS service interactions, particularly with KMS

Step 3: Implement the Decryption Logic

Here’s a Lambda function example that handles decrypting DAS events. Each part of the decryption process is thoroughly documented with comments in the code:

Python

import base64
import json
import zlib

import boto3
import aws_encryption_sdk
from aws_encryption_sdk import CommitmentPolicy
from aws_encryption_sdk.internal.crypto import WrappingKey
from aws_encryption_sdk.key_providers.raw import RawMasterKeyProvider
from aws_encryption_sdk.identifiers import WrappingAlgorithm, EncryptionKeyType

# Configuration - update these values
REGION_NAME = 'your-region'               # Change to your region
RESOURCE_ID = 'your cluster resource ID'  # Change to your RDS resource ID

# Initialize encryption client with appropriate commitment policy
# This is required for proper operation with the AWS Encryption SDK
enc_client = aws_encryption_sdk.EncryptionSDKClient(
    commitment_policy=CommitmentPolicy.FORBID_ENCRYPT_ALLOW_DECRYPT
)


# Custom key provider class for decryption
# This class is necessary to use the raw data key from KMS with the Encryption SDK
class MyRawMasterKeyProvider(RawMasterKeyProvider):
    provider_id = "BC"

    def __new__(cls, *args, **kwargs):
        obj = super(RawMasterKeyProvider, cls).__new__(cls)
        return obj

    def __init__(self, plain_key):
        RawMasterKeyProvider.__init__(self)
        # Configure the wrapping key with proper algorithm for DAS decryption
        self.wrapping_key = WrappingKey(
            wrapping_algorithm=WrappingAlgorithm.AES_256_GCM_IV12_TAG16_NO_PADDING,
            wrapping_key=plain_key,
            wrapping_key_type=EncryptionKeyType.SYMMETRIC
        )

    def _get_raw_key(self, key_id):
        # Return the wrapping key when the Encryption SDK requests it
        return self.wrapping_key


# First decryption step: use the data key to decrypt the payload
def decrypt_payload(payload, data_key):
    # Create a key provider using our decrypted data key
    my_key_provider = MyRawMasterKeyProvider(data_key)
    my_key_provider.add_master_key("DataKey")
    # Decrypt the payload using the AWS Encryption SDK
    decrypted_plaintext, header = enc_client.decrypt(
        source=payload,
        materials_manager=aws_encryption_sdk.materials_managers.default.DefaultCryptoMaterialsManager(
            master_key_provider=my_key_provider)
    )
    return decrypted_plaintext


# Second step: decompress the decrypted data
# DAS events are compressed before encryption to save bandwidth
def decrypt_decompress(payload, key):
    decrypted = decrypt_payload(payload, key)
    # Use zlib with specific window bits for proper decompression
    return zlib.decompress(decrypted, zlib.MAX_WBITS + 16)


# Main Lambda handler function that processes events from Kinesis
def lambda_handler(event, context):
    session = boto3.session.Session()
    kms = session.client('kms', region_name=REGION_NAME)

    for record in event['Records']:
        # Step 1: Get the base64-encoded data from Kinesis
        payload = base64.b64decode(record['kinesis']['data'])
        record_data = json.loads(payload)

        # Step 2: Extract the two encrypted components
        payload_decoded = base64.b64decode(record_data['databaseActivityEvents'])
        data_key_decoded = base64.b64decode(record_data['key'])

        # Step 3: Decrypt the data key using KMS
        # This is the first level of decryption in the envelope model
        data_key_decrypt_result = kms.decrypt(
            CiphertextBlob=data_key_decoded,
            EncryptionContext={'aws:rds:dbc-id': RESOURCE_ID}
        )
        decrypted_data_key = data_key_decrypt_result['Plaintext']

        # Step 4: Use the decrypted data key to decrypt and decompress the events
        # This is the second level of decryption in the envelope model
        decrypted_event = decrypt_decompress(payload_decoded, decrypted_data_key)

        # Step 5: Process the decrypted event
        # At this point, decrypted_event contains the plaintext JSON of database activity
        print(decrypted_event)

        # Additional processing logic would go here
        # For example, you might:
        # - Parse the JSON and extract specific fields
        # - Store events in a database for analysis
        # - Trigger alerts based on suspicious activities

    return {
        'statusCode': 200,
        'body': json.dumps('Processing Complete')
    }

Step 4: Error Handling and Performance Considerations

As you implement this solution in production, keep these key factors in mind:

Error handling:

- KMS permissions: Ensure your Lambda function has the necessary KMS permissions so it can decrypt the data successfully.
- Encryption context: The encryption context must match exactly (aws:rds:dbc-id).
- Resource ID: Make sure you're using the correct Aurora cluster resource ID — if it's off, the KMS decryption step will fail.

Performance considerations:

- Batch size: Configure appropriate Kinesis batch sizes for your Lambda.
- Timeout settings: Decryption operations may require longer timeouts.
- Memory allocation: Processing encrypted streams requires more memory.

Conclusion

Aurora's Database Activity Streams provide powerful auditing capabilities, but the default encryption presents a technical challenge for utilizing this data. By leveraging the AWS Encryption SDK and understanding the envelope encryption model, you can successfully decrypt and process these encrypted streams.

The key takeaways from this article are:

- Aurora DAS uses a two-layer envelope encryption model that requires specialized decryption.
- The AWS Encryption SDK is essential for properly handling this encryption pattern.
- The decryption process involves first decrypting the data key with KMS, then using that key to decrypt the actual events.
- Proper implementation enables you to unlock valuable database activity data for security monitoring and compliance.

By following this approach, you can build robust solutions that leverage the security benefits of encrypted Database Activity Streams while still gaining access to the valuable insights they contain.

By Shubham Kaushik
The Truth About AI and Job Loss

I keep finding myself in conversations with family and friends asking, “Is AI coming for our jobs?” Which roles are getting Thanos-snapped first? Will there still be space for junior individual contributors in organizations? And many more. With so many conflicting opinions, I felt overwhelmed and anxious, so I decided to take action instead of staying stuck in uncertainty. I began collecting historical data and relevant facts to gain a clearer understanding of the direction and impact of the current AI surge.

So, Here’s What We Know

- Microsoft reports that over 30% of the code on GitHub Copilot is now AI-generated, highlighting a shift in how software is being developed.
- Major tech companies — including Google, Meta, Amazon, and Microsoft — have implemented widespread layoffs over the past 18–24 months.
- Current generative AI models, like GPT-4 and CodeWhisperer, can reliably write functional code, particularly for standard, well-defined tasks.
- Productivity gains: Occupations in which many tasks can be performed by AI are experiencing nearly five times higher productivity growth than the sectors with the least AI adoption.
- AI systems still require a human “prompt” or input to initiate the thinking process. They do not ideate independently or possess genuine creativity — they follow patterns and statistical reasoning based on training data.
- Despite rapid progress, today’s AI is still far from achieving human-level general intelligence (AGI). It lacks contextual awareness, emotional understanding, and the ability to reason abstractly across domains without guidance or structured input.
- Job displacement and creation: The World Economic Forum's Future of Jobs Report 2025 reveals that 40% of employers expect to reduce their workforce where AI can automate tasks.
- And many more.

There’s a lot of conflicting information out there, making it difficult to form a clear picture. With so many differing opinions, it's important to ground the discussion in facts. So, let’s break it down from a data engineer’s point of view — by examining the available data, identifying patterns, and drawing insights that can help us make sense of it all.

Navigating the Noise

Let’s start with the topic that’s on everyone’s mind — layoffs. It’s the most talked-about and often the most concerning aspect of the current tech landscape. Below is a trend analysis based on layoff data collected across the tech industry.

Figure 1: Layoffs (in thousands) over time in tech industries

Although the first AI research boom began in the 1980s, the current AI surge started in the late 2010s and gained significant momentum in late 2022 with the public release of OpenAI's ChatGPT. The COVID-19 pandemic further complicated the technological landscape. Initially, there was a hiring surge to meet the demands of a rapidly digitizing world. However, by 2023, the tech industry experienced significant layoffs, with over 200,000 jobs eliminated in the first quarter alone. This shift was attributed to factors such as economic downturns, reduced consumer demand, and the integration of AI technologies.

Since then, as shown in Figure 1, layoffs have continued intermittently, driven by various factors including performance evaluations, budget constraints, and strategic restructuring. For instance, in 2025, companies like Microsoft announced plans to lay off up to 6,800 employees, accounting for less than 3% of its global workforce, as part of an initiative to streamline operations and reduce managerial layers.
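As a concrete (if simplified) illustration of the kind of trend analysis behind Figure 1, here is a minimal pandas sketch. The file name and column names are hypothetical stand-ins for whatever layoff dataset you aggregate; the real chart was built from a larger, cleaned dataset:

Python

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input: one row per layoff event, with 'date' and
# 'employees_laid_off' columns. Adjust to match your own dataset.
df = pd.read_csv("tech_layoffs.csv", parse_dates=["date"])

# Roll noisy per-event records up into a quarterly series, in thousands.
quarterly = (
    df.set_index("date")["employees_laid_off"]
      .resample("Q")
      .sum()
      .div(1_000)
)

quarterly.plot(kind="bar", title="Tech layoffs per quarter (thousands)")
plt.tight_layout()
plt.show()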
Between 2024 and early 2025, the tech industry experienced significant workforce reductions. In 2024 alone, approximately 150,000 tech employees were laid off across more than 525 companies, according to data from the US Bureau of Labor Statistics. The trend has continued into 2025, with over 22,000 layoffs reported so far this year, including a striking 16,084 job cuts in February alone, highlighting the ongoing volatility in the sector.

It really makes me think — have all these layoffs contributed to the rise in the US unemployment rate? And has the number of job openings dropped too? I think it’s worth taking a closer look at these trends.

Figure 2: Employment and unemployment counts in the US from the JOLTS database

Figure 2 illustrates employment and unemployment trends across all industries in the United States. Interestingly, the data appear relatively stable over the past few years, which raises some important questions. If layoffs are increasing, where are those workers going? And what about recent graduates who are still struggling to land their first jobs? We’ve talked about the layoffs — now let’s explore where those affected are actually going. While this may not reflect every individual experience, here’s what the available online data reveals.

After the Cuts

Well, I wondered: have tech job openings decreased as well?

Figure 3: Job openings over the years in the US

Even with all the news about layoffs, the tech job market isn’t exactly drying up. As of May 2025, there are still around 238,000 open tech positions across startups, unicorns, and big-name public companies. Just back in December 2024, more than 165,000 new tech roles were posted, bringing the total to over 434,000 active listings that month alone. And if we look at the bigger picture, the US Bureau of Labor Statistics expects an average of about 356,700 tech job openings each year from now through 2033. A lot of that is due to growth in the industry and the need to replace people leaving the workforce. So yes — while things are shifting, there’s still strong demand for tech talent, especially for those keeping up with evolving skills.

With so many open positions still out there, what’s causing the disconnect when it comes to actually finding a job?

New Wardrobe for Tech Companies

If those jobs are still out there, then it’s worth digging into the specific skills companies are actually hiring for. Recent data from LinkedIn reveals that job skill requirements have shifted by approximately 25% since 2015, and the pace of change is accelerating, with that number expected to double by 2027. In other words, companies are now looking for a broader and more updated set of skills than what may have worked for us over the past decade.

Figure 4: Skill bucket

The graph indicates that technical skills remain a top priority, with 59% of job postings emphasizing their importance. In contrast, soft skills appear to be a lower priority, mentioned in only 46% of listings, suggesting that companies still place greater value on technical expertise in their hiring criteria.

Figure 5: AI skill requirement in the US

Focusing specifically on the comparison between all tech jobs and those requiring AI skills, a clear trend emerges. As of 2025, around 19% to 25% of tech job postings now explicitly call for AI-related expertise — a noticeable jump from just a few years ago. This sharp rise reflects how deeply AI is becoming embedded across industries.
In fact, nearly one in four new tech roles now list AI skills as a core requirement — more than doubling since 2022.

Figure 6: Skill distribution in open jobs

Python remains the most sought-after programming language in AI job postings, maintaining its top position from previous years. Additionally, skills in computer science, data analysis, and cloud platforms like Amazon Web Services have seen significant increases in demand. For instance, mentions of Amazon Web Services in job postings have surged by over 1,778% compared to data from 2012 to 2014. While the overall percentage of AI-specific job postings is still a small fraction of the total, the upward trend underscores the growing importance of AI proficiency in the modern workforce.

Final Thought

I recognize that this analysis is largely centered on the tech industry, and the impact of AI can look very different across other sectors. That said, I’d like to leave you with one final thought: technology will always evolve, and the real challenge is how quickly we can evolve with it before it starts to leave us behind.

We’ve seen this play out before. In the early 2000s, when data volumes were manageable, we relied on database developers. But with the rise of IoT, the scale and complexity of data exploded, and we shifted toward data warehouse developers skilled in tools like Hadoop and Spark. Fast-forward to the 2010s and beyond, and we’ve entered the era of AI and data engineers — those who can manage the scale, variety, and velocity of data that modern systems demand.

We’ve adapted before — and we’ve done it well. But what makes this AI wave different is the pace. This time, we need to adapt faster than we ever have in the past.

By Niruta Talwekar
Observability Expenses: When ‘Pennies on the Dollar’ Add Up Quickly

I’ve specialized in monitoring and observability for 27 years now, and I’ve seen a lot of tools and techniques come and go (RMON, anyone?), and more than a few come and stay (rumors of the death of SNMP have been – and continue to be – greatly exaggerated). Lately I’ve been exploring one of the more recent improvements in the space – OpenTelemetry (which I’m abbreviating to “OTel” for the remainder of this blog). I wrote about my decision to dive into OTel recently: "What’s Got Me Interested in OpenTelemetry—And Pursuing Certification". For the most part, I’m enjoying the journey.

But there’s a problem that has existed with observability for a while now, and it’s something OTel is not helping. The title of this post hints at the issue, but I want to be more explicit. Let’s start with some comparison shopping.

Before I piss off every vendor in town, I want to be clear that these are broad, rough, high-level numbers. I’ve linked to the pricing pages if you want to check the details, and I acknowledge that what you see below isn’t necessarily indicative of the price you might actually pay after getting a quote on a real production environment.

New Relic charges 35¢ per GB for any data you send them:
- …although the pricing page doesn’t make this particularly clear

Datadog has a veritable laundry list of options, but at a high level, they charge:
- $15–$34 per host
- 60¢–$1.22 per million NetFlow records
- $1.06–$3.75 per million log records
- $1.27–$3.75 per million spans

Dynatrace’s pricing page sports a list almost as long as Datadog’s, but some key items:
- 15¢ per 100,000 metrics, plus 0.07¢ per gig per day for retention
- 2¢ per gig for logs, plus 0.07¢ per gig per day to retain them, plus 0.035¢ per gig queried
- Events have the same rate as logs
- 0.014¢ per 1,000 spans

Grafana – it must be noted – is open source and effectively gives you everything for free if you’re willing to do the heavy lifting of installing and hosting. But their pricing can be summed up as:
- $8.00 for 1k metrics (up to 1/minute)
- 50¢ per gig for logs and traces, with 30 days retention

This list is neither exhaustive nor complete. I’ve left off a lot of vendors, not because they don’t also have consumption-based pricing, but because it would just be more of the same. Even with the ones above, the details aren’t complete. Some companies not only charge for consumption (ingest), they also charge to store the data, and charge again to query the data (looking at you, New Relic). Some companies push you to pick a tier of service, and if you don’t, they’ll charge you an estimated rate based on the 99th percentile of usage for the month (looking at you, Datadog). It should surprise nobody that what appears on the pricing page isn’t even the final word. Some of these companies are, even now, looking at redefining their interpretation of the “consumption-based pricing” concept in ways that might make things even more opaque (looking at you AGAIN, New Relic).

Even with all of that said, I’m going out on a limb and stating for the record that each and every one of those price points is so low that even the word “trivial” is too big. That is, until the production workloads meet the pricing sheet. At that point, those itty-bitty numbers add up to real money, and quickly.

The Plural of Anecdote

I put this question out to some friends, asking if they had real-world sticker-shock experiences. As always, my friends did not disappoint.

“I did a detailed price comparison of New Relic with Datadog a couple years ago with Fargate as the main usage.
New Relic was significantly cheaper until you started shipping logs, and then Datadog was suddenly 30-40% cheaper even with APM. [But] their per-host cost also factors in and makes APM rather unattractive unless you’re doing something serverless. We wanted to use it on Kubernetes but it was so expensive, management refused to believe the costs with services on Fargate, so I was usually showing my numbers every 2-3 months.”
– Evelyn Osman, Head of Platform at enmacc

“All I got is the memory of the CFO’s face when he saw the bill.”
– someone who prefers to remain anonymous, even though that quote is freaking epic

And of course there’s the (now infamous, in observability circles) whodunit mystery of the $65 million Datadog bill.

The First Step Is Admitting You Have a Problem

Once upon a time (by which I mean the early 2000s), the challenge with monitoring (observability wasn’t a term we used yet) was how to identify the data we needed, then get the systems to give up that data, and then store that data in a way that made it possible (let alone efficient) to use in queries, displays, alerts, and such. That was where almost all the cost rested. The systems themselves were on-premises and, once the hardware was bought, effectively “free.” The result was that the accepted practice was to collect as much as possible and keep it forever.

And despite the change in technology, many organizations’ reasoning has remained the same. Grafana Solutions Architect Alec Isaacson points out that his conversations with customers sometimes go like this: “I collect CDM metrics from my most critical systems every 5 seconds because once, a long time ago, someone got yelled at when the system was slow and the metrics didn’t tell them why.”

Today, collecting monitoring and observability data (“telemetry”) is comparatively easy, but – both as individuals and as organizations – we haven’t changed our framing of the problem. So we continue to grab every piece of data available to us. We instrument our code with every tag and span we can think of. If there’s a log message, we ship it. Hardware metrics? Better grab those, because they’ll provide context. If there’s network telemetry (NetFlow, VPC Flow Logs, streaming telemetry), we suck that up too. But we never take the time to think about what we’re going to do with it. Ms. Osman’s experience illustrates the result:

“[They] had no idea what they were doing with monitoring […] all the instrumentation and logging was enabled, then there was lengthy retention ‘just in case.’ So they were just burning ridiculous amounts of money.”

To connect it to another bad behavior that we’ve (more or less) broken ourselves of: back in the early days of “lift and shift” (often more accurately described as “lift and shit”) to the cloud, we not only moved applications wholesale; we moved them onto the biggest systems the platform offered. Why? Because in the old on-prem context you could only ask for a server once, and therefore you asked for the biggest thing you could get in order to future-proof your investment. This decision turned out to be not only amusingly naive but horrifically expensive, and it took everyone a few years to understand how “elastic compute” worked and to retool their applications for the new paradigm.

Likewise, it’s high time we recognize and acknowledge that we cannot afford to collect every piece of telemetry data available to us, and moreover, that we don’t have a plan for that data even if money were no object.
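To put a rough number on "cannot afford," here is a hedged back-of-the-envelope calculation using the list prices quoted earlier and entirely made-up (but not unrealistic) volumes; your actual quote and usage will differ:

Python

# Hypothetical volumes for a mid-sized environment -- adjust to taste.
LOG_GB_PER_DAY = 200          # logs shipped per day, in GB
INGEST_PRICE_PER_GB = 0.35    # a flat per-GB ingest rate like the one quoted above

monthly_log_cost = LOG_GB_PER_DAY * INGEST_PRICE_PER_GB * 30
print(f"~${monthly_log_cost:,.0f}/month just to ingest logs")            # ~$2,100

HOSTS = 500
PER_HOST_PRICE = 23           # roughly the midpoint of a $15-$34 per-host range
print(f"~${HOSTS * PER_HOST_PRICE:,.0f}/month for host-based pricing")   # ~$11,500

Pennies per gigabyte, multiplied by everything, every day, stops being pennies fast.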
Admit It: Your Problem Also Has a Problem

Let me pivot to OTel for a moment. One of the key reasons – possibly THE key reason – to move to it is to remove, forever and always, the pain of vendor lock-in. This is something I explored in my last blog post, and it was echoed recently by a friend of mine:

“OTel does solve a lot of the problems around ‘Oh great! Now we’re trapped with vendor X and it’s going to cost us millions to refactor all this code,’ as opposed to ‘Oh, we’re switching vendors? Cool, let me just update my endpoint…’”
– Matt Macdonald-Wallace, Solutions Architect, Grafana Labs

To be very clear, OTel does an amazing job at solving this problem, which is incredible in its own right. BUT… there’s a downside to OTel that people don’t notice right away, if they notice it at all. And that problem makes the previous problem even worse. OTel takes all of your data (metrics, logs, traces, and the rest), collects it up, and sends it wherever you want it to go. But OTel doesn’t always do it EFFICIENTLY.

Example 1: Log Messages

Let’s take the log message below, which comes straight out of syslog. Yes, good old RFC 5424. Born in the 80s, standardized in 2009, and the undisputed “chatty Kathy” of network message protocols. I’ve seen modestly-sized networks generate upwards of 4 million syslog messages per hour. Most of it was absolutely useless drivel, mind you. But those messages had to go somewhere and be processed (or dropped) by some system along the way. It’s one of the reasons I’ve suggested a syslog and trap “filtration system” since basically forever. Nitpicking about message volume aside, there’s value in some of those messages, to some IT practitioners, some of the time. And so we have to consider (and collect) them too.

<134>1 2018-12-13T14:17:40.000Z myserver myapp 10 - [http_method="GET"; http_uri="/example"; http_version="1.1"; http_status="200"; client_addr="127.0.0.1"; http_user_agent="my.service/1.0.0"] HTTP request processed successfully

As-is, that log message is 228 bytes – barely even a drop in the bucket of telemetry you collect every minute, let alone every day. But for what I’m about to do, I want a real apples-to-apples comparison, so here’s what it would look like if I JSON-ified it:

JSON

{
  "pri": 134,
  "version": 1,
  "timestamp": "2018-12-13T14:17:40.000Z",
  "hostname": "myserver",
  "appname": "myapp",
  "procid": 10,
  "msgid": "-",
  "structuredData": {
    "http_method": "GET",
    "http_uri": "/example",
    "http_version": "1.1",
    "http_status": "200",
    "client_addr": "127.0.0.1",
    "http_user_agent": "my.service/1.0.0"
  },
  "message": "HTTP request processed successfully"
}

That bumps the payload up to 336 bytes without whitespace, or 415 bytes with. Now, for comparison, here’s a sample OTLP log message:

{
  "resource": {
    "service.name": "myapp",
    "service.instance.id": "10",
    "host.name": "myserver"
  },
  "instrumentationLibrary": {
    "name": "myapp",
    "version": "1.0.0"
  },
  "severityText": "INFO",
  "timestamp": "2018-12-13T14:17:40.000Z",
  "body": {
    "text": "HTTP request processed successfully"
  },
  "attributes": {
    "http_method": "GET",
    "http_uri": "/example",
    "http_version": "1.1",
    "http_status": "200",
    "client_addr": "127.0.0.1",
    "http_user_agent": "my.service/1.0.0"
  }
}

That (generic, minimal) message weighs in at 420 bytes without whitespace, or 520 bytes all-inclusive. It’s still tiny, but even so, the OTel version with whitespace is 25% bigger than the JSON-ified message (with whitespace), and more than twice as large as the original log message.
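If you want to sanity-check those numbers yourself, a quick back-of-the-napkin measurement looks something like the sketch below. The exact byte counts will vary with whitespace, field ordering, and how your pipeline actually serializes records, so treat them as approximate rather than authoritative:

Python

import json

raw_syslog = (
    '<134>1 2018-12-13T14:17:40.000Z myserver myapp 10 - '
    '[http_method="GET"; http_uri="/example"; http_version="1.1"; '
    'http_status="200"; client_addr="127.0.0.1"; '
    'http_user_agent="my.service/1.0.0"] HTTP request processed successfully'
)

json_version = {
    "pri": 134,
    "version": 1,
    "timestamp": "2018-12-13T14:17:40.000Z",
    "hostname": "myserver",
    "appname": "myapp",
    "procid": 10,
    "msgid": "-",
    "structuredData": {
        "http_method": "GET",
        "http_uri": "/example",
        "http_version": "1.1",
        "http_status": "200",
        "client_addr": "127.0.0.1",
        "http_user_agent": "my.service/1.0.0",
    },
    "message": "HTTP request processed successfully",
}

# Compare the raw syslog line with compact and pretty-printed JSON, in bytes.
print(len(raw_syslog.encode("utf-8")))
print(len(json.dumps(json_version, separators=(",", ":")).encode("utf-8")))
print(len(json.dumps(json_version, indent=2).encode("utf-8")))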
Once we start applying real-world data, things balloon even more. My point here is this: if OTel does that to every log message, these tiny costs add up quickly.

Example 2: Prometheus

It turns out that modern methods of metric management are just as susceptible to inflation:

- A typical Prometheus metric, formatted in JSON, is 291 bytes.
- That same metric converted to OTLP metrics format weighs in at 751 bytes.

It’s true that OTLP has a batching function that mitigates this, but that only helps with transfer over the wire. Once it arrives at the destination, many (not all, but most) vendors unbatch before storing, so it goes back to being 2.5x larger than the original message. As my buddy Josh Biggley has said, “2.5x metrics ingest better have a fucking amazing story to tell about context to justify that cost.”

It’s Not You, OTel, It’s Us (But It’s Also You)

If this all feels a little hypercritical of OTel, then please give me a chance to explain. I honestly believe that OTel is an amazing advancement, and anybody who’s serious about monitoring and observability needs to adopt it as a standard – that goes for users as well as vendors. The ability to emit the braid of logs, metrics, and traces while maintaining its context, regardless of destination, is invaluable. (But…)

OTel was designed by (and for) software engineers. It originated in that bygone era (by which I mean “2016”) when we were still more concerned about the difficulty of getting the data than the cost of moving, processing, and storing it. OTel is, by design, biased toward volume.

The joke of this section’s title notwithstanding, the problem really isn’t OTel. We really are at fault – specifically, our unhealthy relationship with telemetry. If we insist on collecting and transmitting every single data point, we have nobody to blame but ourselves for the sky-high bills we receive at the end of the month.

Does This Data Bring You Joy?

It’s easy to let your observability solution do the heavy lifting and shunt every byte of data into a unified interface. It’s easy to do if you’re a software engineer who (nominally, at least) owns the monitoring and observability solutions. It’s even easier if you’re a mere consumer of those services, an innocent bystander. Folks who fall into this category include those closely tied to a particular silo (database, storage, network, etc.); help desk and NOC teams who receive the tickets and provide support but aren’t involved in the instrumentation or the tools the instrumentation is connected to; and teams with more specialized needs that nevertheless overlap with monitoring and observability, like information security.

But let’s be honest: if you’re a security engineer, how can you justify paying twice the cost to ingest logs or metrics, versus the perfectly good standards that already exist and have served well for years? Does that mean you might be using more than one tool? Yes. But as I have pointed out time and time again, there is not (and never has been, and never will be) a one-size-fits-all solution. In most situations there’s not even a one-size-fits-MOST solution. Monitoring and observability have always been about heterogeneous implementations. The sooner you embrace that ideal, the sooner you will begin building observability ecosystems that serve your needs, your team, and your business. To that end, there’s a serious ROI discussion to be had before you go all in on OTel or any other observability solution.
<EOF> (For Now)

We’ve seen the marketplace move from per-seat (or per-interface, per-chassis, or per-CPU) pricing to a consumption model before. And we’ve also seen technologies move back (like the way cell service moved from per-minute or per-text to unlimited data with a per-month charge). I suspect we may see a similar pendulum swing with monitoring and observability at some point in the future.

But for now, we have to contend both with the prevailing pricing system as it exists today and with our own compulsion – born at a different point in the history of monitoring – to collect, transmit, and store every bit (and byte) of telemetry that passes beneath our nose.

Of course, cost isn’t the only factor. Performance, risk, and more need to be considered. But at the heart of it all is the very real need for us to start asking ourselves:

- What will I do with this data?
- Who will use it?
- How long do I need to store it?
- And of course: who is going to pay for it?

By Leon Adato
When Airflow Tasks Get Stuck in Queued: A Real-World Debugging Story

Recently, my team encountered a critical production issue in which Apache Airflow tasks were getting stuck in the "queued" state indefinitely. As someone who has worked extensively with Airflow's scheduler, I've handled my share of DAG failures, retries, and scheduler quirks, but this particular incident stood out both for its technical complexity and for the organizational coordination it demanded.

The Symptom: Tasks Stuck in Queued

It began when one of our business-critical Directed Acyclic Graphs (DAGs) failed to complete. Upon investigation, we discovered several tasks were stuck in the "queued" state — not running, failing, or retrying, just permanently queued.

First Steps: Isolating the Problem

A teammate and I immediately began our investigation with the fundamental checks:

- Examined Airflow UI logs: Nothing unusual beyond standard task submission entries
- Reviewed scheduler and worker logs: The scheduler was detecting the DAGs, but nothing was reaching the workers
- Confirmed worker health: All Celery workers showed as active and running
- Restarted both the scheduler and workers: Despite this intervention, tasks remained stubbornly queued

Deep Dive: Uncovering a Scheduler Bottleneck

We soon suspected a scheduler issue. We observed that the scheduler was queuing tasks but not dispatching them. This led us to investigate:

- Slot availability across workers
- Message queue health (RabbitMQ in our environment)
- Heartbeat communication logs

We initially hypothesized that the scheduler machine might be overloaded by its dual responsibility of scheduling tasks and parsing DAGs, so we increased min_file_process_interval to 2 minutes. While this reduced CPU utilization by limiting how frequently the scheduler parsed DAG files, it didn't resolve our core issue — tasks remained stuck in the queued state.

After further research, we discovered that our Airflow version (2.2.2) contained a known issue causing tasks to become trapped in the queued state under specific scheduler conditions. This bug was fixed in Airflow 2.6.0, with the solution documented in PR #30375. However, upgrading wasn't feasible in the short term. The migration from 2.2.2 to 2.6.0 would require extensive testing, custom plugin adjustments, and deployment pipeline modifications — none of which could be implemented quickly without disrupting other priorities.

Interim Mitigations and Configuration Optimizations

While working on the backported fix, we implemented several tactical measures to stabilize the system:

- Increased parsing_processes to 8 to parallelize DAG parsing and improve parsing time
- Increased scheduler_heartbeat_sec to 30s and min_file_process_interval to 120s (up from the default setting of 30s) to reduce scheduler load
- Implemented continuous monitoring to ensure tasks were being processed appropriately

We also deployed a temporary workaround using a script referenced in this GitHub comment, which forcibly transitions tasks from queued to running state (a simplified sketch of the idea appears below). We scheduled it via a cron job with an additional filter targeting only task instances that had been queued for more than 10 minutes. This approach provided temporary relief while we finalized our long-term solution.

However, we soon discovered limitations with the cron job. While effective for standard tasks that could eventually reach completion once moved from queued to running, it was less reliable for sensor-related tasks. After being pushed to the running state, sensor tasks would often transition to up_for_reschedule and then back to queued, becoming stuck again.
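For illustration only, here is a minimal sketch of what such an "unsticker" script can look like, assuming Airflow 2.x ORM models and the Celery executor. The actual script we ran in production is the one from the GitHub comment linked above; treat this as a rough approximation of the idea, not a drop-in replacement:

Python

# Hedged sketch: find task instances stuck in QUEUED longer than a threshold
# and push them along -- roughly what our cron-scheduled workaround did.
from datetime import datetime, timedelta, timezone

from airflow.models import TaskInstance
from airflow.utils.session import provide_session
from airflow.utils.state import State


@provide_session
def unstick_queued_tasks(threshold_minutes=10, session=None):
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=threshold_minutes)
    stuck = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == State.QUEUED)
        .filter(TaskInstance.queued_dttm < cutoff)
        .all()
    )
    for ti in stuck:
        # Force the task instance out of QUEUED so the executor picks it up again.
        ti.state = State.RUNNING
        session.merge(ti)
    session.commit()
    return len(stuck)


if __name__ == "__main__":
    print(f"Advanced {unstick_queued_tasks()} stuck task instance(s)")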
This required the cron job to repeatedly advance these tasks, essentially functioning as an auxiliary scheduler. We suspect this behavior stems from inconsistencies between the scheduler's in-memory state and the actual task states in the database. This unintentionally made our cron job responsible for orchestrating part of the sensor lifecycle — clearly not a sustainable solution.

The Fix: Strategic Backporting

After evaluating our options, we decided to backport the specific fix from Airflow 2.6.0 to our existing 2.2.2 environment. This approach allowed us to implement the necessary correction without undertaking a full upgrade cycle. We created a targeted patch by cherry-picking the fix from the upstream PR and applying it to our forked version of Airflow. The patch can be viewed here: GitHub Patch.

How to Apply the Patch

Important disclaimer: The patch referenced in this article is specifically designed for Airflow deployments using the Celery executor. If you're using a different executor (such as Kubernetes, Local, or Sequential), you'll need to backport the appropriate changes for your specific executor from the original PR (#30375). The file paths and specific code changes may differ based on your executor configuration.

If you're facing similar issues, here's how to apply this patch to your Airflow 2.2.2 installation:

Download the Patch File

First, download the patch from the GitHub link provided above. You can use wget or download the patch file directly:

Shell

wget -O airflow-queued-fix.patch https://github.com/gurmeetsaran/airflow/pull/1.patch

Navigate to Your Airflow Installation Directory

This is typically where your Airflow Python package is installed.

Shell

cd /path/to/your/airflow/installation

Apply the Patch Using git

Use the git apply command to apply the patch:

Shell

git apply --check airflow-queued-fix.patch # Test if the patch can be applied cleanly
git apply airflow-queued-fix.patch # Actually apply the patch

Then:

- Restart your Airflow scheduler to apply the changes.
- Monitor task states to verify that newly queued tasks are being properly processed by the scheduler.

Note that this approach should be considered a temporary solution until you can properly upgrade to a newer Airflow version that contains the official fix.

Organizational Lessons

Resolving the technical challenge was only part of the equation. Equally important was our approach to cross-team communication and coordination:

- We engaged our platform engineering team early to validate our understanding of Airflow's architecture.
- We maintained transparent communication with stakeholders so they could manage downstream impacts.
- We meticulously documented our findings and remediation steps to facilitate future troubleshooting.
- We learned the value of designating a dedicated communicator — someone not involved in the core debugging but responsible for tracking progress, taking notes, and providing regular updates to leadership, preventing interruptions to the engineering team.

We also recognized the importance of assembling the right team — collaborative problem-solvers focused on solutions rather than just identifying issues. Establishing a safe, solution-oriented environment significantly accelerated our progress. I was grateful to have the support of a thoughtful and effective manager who helped create the space for our team to stay focused on diagnosing and resolving the issue, minimizing external distractions.
Key Takeaways

This experience reinforced several valuable lessons:

- Airflow is powerful but sensitive to scale and configuration parameters
- Comprehensive monitoring and detailed logging are indispensable diagnostic tools
- Sometimes the issue isn't a failing task but a bottleneck in the orchestration layer
- Version-specific bugs can have widespread impact — staying current helps, even when upgrades require planning
- Backporting targeted patches can be a pragmatic intermediate solution when complete upgrades aren't immediately feasible
- Effective cross-team collaboration can dramatically influence incident response outcomes

This incident reminded me that while technical expertise is fundamental, the ability to coordinate and communicate effectively across teams is equally crucial. I hope this proves helpful to others who find themselves confronting a mysteriously stuck Airflow task and wondering, "Now what?"

By Gurmeet Saran

Top Monitoring and Observability Experts


Eric D. Schabell

Director Technical Marketing & Evangelism,
Chronosphere

Eric is Chronosphere's Director of Technical Marketing & Evangelism. He's renowned in the development community as a speaker, lecturer, author, and baseball expert. His current role allows him to help the world understand the challenges organizations face with cloud native observability. He brings a unique perspective to the stage, with a professional life dedicated to sharing his deep expertise in open source technologies and organizations, and he is a CNCF Ambassador. Follow him at https://www.schabell.org.

The Latest Monitoring and Observability Topics

The AWS Playbook for Building Future-Ready Data Systems
Gone are the days when teams dumped everything into a central data warehouse and hoped analytics would magically appear.
July 9, 2025
by Junaith Haja
· 81 Views
Deploy Serverless Lambdas Confidently Using Canary
Improve AWS Lambda reliability with canary deployments, gradually release updates, minimize risk, catch bugs early, and deploy faster with confidence.
July 7, 2025
by Prajwal Nayak
· 708 Views
How We Broke the Monolith (and Kept Our Sanity): Lessons From Moving to Microservices
Moving from a monolith to microservices is messy but worth it — expect surprises, invest in automation, and focus on team culture as much as code.
July 3, 2025
by Shushyam Malige Sharanappa
· 1,665 Views · 2 Likes
DevOps Remediation Architecture for Azure CDN From Edgio
The article explains how organizations can implement the migration from the retiring Azure CDN from Edgio to Azure Front Door.
June 30, 2025
by Karthik Bojja
· 1,292 Views · 1 Like
Transform Settlement Process Using AWS Data Pipeline
Modern AWS data pipelines automate ETL for settlement files using S3, Glue, Lambda, and Step Functions, transforming data from raw to curated with full orchestration.
June 30, 2025
by Prabhakar Mishra
· 1,157 Views · 2 Likes
Serverless Machine Learning: Running AI Models Without Managing Infrastructure
The article empowers developers to deploy and serve ML models without needing to manage servers, clusters, or VMs, reducing time-to-market and cognitive overhead.
June 26, 2025
by Bhanu Sekhar Guttikonda
· 2,071 Views · 4 Likes
How to Banish Anxiety, Lower MTTR, and Stay on Budget During Incident Response
Cutting log ingestion seems thrifty — until an outage happens and suddenly you really need those signals! See how zero-cost ingestion can get rid of MTTR anxiety.
June 26, 2025
by John Vester DZone Core CORE
· 1,355 Views · 1 Like
How to Monitor and Optimize Node.js Performance
Optimize Node.js apps with tools and techniques for better performance, learn monitoring, reduce memory leaks, and improve scalability and responsiveness easily.
June 26, 2025
by Anubhav D
· 1,300 Views · 2 Likes
IBM App Connect Enterprise 13 Installation on Azure Kubernetes Service (AKS)
This article provides a step-by-step guide showing how to install IBM App Connect Enterprise 13 in an Azure Kubernetes Service cluster.
June 25, 2025
by JEAN PAUL TABJA
· 1,468 Views · 1 Like
Real-Object Detection at the Edge: AWS IoT Greengrass and YOLOv5
Real-time object detection at the edge using YOLOv5 and AWS IoT Greengrass enables fast, offline, and scalable processing in bandwidth-limited or remote environments.
June 23, 2025
by Anil Jonnalagadda
· 1,647 Views · 13 Likes
Your Kubernetes Survival Kit: Master Observability, Security, and Automation
Master Kubernetes with this guide to observability (Tracestore), security (OPA), automation (Flagger), and custom metrics. Includes Java/Node.js examples.
June 20, 2025
by Prabhu Chinnasamy
· 2,562 Views · 44 Likes
How to Achieve SOC 2 Compliance in AWS Cloud Environments
Achieving SOC 2 compliance in AWS requires planning, rigorous implementation, and ongoing commitment to security best practices.
June 17, 2025
by Chase Bolt
· 1,114 Views · 1 Like
Mastering Kubernetes Observability: Boost Performance, Security, and Stability With Tracestore, OPA, Flagger, and Custom Metrics
This guide walks you through using Tracestore, OPA, Flagger, and custom metrics to make Kubernetes more observable, with better tracing, policy control, and performance.
June 16, 2025
by Prabhu Chinnasamy
· 1,848 Views · 44 Likes
Building Generative AI Services: An Introductory and Practical Guide
Amazon Bedrock simplifies AI app development with serverless APIs, offering Q&A, summarization, and image generation using top models like Claude and Stability AI.
June 11, 2025
by Srinivas Chippagiri DZone Core CORE
· 1,990 Views · 7 Likes
Exploring Reactive and Proactive Observability in the Modern Monitoring Landscape
Embracing the shift in landscape from traditional monitoring systems to automated anomaly detection, pattern recognition, and root cause analysis.
June 10, 2025
by Abeetha Bala
· 936 Views · 1 Like
Secure Your Oracle Database Passwords in AWS RDS With a Password Verification Function
Enforce strong password policies for Oracle databases on AWS RDS using built-in or custom verification functions via the rdsadmin package.
June 10, 2025
by arvind toorpu DZone Core CORE
· 965 Views · 1 Like
From Code to Customer: Building Fault-Tolerant Microservices With Observability in Mind
Learn how to build resilient Kubernetes microservices with fault tolerance, SRE practices, and observability from code to customer.
June 9, 2025
by Ravi Teja Thutari
· 1,209 Views · 4 Likes
Secure IaC With a Shift-Left Approach
Shift-Left secure Infrastructure as Code helps catch issues early, automate compliance, and build secure, scalable cloud infrastructure.
June 6, 2025
by Josephine Eskaline Joyce DZone Core CORE
· 1,728 Views · 3 Likes
Finding Needles in Digital Haystacks: The Distributed Tracing Revolution
Use distributed tracing—the key third pillar of observability—to track requests across microservices and turn debugging from guesswork into precise insights.
June 6, 2025
by Rishab Jolly
· 1,496 Views · 2 Likes
OTel Me Why: The Case for OpenTelemetry Beyond the Shine
Someone asked me why I was so excited about OpenTelemetry. The reasons have more to do with its innovation and utility than its novelty.
June 4, 2025
by Leon Adato
· 666 Views · 3 Likes