Cloud Architecture Resources

DZone's Featured Cloud Architecture Resources

Building an AI Agent That Responds to Real-Time Events With AWS Bedrock, Kinesis, DynamoDB, and S3

By Jubin Abhishek Soni


Most recommendation systems are batch jobs. They crunch last night's data, write a recommendations table, and serve it all day. That works fine until your user watches three thriller movies in a row at 9 pm and your system is still recommending rom-coms because the batch hasn't run yet.

In this post, I'll walk through building an agent system that reacts to streaming user behavior in real time using:

- Amazon Kinesis to ingest and route events
- AWS Lambda to process, enrich, and trigger reasoning
- Amazon Bedrock as the reasoning and recommendation layer
- DynamoDB to store user profiles and the recommendation cache
- S3 for raw event archiving and model artifacts

By the end, you'll have an architecture where a user's recommendation set updates within seconds of their behavior changing.

Architecture Overview

The system has three layers:

| Layer | Services | Responsibility |
|---|---|---|
| Ingest | Kinesis Data Streams, Kinesis Firehose | Capture and fan-out user events |
| Process & Reason | Lambda, Amazon Bedrock Agent | Enrich events, build context, invoke LLM |
| Store & Serve | DynamoDB, S3 | Persist profiles, cache recs, store artifacts |

The key design decision is keeping the update path (Kinesis → Lambda → Bedrock → DynamoDB) fully async and the serving path (API → DynamoDB cache) completely decoupled. The user never waits for Bedrock to respond; they get the last cached recommendation set while a fresh one is already being computed in the background.
Event Flow

Here's what happens end to end when a user clicks on a product:

1. The app publishes a user.interaction event to Kinesis Data Streams.
2. Kinesis fans the event out to two consumers: Lambda Processor and Kinesis Firehose.
3. Firehose archives the raw event to S3 (cheap, durable, great for retraining later).
4. Lambda enriches the event with user history from DynamoDB User Profiles, then invokes the Bedrock Agent.
5. The Bedrock Agent reasons over the enriched context (recent events + profile + item catalog embeddings from S3) and returns a fresh recommendation set, which Lambda writes to DynamoDB Rec Cache.
6. The client app reads recommendations from the cache via a lightweight Lambda API, with no Bedrock latency in the hot path.

Code: Publishing Events to Kinesis

This is your app-side producer. Keep it thin: just serialize and publish. Do all enrichment downstream.

Python

    import boto3
    import json
    import uuid
    from datetime import datetime, timezone
    from typing import Optional

    kinesis = boto3.client('kinesis', region_name='us-east-1')


    def publish_interaction(user_id: str, item_id: str, event_type: str,
                            metadata: Optional[dict] = None) -> str:
        """
        Publish a user interaction event to Kinesis Data Streams.
        Partition key is user_id so all events for a user land on the same shard.
        """
        event = {
            'event_id': str(uuid.uuid4()),
            'user_id': user_id,
            'item_id': item_id,
            'event_type': event_type,  # 'click', 'purchase', 'dwell', 'skip'
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'metadata': metadata or {},  # avoids a shared mutable default argument
        }
        response = kinesis.put_record(
            StreamName='user-interactions',
            Data=json.dumps(event).encode('utf-8'),
            PartitionKey=user_id,  # consistent routing per user
        )
        return response['SequenceNumber']


    # Example call
    publish_interaction(
        user_id='u_8821',
        item_id='prod_thriller_042',
        event_type='purchase',
        metadata={'price': 14.99, 'category': 'thriller', 'session_id': 'sess_xyz'},
    )

Tip: Use user_id as the partition key so all events for a given user land on the same shard and arrive in order. This matters when Lambda is building a recency-ordered event window.
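To make the partition-key tip concrete: Kinesis maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and finding the shard whose hash key range contains it. The sketch below reproduces that mapping locally. It assumes the shards split the hash space evenly (real streams that have been resharded may not), and `hash_key` and `shard_index` are hypothetical helper names, not part of boto3.

```python
import hashlib

HASH_SPACE = 2 ** 128  # Kinesis hash keys are 128-bit unsigned integers


def hash_key(partition_key: str) -> int:
    """MD5 of the partition key, interpreted as a 128-bit integer."""
    return int(hashlib.md5(partition_key.encode('utf-8')).hexdigest(), 16)


def shard_index(partition_key: str, shard_count: int) -> int:
    """Index of the shard that owns this key, ASSUMING evenly split hash ranges."""
    return hash_key(partition_key) * shard_count // HASH_SPACE


# The same user always maps to the same shard, which is what preserves ordering.
first = shard_index('u_8821', shard_count=4)
second = shard_index('u_8821', shard_count=4)
```

The flip side of this determinism is that one unusually active user cannot be spread across shards, so a single hot key is bounded by one shard's throughput.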
Code: Lambda Processor: Enrich and Invoke Bedrock

This is the core of the pipeline. The Lambda reads from the Kinesis stream, pulls user context from DynamoDB, and invokes the Bedrock Agent with a structured prompt.

Python

    import base64
    import boto3
    import json
    import os
    from datetime import datetime, timezone
    from decimal import Decimal

    dynamodb = boto3.resource('dynamodb')
    bedrock = boto3.client('bedrock-agent-runtime', region_name='us-east-1')

    profiles_table = dynamodb.Table(os.environ['PROFILES_TABLE'])  # DynamoDB User Profiles
    rec_table = dynamodb.Table(os.environ['REC_CACHE_TABLE'])      # DynamoDB Rec Cache

    AGENT_ID = os.environ['BEDROCK_AGENT_ID']
    AGENT_ALIAS = os.environ['BEDROCK_AGENT_ALIAS']
    MAX_HISTORY = 20  # last N events to include in context


    def handler(event, context):
        for record in event['Records']:
            # Kinesis payload is base64-encoded, so decode it before parsing.
            # Floats are parsed as Decimal because the DynamoDB resource API rejects Python floats.
            raw = base64.b64decode(record['kinesis']['data'])
            payload = json.loads(raw, parse_float=Decimal)
            process_event(payload)


    def process_event(payload: dict):
        user_id = payload['user_id']
        item_id = payload['item_id']
        evt_type = payload['event_type']

        # 1. Fetch user profile + recent history from DynamoDB
        response = profiles_table.get_item(Key={'user_id': user_id})
        profile = response.get('Item', {'user_id': user_id, 'history': [], 'preferences': {}})

        # 2. Append current event and trim to window
        profile['history'].append({
            'item_id': item_id,
            'event_type': evt_type,
            'timestamp': payload['timestamp'],
            'metadata': payload.get('metadata', {}),
        })
        profile['history'] = profile['history'][-MAX_HISTORY:]

        # 3. Write enriched profile back
        profiles_table.put_item(Item=profile)

        # 4. Build prompt for Bedrock Agent
        prompt = build_personalization_prompt(profile)

        # 5. Invoke Bedrock Agent
        agent_response = bedrock.invoke_agent(
            agentId=AGENT_ID,
            agentAliasId=AGENT_ALIAS,
            sessionId=user_id,  # session per user keeps conversational context
            inputText=prompt,
        )

        # 6. Parse streaming response chunks
        recommendations = parse_agent_response(agent_response)

        # 7. Write to recommendation cache
        rec_table.put_item(Item={
            'user_id': user_id,
            'recommendations': recommendations,
            'generated_at': datetime.now(timezone.utc).isoformat(),
            'ttl': int(datetime.now(timezone.utc).timestamp()) + 3600,  # 1hr TTL
        })


    def build_personalization_prompt(profile: dict) -> str:
        history_summary = '\n'.join([
            f"- [{e['event_type'].upper()}] item={e['item_id']} "
            f"category={e['metadata'].get('category', 'unknown')}"
            for e in profile['history'][-10:]
        ])
        # default=str keeps json.dumps from failing on Decimal values read from DynamoDB
        preferences = json.dumps(profile.get('preferences', {}), default=str)
        return f"""You are a real-time personalization agent.

    User profile: {preferences}

    Recent interactions (most recent last):
    {history_summary}

    Based on this behavior, return exactly 5 personalized item recommendations as a JSON array.
    Each item must include: item_id, category, reasoning (1 sentence), confidence_score (0-1).
    Return only valid JSON. No explanation outside the JSON block."""


    def parse_agent_response(agent_response) -> list:
        full_text = ''
        for event in agent_response['completion']:
            if 'chunk' in event:
                full_text += event['chunk']['bytes'].decode('utf-8')
        try:
            # Extract the JSON array from the response
            start = full_text.index('[')
            end = full_text.rindex(']') + 1
            return json.loads(full_text[start:end], parse_float=Decimal)
        except (ValueError, json.JSONDecodeError):
            return []

Code: Serving Recommendations via Lambda API

The serving layer never touches Bedrock. It reads purely from the DynamoDB cache, so its latency is dominated by a single key lookup, which is typically single-digit milliseconds.
Python

    import boto3
    import json
    import os
    from datetime import datetime, timezone
    from decimal import Decimal

    dynamodb = boto3.resource('dynamodb')
    rec_table = dynamodb.Table(os.environ['REC_CACHE_TABLE'])

    FALLBACK_RECS = ['popular_001', 'popular_002', 'popular_003']  # cold-start fallback


    def handler(event, context):
        user_id = event['pathParameters']['userId']
        response = rec_table.get_item(Key={'user_id': user_id})
        item = response.get('Item')

        if not item:
            # Cold start: user has no history yet
            return api_response(200, {
                'user_id': user_id,
                'recommendations': FALLBACK_RECS,
                'source': 'fallback',
                'generated_at': None,
            })

        age_seconds = (
            datetime.now(timezone.utc) - datetime.fromisoformat(item['generated_at'])
        ).total_seconds()

        return api_response(200, {
            'user_id': user_id,
            'recommendations': item['recommendations'],
            'source': 'cache',
            'generated_at': item['generated_at'],
            'cache_age_sec': int(age_seconds),
        })


    def _json_default(value):
        # DynamoDB returns numbers as Decimal, which json.dumps cannot serialize on its own
        if isinstance(value, Decimal):
            return float(value)
        raise TypeError(f'Not JSON serializable: {type(value)}')


    def api_response(status: int, body: dict) -> dict:
        return {
            'statusCode': status,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*',
            },
            'body': json.dumps(body, default=_json_default),
        }

Service Comparison: Why Each AWS Service?

| Service | Why it's here | Alternative considered |
|---|---|---|
| Kinesis Data Streams | Ordered, replayable, millisecond-latency fan-out | SQS standard queues (no per-user ordering guarantee), EventBridge (higher latency) |
| Kinesis Firehose | Zero-code delivery to S3 for archiving | Writing to S3 directly in Lambda (adds failure surface) |
| Lambda | Event-driven, scales to 0, tight Kinesis integration | ECS Fargate (overkill for stateless enrichment) |
| Amazon Bedrock | Managed LLM with agent runtime, no infra to maintain | Self-hosted model on SageMaker (more control, much more ops) |
| DynamoDB | Single-digit ms reads, TTL support, scales automatically | RDS (too slow for hot path), ElastiCache (extra cost for separate store) |
| S3 | Cheap durable archive + model artifact store | DynamoDB for raw events (expensive and unnecessary) |

Things to Watch in Production

Bedrock latency is variable.
Claude Sonnet typically responds in 1-4 seconds but can spike. Since recs are written async to cache, this doesn't affect user-facing latency, but it does affect freshness. Monitor the duration of InvokeAgent calls in CloudWatch.

Kinesis shard scaling. One shard handles 1 MB/s or 1,000 records/s of writes. At 10k active users, you'll need to plan shard count carefully. Use Enhanced Fan-Out if you have multiple Lambda consumers reading the same stream.

DynamoDB TTL for cache eviction. The processor Lambda sets a 1-hour TTL on each rec entry. If Bedrock hasn't updated the cache in over an hour (e.g., Lambda errors), users fall back to the popular items list. Adjust the TTL based on how much staleness you can tolerate.

Cold start users. New users have no history, so the Bedrock prompt has nothing useful to reason over. I recommend a popularity-based fallback as shown in the serving Lambda, and switching to personalized recs after the user's first 3-5 interactions.

Wrapping Up

The pattern here is worth generalizing: keep the reasoning layer (Bedrock) fully off the hot serving path. Write results to a fast cache (DynamoDB), serve from the cache, and let the agent pipeline update it continuously in the background. This gives you the intelligence of an LLM-powered agent without the latency of one.

The same pattern applies to fraud scoring, content moderation queues, and ops alerting: anywhere you need a reasoning system that reacts to real-time streams without blocking the user experience.
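One caveat on the TTL-based fallback in the recommendation article above: DynamoDB removes expired items in a background process, not at the exact expiry time, so a read can still return an entry whose ttl has already passed. A minimal sketch of a guard the serving Lambda could apply is shown here. The helper names `is_expired` and `choose_recommendations` are hypothetical, and the `ttl` attribute is assumed to hold epoch seconds as written by the processor Lambda.

```python
import time
from typing import Optional, Tuple


def is_expired(item: dict, now: Optional[float] = None) -> bool:
    """True if the item's 'ttl' (epoch seconds) is at or before 'now'."""
    now = time.time() if now is None else now
    ttl = item.get('ttl')
    # int() also accepts the Decimal values DynamoDB returns for numbers
    return ttl is not None and int(ttl) <= now


def choose_recommendations(item: Optional[dict], fallback: list,
                           now: Optional[float] = None) -> Tuple[list, str]:
    """Serve cached recs only if present and not yet expired, else the fallback."""
    if item is None or is_expired(item, now):
        return fallback, 'fallback'
    return item['recommendations'], 'cache'
```

With a check like this, the fallback behavior depends on the ttl value itself, not on when the background deletion happens to run.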

Azure Databricks for Scalable MLOps and Feature Engineering With Apache Spark, Delta Lake, and MLflow

By Jubin Abhishek Soni


Raw data doesn't win model competitions. Features do. And when your raw data is tens of billions of rows sitting across multiple sources, you can't afford to run pandas in a notebook and call it a day.

In this tutorial, I'll walk through building a production-grade feature engineering pipeline on Azure Databricks using:

- Apache Spark for distributed transformation at scale
- Delta Lake for reliable, versioned feature storage with ACID guarantees
- MLflow for tracking feature pipeline runs, parameters, and the models trained on top of them

The use case is a customer churn prediction system, but the patterns apply to any ML feature pipeline.

Architecture Overview

The pipeline follows the Medallion Architecture, a layered approach where data gets progressively cleaner and more feature-ready as it moves from Bronze to Silver to Gold. MLflow sits across all three layers, tracking every run.

Pipeline Flow

[Figure: pipeline flow diagram]

Layer Breakdown

| Layer | Delta Table | What happens here | Typical latency |
|---|---|---|---|
| Bronze | churn.bronze.events | Raw ingest, no transforms, append only | Minutes |
| Silver | churn.silver.customers | Deduplication, null handling, schema enforcement | Minutes |
| Gold | churn.gold.features | Aggregations, window functions, encoding | Minutes to hours |
| MLflow Run | N/A | Training, metric logging, artifact storage | Hours |
| Registry | N/A | Versioned model store, stage promotion | On demand |

Step 1: Bronze Layer: Raw Ingest

The Bronze layer is append-only. No transforms. No business logic. Just get the data in and preserve it exactly as it arrived so you can always replay from source.
Python

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import current_timestamp, lit
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()

    # Read raw events from ADLS Gen2 / Event Hub / source of choice
    raw_events = spark.read.format('json').load(
        'abfss://<container>@<storage-account>.dfs.core.windows.net/events/'
    )

    # Add ingestion metadata; never mutate source columns
    bronze_df = (
        raw_events
        .withColumn('_ingested_at', current_timestamp())
        .withColumn('_source', lit('events_api'))
    )

    # Write to Bronze Delta table: append only, no overwrites
    (
        bronze_df.write
        .format('delta')
        .mode('append')
        .option('mergeSchema', 'true')
        .saveAsTable('churn.bronze.events')
    )

    print(f"Bronze rows written: {bronze_df.count()}")

Why append-only? If your downstream pipeline produces bad features, you want to replay from Bronze without re-ingesting from source. Overwriting Bronze breaks that ability.

Step 2: Silver Layer: Clean and Validate

Silver is where you enforce schema, handle nulls, deduplicate, and standardize. Think of it as your canonical, trusted dataset.
Python

    from pyspark.sql.functions import col, lit, to_timestamp, when, trim, upper
    from delta.tables import DeltaTable

    bronze = spark.table('churn.bronze.events')

    silver_df = (
        bronze
        .filter(col('customer_id').isNotNull())
        .filter(col('event_type').isNotNull())
        .dropDuplicates(['customer_id', 'event_id'])
        .withColumn('event_ts', to_timestamp(col('event_timestamp')))
        .withColumn('event_type', upper(trim(col('event_type'))))
        .withColumn(
            'country_code',
            when(col('country').isNull(), lit('UNKNOWN')).otherwise(upper(col('country'))),
        )
        .select(
            'customer_id', 'event_id', 'event_type', 'event_ts',
            'country_code', 'product_id', 'session_id', '_ingested_at',
        )
    )

    # Upsert into Silver using Delta MERGE; idempotent on re-runs.
    # tableExists is used here because DeltaTable.isDeltaTable expects a path, not a table name.
    if spark.catalog.tableExists('churn.silver.customers'):
        silver_table = DeltaTable.forName(spark, 'churn.silver.customers')
        silver_table.alias('tgt').merge(
            silver_df.alias('src'),
            'tgt.customer_id = src.customer_id AND tgt.event_id = src.event_id',
        ).whenNotMatchedInsertAll().execute()
    else:
        silver_df.write.format('delta').saveAsTable('churn.silver.customers')

    print(f"Silver table updated. Total rows: {spark.table('churn.silver.customers').count()}")

Step 3: Gold Layer: Feature Engineering

This is the heart of the pipeline. We compute aggregated, windowed, and encoded features that the model will actually train on.

Python

    from datetime import date

    from pyspark.sql.functions import (
        col, count, countDistinct, current_date, datediff, lit, when,
        sum as _sum, max as _max, min as _min,
    )

    silver = spark.table('churn.silver.customers')

    # 1. Aggregate features per customer over 30 / 90 day windows
    today = current_date()               # Spark Column, used inside expressions
    run_date = date.today().isoformat()  # plain string, used for the replaceWhere predicate

    agg_features = (
        silver
        .withColumn('days_since_event', datediff(today, col('event_ts')))
        .groupBy('customer_id')
        .agg(
            count('event_id').alias('total_events'),
            countDistinct('session_id').alias('total_sessions'),
            countDistinct('product_id').alias('distinct_products'),
            _sum(when(col('days_since_event') <= 30, 1).otherwise(0)).alias('events_last_30d'),
            _sum(when(col('days_since_event') <= 90, 1).otherwise(0)).alias('events_last_90d'),
            _max('event_ts').alias('last_event_ts'),
            _min('event_ts').alias('first_event_ts'),
        )
        .withColumn('days_since_last_event', datediff(today, col('last_event_ts')))
        .withColumn('customer_tenure_days', datediff(today, col('first_event_ts')))
        .withColumn('avg_events_per_day',
                    col('total_events') / (col('customer_tenure_days') + 1))
    )

    # 2. Encode churn risk tier as ordinal feature
    feature_df = (
        agg_features
        .withColumn(
            'recency_tier',
            when(col('days_since_last_event') <= 7, lit(3))      # active
            .when(col('days_since_last_event') <= 30, lit(2))    # at risk
            .otherwise(lit(1)),                                  # churned
        )
        .withColumn(
            'engagement_score',
            (col('events_last_30d') * 0.6 + col('events_last_90d') * 0.4)
            / (col('customer_tenure_days') + 1),
        )
    )

    # 3. Write to Gold feature store, replacing only today's feature_date slice.
    # The predicate needs a literal date string; formatting the current_date() Column
    # into the f-string would not produce a valid date.
    (
        feature_df
        .withColumn('feature_date', lit(run_date).cast('date'))
        .write
        .format('delta')
        .mode('overwrite')
        .option('replaceWhere', f"feature_date = '{run_date}'")
        .saveAsTable('churn.gold.features')
    )

    print(f"Gold features written: {feature_df.count()} customers")

Step 4: MLflow: Track the Training Run

With features in Gold, we hand off to MLflow to train, track, and register the model. Notice we log the Delta table version so we can always reproduce exactly which feature snapshot trained which model.

Python

    import mlflow
    import mlflow.sklearn
    from delta.tables import DeltaTable
    from mlflow.models.signature import infer_signature
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score, f1_score

    mlflow.set_experiment('/churn-prediction/feature-pipeline')

    # Read Gold features and capture the Delta version for reproducibility
    gold_table = DeltaTable.forName(spark, 'churn.gold.features')
    delta_version = gold_table.history(1).select('version').collect()[0][0]
    features_pdf = spark.table('churn.gold.features').toPandas()

    FEATURE_COLS = [
        'total_events', 'total_sessions', 'distinct_products',
        'events_last_30d', 'events_last_90d', 'days_since_last_event',
        'customer_tenure_days', 'avg_events_per_day',
        'recency_tier', 'engagement_score',
    ]
    # Label column. Step 3 as shown does not create it, so this assumes a
    # 'churned' label has been joined into the Gold table upstream.
    TARGET = 'churned'

    X = features_pdf[FEATURE_COLS]
    y = features_pdf[TARGET]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    with mlflow.start_run(run_name=f'gbm-features-v{delta_version}') as run:
        params = {'n_estimators': 200, 'max_depth': 5, 'learning_rate': 0.05}
        model = GradientBoostingClassifier(**params, random_state=42)
        model.fit(X_train, y_train)

        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]

        # Log everything
        mlflow.log_params(params)
        mlflow.log_metric('roc_auc', roc_auc_score(y_test, y_prob))
        mlflow.log_metric('f1_score', f1_score(y_test, y_pred))
        mlflow.log_param('delta_feature_version', delta_version)
        mlflow.log_param('feature_columns', FEATURE_COLS)
        mlflow.log_param('training_rows', len(X_train))

        # Log model with signature (inputs and outputs taken from the same rows)
        signature = infer_signature(X_train, model.predict(X_train))
        mlflow.sklearn.log_model(
            model,
            artifact_path='churn-gbm',
            signature=signature,
            registered_model_name='churn-prediction-gbm',
        )

    print(f"Run ID: {run.info.run_id}")
    print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
    print(f"Feature Delta version logged: {delta_version}")

Bonus: Delta Lake Time Travel for Feature Reproducibility

One of the best things about Delta Lake is time travel. If a model behaves unexpectedly in production, you can reload the exact feature snapshot it was trained on.

Python

    # Reload the exact feature version that trained a specific model run
    import mlflow

    run = mlflow.get_run('your-run-id-here')
    feature_version = int(run.data.params['delta_feature_version'])

    # Rehydrate that exact feature snapshot
    historical_features = (
        spark.read
        .format('delta')
        .option('versionAsOf', feature_version)
        .table('churn.gold.features')
    )

    print(f"Loaded feature snapshot from Delta version {feature_version}")
    print(f"Row count: {historical_features.count()}")
    # You can now retrain on the exact same data to reproduce the result

Service Comparison

| Tool | Role in pipeline | Why not the alternative |
|---|---|---|
| Apache Spark | Distributed feature computation | Pandas (single node, OOM at scale), Dask (less native Databricks integration) |
| Delta Lake | Feature storage with versioning | Parquet (no ACID, no time travel), Hive tables (no merge support) |
| MLflow Tracking | Experiment and param logging | Manual logging (not reproducible), W&B (extra cost, less native on Databricks) |
| MLflow Registry | Model versioning and promotion | Custom model store (more ops overhead) |
| Medallion Architecture | Pipeline layer separation | Flat pipelines (hard to debug, no replay capability) |
| Delta MERGE | Idempotent Silver upserts | Overwrite (destroys history), append (creates duplicates) |

Things to Watch in Production

Shuffle partitions matter. Spark defaults to 200 shuffle partitions, which is fine for small data but will bottleneck at scale. Set spark.conf.set("spark.sql.shuffle.partitions", "auto") on Databricks Runtime 10+ or tune it manually to 2-3x your core count.

Z-ordering on Gold features. If you're querying Gold by customer_id frequently, add OPTIMIZE churn.gold.features ZORDER BY (customer_id) after the write. This co-locates related data and can cut query times substantially on large tables.

Log Delta version in every MLflow run. This is non-negotiable for reproducibility. Without it you can't prove which feature snapshot trained which model, which becomes a compliance problem in regulated industries.

Cluster autoscaling for feature jobs. Feature engineering jobs tend to have spiky resource needs (big during aggregation, small during writes). Enable autoscaling on your Databricks cluster and set a min/max node count rather than a fixed size.

Wrapping Up

The combination of Spark, Delta Lake, and MLflow on Databricks gives you a feature engineering pipeline that is reproducible (Delta time travel + MLflow param logging), scalable (Spark handles billions of rows), and auditable (every run is tracked, every feature version is stored).

The Medallion Architecture keeps the pipeline modular: you can rerun just the Gold layer if you change a feature definition without touching Bronze or Silver, and MLflow ties model performance back to the exact feature version that produced it.
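The recency tier and engagement score defined in the Gold layer of the churn pipeline above are easy to get subtly wrong at the boundaries, so it can help to keep a plain-Python reference of the same rules for unit tests. The sketch below mirrors the thresholds and weights from the Spark expressions; the function names are hypothetical and this is a test aid, not part of the pipeline itself.

```python
def recency_tier(days_since_last_event: int) -> int:
    """Ordinal tier: 3 = active (<= 7 days), 2 = at risk (<= 30 days), 1 = churned."""
    if days_since_last_event <= 7:
        return 3
    if days_since_last_event <= 30:
        return 2
    return 1


def engagement_score(events_last_30d: int, events_last_90d: int,
                     customer_tenure_days: int) -> float:
    """Weighted recent activity, normalized by tenure (+1 avoids division by zero)."""
    weighted = events_last_30d * 0.6 + events_last_90d * 0.4
    return weighted / (customer_tenure_days + 1)


# Example: 10 events in the last 30 days, 20 in the last 90, 99 days of tenure
score = engagement_score(10, 20, 99)
```

Note that events_last_90d already includes the last 30 days, so the most recent events contribute to both terms of the score.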

Beyond Root Cause: Building Effective Blameless Postmortems for Cloud-Native Systems

By Akshay Pratinav

High-Cardinality Threat Detection: Why MapReduce Breaks and Heuristics Win

By Karanpreet Singh

One Stolen Key, One Stolen Token: Why Machine Identity Is Cloud-Native's Quietest Crisis — and the Only Fix That Actually Holds

By Igboanugo David Ugochukwu


Building Production-Safe Agentic Remediation With Docker MCP Gateway: Lessons From 43% to 100% Accuracy

Our first version was wrong 57% of the time. Not because the AI model couldn't identify Docker container failure scenarios (it usually could). The failures occurred at the decision boundary: determining when an automated action was appropriate, when escalation was required, and when no action should be taken.

Over several weeks, we built and evaluated an AI-assisted remediation system on Docker MCP Gateway across four container failure scenarios, improving decision correctness from 43% to 100%. What we learned surprised us: the hard problem is not teaching the agent to act. The hard problem is defining and enforcing the boundary where the agent must stop acting.

The project reinforced a broader lesson: production-safe AI is less about model intelligence and more about engineering explicit policies, validation mechanisms, and execution controls. This article covers what we built, what failed, and the engineering changes that improved correctness. The full code, audit logs, validation datasets, and analyzer scripts are all in the companion repository.

Why Naive Auto-Remediation Is Dangerous

The most common mistake in AI-driven operations is treating "AI can fix things" as the goal. It isn't. A remediation system that attempts to fix every incident automatically is often worse than having no automation at all. Consider the failure modes:

An automatic restart of a CrashLoopBackOff container does not fix the underlying problem; it simply generates more alerts. The container will fail again because the code or configuration issue remains unchanged. The result is additional operational noise without any meaningful remediation.

Automatically increasing memory limits for every OOM event can be equally problematic. The workload continues running, but the underlying memory leak remains hidden. Months later, teams may find themselves running multi-gigabyte containers that should have been consuming a fraction of those resources.
Automated remediation without an audit trail creates a different problem: a lack of accountability. Without structured records, it becomes impossible to determine what actions were taken, what actions were considered, and why a particular remediation path was selected. "The AI fixed it" is not a useful postmortem entry.

The safest remediation systems are not the ones that automate the most actions. They are the ones with clearly defined operational boundaries, explicit escalation rules, and auditable decision paths. The engineering challenge is not maximizing automation; it is determining where automation should stop.

According to Mohammad-Ali A'râbi, Docker Captain:

One of the most dangerous assumptions teams can make is treating a language model as if it were an experienced senior site reliability engineer. It is not. A language model may generate useful recommendations, but it has no operational accountability. It does not understand business context, service ownership, deployment history, or the downstream consequences of an action. Any system granted the ability to modify production infrastructure must therefore be treated as an untrusted component operating behind strict controls.

The container ecosystem learned this lesson years ago through the principle of least privilege. We stopped running containers as root whenever possible. We reduced Linux capabilities to the minimum required set. We learned that mounting Docker sockets into containers for convenience often created unacceptable security risks. The common theme was simple: convenience should not bypass security boundaries.

The same principle applies to operational automation. Granting unrestricted access to restart workloads, modify resource limits, or execute privileged actions without meaningful controls introduces unnecessary risk. The challenge is not improving the quality of recommendations. The challenge is ensuring that every action is constrained, observable, and reversible.
This is where Docker MCP Gateway becomes valuable. Rather than allowing direct access to infrastructure operations, the Gateway places a controlled execution layer between the decision-making component and the underlying tools. Authentication, rate limiting, audit logging, input validation, and execution isolation are applied consistently before any action is performed.

In our implementation, every tool invocation passed through HMAC authentication, Redis-backed rate limiting, structured audit logging, and containerized execution. These controls were not added as enhancements; they were treated as core design requirements.

Production systems already rely on admission controllers, access controls, audit trails, and policy enforcement. Operational automation should be held to the same standard. Access to credentials should remain isolated from the decision-making layer. Direct access to host resources should be minimized. Every action should be traceable and reviewable.

The more authority a system is given, the more important it becomes to enforce clear operational boundaries. Reliable automation depends less on unrestricted capability and more on well-defined constraints.

What Docker MCP Gateway Gives You

At a high level, Docker MCP Gateway acts as a secure control plane between AI agents and MCP tools, enforcing authentication, rate limits, audit logging, and execution isolation for every tool call.

The Model Context Protocol (MCP) is an open standard introduced by Anthropic in late 2024 that gives AI applications a uniform interface for invoking external tools and services. It has since gained support across multiple vendors, including Anthropic, OpenAI, Google DeepMind, and AWS.

MCP solves the protocol problem. It doesn't solve the production problem.
Production systems require controls around tool execution, not just a standardized way to invoke tools:

- Authenticated tool calls (not just "the agent has the API key in plaintext somewhere")
- Rate limiting (agents can spiral fast)
- Audit logging of every decision
- Containerized tool isolation (so a misbehaving tool can't take down its host)
- Centralized policy enforcement (so adding a new server doesn't require reconfiguring every client)

Docker MCP Gateway provides these operational controls. It sits between AI clients and MCP servers, routing every tool invocation through a centralized enforcement layer that handles authentication, policy enforcement, rate limiting, and execution isolation.

For our work, we built a custom MCP server inside Docker that exposes three remediation tools: check_container_logs, restart_container, and update_container_resources. Every request passes through HMAC authentication, is rate-limited using Redis, and is recorded in a structured JSON audit log before execution.

From Mohammad-Ali A'râbi, Docker Captain:

Docker's AI tooling strategy is fundamentally about building a verifiable supply chain for reasoning engines. You cannot build secure AI on top of bloated, vulnerable foundations. The strategy begins with Docker Hardened Images (DHI), providing agents and MCP servers with minimal attack-surface base images backed by cryptographically signed SLSA Level 3 provenance. The Docker Hub MCP then acts as a discovery layer, allowing agents to find and navigate trusted container artifacts through natural-language interactions. From there, these components converge into Docker AI Governance, where MicroVM-based sandboxes apply strict, deny-by-default controls over filesystem access, network connectivity, and tool execution. Together, these capabilities represent a broader architectural shift from securing application code to securing an agent's entire operational blast radius.
Recent supply-chain attacks such as Shai-Hulud 2.0 have shown that modern attackers increasingly target the automation layers that underpin software delivery. AI agents now operate inside those same environments, making blast-radius reduction a first-class architectural concern.

A Decision Framework: When to Auto-Fix vs. Escalate

Before implementing any automation, we documented the expected behavior for each failure mode. This was not a planning exercise; it became the specification the system had to satisfy and later served as the foundation for our validation framework.

| Failure Type | Likely Cause | Safe Action |
|---|---|---|
| OOMKilled | Resource exhaustion (often legitimate) | Auto-fix: increase memory |
| CrashLoopBackOff | Code or configuration bug | Escalate; never auto-restart |
| Single Exit (code 1) | Could be transient (network, DB) or persistent | Try restart once, escalate if it persists |
| HealthCheckFailure | App stuck or deadlocked | Auto-fix: restart |

The guiding principle was simple: transient and resource-related failures could be remediated automatically, while persistent application and configuration failures required escalation.

Transient and resource-driven failures auto-fix. Persistent and code-driven failures escalate. Every decision is logged.

This framing matters more than the implementation. It's the part you should keep even if you replace every other piece of the system. The agent's job isn't to be smart; it's to apply this rule consistently and visibly. We chose to encode this in the agent's system prompt rather than in code branching, which turned out to be one of our most important design decisions. More on that below.

The Architecture in Practice

The system has five logical layers running across three Docker Compose containers:

[Figure: Five-layer architecture: container failure triggers the AI agent, which routes every tool call through the Docker MCP Gateway security pipeline before reaching MCP Tools and the Docker API.]

The architecture separates concerns into five layers.
The AutoGen agent (GPT-3.5-turbo, cost-optimized for this decision space) handles reasoning and decision-making. The Docker MCP Gateway sits in front of the tools as a security enforcement point: every tool call passes through HMAC authentication, Redis-backed rate limiting (100 requests/hour), input validation, and structured audit logging. The MCP Tools layer exposes three remediation actions: check_container_logs, restart_container, and update_container_resources. Below that, the Docker API performs the actual container operations. In our current implementation, the Gateway and Tools layers are colocated in a single Python service for simplicity; in a multi-tenant production setup you'd separate them into distinct services that scale independently.

Every tool call generates an audit log entry like this:

JSON
{
  "timestamp": "2026-05-07T02:08:15.456Z",
  "incident_id": "inc-20260507-020815",
  "agent_id": "docker-ops-agent-001",
  "alert": {
    "description": "Docker container crashed with OOMKilled",
    "container_id": "nginx-oom-test",
    "status": "OOMKilled"
  },
  "decision_chain": [
    {"tool": "check_container_logs", "result": "..."},
    {"tool": "update_container_resources", "result": "Memory limit updated to 200MB"}
  ],
  "resolved": true
}

That structured output is what makes the system auditable. It's also what makes our validation work possible.

The Engineering Reality: 43% to 100%

Across 7 development-phase incidents, our agent made the correct decision 43% of the time. Across 6 validation-phase incidents after applying our fixes, it was correct 100% of the time. Both datasets are committed in the repo's monitoring/analysis directory.

Phase | Runs | Correct | Avg Turns/Incident
Before fixes | 7 | 3/7 (43%) | 22.7
After fixes | 6 | 6/6 (100%) | 11.7

A note on sample size: this is a small dataset. It's enough to show the expected behavior is reproducible across the four scenarios, but not enough to make claims about reliability under load or at scale.
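To make the idea of validating audit logs against the decision framework concrete, here is a minimal sketch of such a check. It is not the analyzer from the companion repository. The outcome labels, the EXPECTED mapping, and the rule that "resolved": true means an auto-fix are all assumptions made for illustration. The conditional "Single Exit (code 1)" case is left out and would be reported as undocumented.

```python
# Expected outcome per failure type, taken from the decision table above.
# The status strings and outcome labels are assumptions for this sketch.
EXPECTED = {
    "OOMKilled": "auto_fix",
    "HealthCheckFailure": "auto_fix",
    "CrashLoopBackOff": "escalate",
}

# Tools that change container state (as opposed to read-only diagnosis).
MUTATING_TOOLS = {"restart_container", "update_container_resources"}


def classify(entry: dict) -> str:
    """Derive what the agent did; assumes resolved=True means an auto-fix."""
    return "auto_fix" if entry.get("resolved") else "escalate"


def check_entry(entry: dict) -> list[str]:
    """Return the list of policy violations for one audit log entry."""
    status = entry["alert"]["status"]
    tools = [step["tool"] for step in entry.get("decision_chain", [])]
    problems = []

    expected = EXPECTED.get(status)
    if expected is None:
        problems.append(f"no expected behavior documented for {status}")
    elif classify(entry) != expected:
        problems.append(f"{status}: expected {expected}, got {classify(entry)}")

    # Escalation-only failures must not trigger any state-changing tool.
    if expected == "escalate":
        used = MUTATING_TOOLS.intersection(tools)
        if used:
            problems.append(f"{status}: mutating tools used: {sorted(used)}")
    return problems


def analyze(entries: list[dict]) -> dict:
    """Summarize correctness across a list of audit log entries."""
    failures = {}
    for entry in entries:
        problems = check_entry(entry)
        if problems:
            failures[entry["incident_id"]] = problems
    return {
        "total": len(entries),
        "correct": len(entries) - len(failures),
        "failures": failures,
    }
```

Because the check reads only the structured fields (alert.status, decision_chain, resolved), it can run over any batch of entries in this format without parsing free text.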
What changed between the two phases is documented as nine challenges in the lab README. Three of them drove most of the improvement. Here they are. Challenge A: The OOM That Couldn't Be Fixed In the early runs, the agent correctly diagnosed an OOMKilled container, called the memory-update tool, and got back this Docker error: Plain Text Memory limit should be smaller than already set memoryswap limit, update the memoryswap at the same time Then it correctly escalated, because it had no tool for updating memoryswap. Our analyzer marked this as wrong because the OOMKilled scenario expected AutoResolved, not Escalated. But the agent's logic was right. The bug wasn't in the agent — it was in our test container's --memory-swap configuration. Once we fixed that (set --memory-swap=-1 for unlimited swap), the agent's behavior didn't change at all. The same logic that escalated correctly before now succeeded correctly. The agent went from 0/2 to 2/2 correct. Lesson: When the agent makes the right decision but your tests say it's wrong, check the test setup before blaming the agent. We spent a few hours debugging the agent before realizing our own container configuration was the problem. Challenge B: The Over-Eager Restart In the first three CrashLoopBackOff runs, the agent restarted the container 2 out of 3 times. CrashLoopBackOff is exactly the failure mode where you should never restart — the container is crashing because of a code or config bug, not a transient state. Restarting just generates more crashes. We almost wrote a code branch for it: add a check, route CrashLoopBackOff to a different path. Before doing that, we tried tightening the system prompt instead: Plain Text For CrashLoopBackOff failures: ALWAYS escalate to a human operator. NEVER attempt to restart the container. Restarting will only cause the container to crash again. Your role is to diagnose and report, not to fix. 
That single change — no code, just words in the prompt — made the agent consistently escalate on every subsequent run. Lesson: If you want the agent to follow a rule, write the rule down in the system prompt. Don't leave it to the model to figure out. We spent more time arguing about whether to add code branching than the prompt change actually took. Challenge C: The Hallucinated Containers After resolving real incidents, the agent started making up alerts for containers that didn't exist — memory-hungry-app, app-crash-loop, none of which were ever in our system. It was inventing failures and then "responding" to them. Root cause: AutoGen's max_consecutive_auto_reply was set to 10. After the agent finished a real incident, the conversation framework kept giving it turns. Without a real prompt to respond to, it generated plausible-looking next incidents and walked itself through fake remediations. Fix: drop max_consecutive_auto_reply to 3. The agent gets exactly enough turns to diagnose, act, and report — then the conversation ends. Lesson: AutoGen and similar frameworks default to long conversations because they're built for chat use cases. For production, you want them to stop talking once the job is done. From Mohammad-Ali A'râbi, Docker Captain: The progression from 43% to 100% correctness reinforced a key lesson: production AI is often less a machine-learning problem; it is a systems engineering challenge. The initial failures were not the fault of the LLM; they were the result of implicit, undocumented policies and permissive execution environments. Production AI engineering requires moving past the "magic" of conversational models and returning to a rigorous, deterministic engineering discipline. It means treating the system prompt as an immutable policy file, writing explicit, boundary-defining rules that leave zero room for the model to improvise. 
It means enforcing aggressive Redis-backed rate limits to prevent hallucination loops, isolating execution tools to eliminate docker.sock vulnerabilities, and relying exclusively on structured JSON audit logs rather than plain text for forensic validation. The agent is merely a component. The surrounding infrastructure — the cryptographic constraints, the isolated execution environments, and the hardcoded fallbacks — is what actually makes the system safe. Building trust in AI demands the exact same rigor we apply to cluster security: trust nothing, verify everything, and strictly log the rest. Production Patterns We'd Recommend If you're building something similar with Docker MCP Gateway, here's what we'd carry over from our nine challenges: Authenticate every tool call, even in dev. We used HMAC signing on every request from agent to MCP server. The reason to do this early isn't just production security — it surfaces auth integration bugs during development, when they're cheaper to fix. Use structured JSON for audit logs, not text. The audit format we used (incident ID, agent ID, alert, decision chain, resolved flag) made it possible to write an analyzer that validates agent behavior automatically. Plain text logs would have made that impossible. Set rate limit low. We used Redis with 100 requests per hour per agent. Agents can make a lot of tool calls quickly — a single bug in the system prompt triggered thousands of calls in one of our early runs before we noticed. Default to escalation when uncertain. A false-positive escalation costs you a page that turns out to be nothing. A false-negative auto-fix can mask a real problem for weeks. The costs aren't symmetric, so the default shouldn't be either. Validate against expected behavior. Write down what you expect each failure mode to do, then write an analyzer that checks the audit log against that spec. We open-sourced ours — it's about 250 lines of Python, no external dependencies. 
You can adapt it to any agent that produces structured audit logs. Tighten conversation turn limits. max_consecutive_auto_reply=3 is a sane starting point for production. The agent should do its job and then the conversation should end. Frameworks default to longer because they're optimized for conversational AI demos, not production ops. What's Still Missing This article would be marketing if we didn't include this section. Honest engineering means owning what isn't built yet. No Docker Scout MCP server exists yet. Security-aware container discovery — "find the most secure nginx tag," "show me CVEs in this image" — isn't possible through MCP today. The Docker Hub MCP server has 13 tools, but none of them surface vulnerability data. This is a real gap in the ecosystem. No incident memory or pattern recognition. Our agent treats every incident as fresh. A production system would learn that this container OOMs every Tuesday at 4 pm and recommend a permanent memory increase rather than reactively bumping it each time. We've left this as future work. Sample sizes are small. Our 6 post-fix incidents prove the expected behavior is reproducible across the four scenarios. They don't prove reliability under production load, traffic spikes, or adversarial conditions. We'd need 100x more data and load testing to make those claims. MTTR is unmeasured. AutoGen records all decision-chain timestamps within microseconds of each other, so the per-incident duration data we collected isn't usable as a real mean-time-to-recovery metric. Capturing real MTTR would require external timing instrumentation around the agent. Gateway and tools are colocated. Our MCP server bundles the security pipeline (HMAC, rate limiting, audit) with the tool execution. In a true multi-tenant production setup, you'd separate these into distinct services so they can scale independently. 
Our current architecture is fine for a single team or environment; it would need refactoring before serving multiple agent populations. What This Means for AI Infrastructure The interesting part of building agentic infrastructure isn't getting the agent to act. It's getting it to not act when acting would make things worse. Docker MCP Gateway is one of the first production tools that takes this seriously — treating the infrastructure around the agent as the security layer, not the agent itself. The pattern we ended up with — a Gateway in front, scoped tools, decision boundaries written into the system prompt, structured audit logs — isn't novel. It's just what worked. We expect most production AI agents will end up looking similar, because this is what makes them debuggable when something goes wrong. The nine challenges we documented in the lab README are probably challenges you'll hit too. The analyzer script, the audit log format, and the validation patterns are all MIT-licensed in the companion repository. Use whatever's useful. This article was originally published on OpsCart.

By Mohammad-Ali Arabi

Selective Deployment in Azure Data Factory: A Practical Blueprint for Safer CI/CD

Picture this: two features are being developed in parallel.

- One has already been tested in lower environments but is still awaiting business approval
- The other is fully validated and ready to go live

Naturally, you want to release the second feature to production. But you can’t, because your deployment model forces you to release everything together. If you’ve worked with Azure Data Factory (ADF), this situation probably sounds familiar.

ADF is a cloud-based data integration service from Microsoft that helps you build and orchestrate data pipelines across systems. It works extremely well for managing data workflows, but when it comes to deployments at scale, things get tricky.

As our ADF usage grew across multiple teams and environments, we started running into a recurring problem:

- We had control over development, but very little control over what actually got deployed
- A simple pipeline fix could unintentionally introduce unrelated changes
- Parallel feature development became harder to manage
- Production releases became riskier than they needed to be

That’s when we realized: the issue wasn’t ADF itself; it was the deployment model we were relying on.

This article walks through how we addressed that challenge by implementing a selective deployment pattern, allowing us to promote only intended changes without impacting everything else.

The Real Problem: Parallel Feature Releases in ADF

Before diving into the solution, let’s look at a scenario that frequently occurs in real-world teams.
What This Diagram Represents

This diagram shows two features progressing across environments:

Feature 100
- Developed earlier, successfully deployed to Dev and Test
- Currently in UAT (User Acceptance Testing)
- Still awaiting business approval before production

Feature 200
- Developed later, successfully completed across Dev → Test → UAT
- Fully validated and ready for production

Expected Behavior

At this stage, the expectation is straightforward: “Let’s release Feature 200 to production.” Feature 100 is still under testing, so it should remain in UAT.

What Actually Happens in ADF

Azure Data Factory follows a full-state deployment model. That means when you deploy, you are not deploying a feature; you are deploying the entire factory state. So when you attempt to release Feature 200:

- Feature 100 gets included automatically
- You cannot isolate Feature 200
- You lose control over what reaches production

Why This Becomes a Real Problem

This isn’t an edge case; it becomes a recurring pattern in larger environments. You’ll encounter this when:

- Multiple teams are working in parallel
- Features move at different speeds
- UAT cycles vary
- Production fixes need to be released quickly

It becomes even more complex when:

- Existing production pipelines are modified
- Partial updates are required
- Dependencies overlap across features

The Core Limitation: ADF promotes state, not intent. It does not differentiate between what is ready for production and what is still under testing.

Why We Had to Rethink Deployment

This limitation introduced real risks:

- Accidental promotion of incomplete features
- Delayed production releases
- Increased coordination overhead
- Higher chances of breaking stable pipelines

We needed a way to:

- Promote only Feature 200
- Keep Feature 100 in UAT
- Avoid impacting unrelated artifacts
- Reduce production risk

Architecture Overview

To address this challenge, we introduced a selective packaging layer between build and deployment.
Flow

Feature Branch → PR → Validate → Selective Packaging → ARM Export → Incremental Deploy → Trigger Control

Key Idea: Instead of exporting ARM templates from the full ADF repository, we export from a filtered staging folder containing only the required artifacts.

Understanding Default ADF Deployment Behavior

Before implementing selective deployment, it’s important to understand how Azure Data Factory works by default. ADF follows a full-state deployment model.

How Default ADF Deployment Works

When you use ADF with Git integration:

- Developers work in a collaboration branch (typically main)
- Changes are committed and merged via pull requests
- ADF provides a Publish button in the UI

When you click Publish, ADF generates ARM templates representing the entire factory state. These templates are stored in the adf_publish branch.

In modern setups, instead of clicking Publish manually, teams often use @microsoft/azure-data-factory-utilities (npm-based export). This allows pipelines to validate ADF resources and export ARM templates programmatically.
YAML
- name: Validate ADF resources
  run: |
    set -euo pipefail
    # Full resource ID of the factory to validate against
    FACTORY_ID="/subscriptions/${{ env.SUBSCRIPTION_ID }}/resourceGroups/${{ env.RESOURCE_GROUP }}/providers/Microsoft.DataFactory/factories/${{ env.SOURCE_FACTORY_NAME }}"
    npm run build validate "${{ github.workspace }}" "$FACTORY_ID"

YAML
- name: Export ARM templates (CI publish)
  run: |
    set -euo pipefail
    # Full resource ID of the factory used for the export
    FACTORY_ID="/subscriptions/${{ env.SUBSCRIPTION_ID }}/resourceGroups/${{ env.RESOURCE_GROUP }}/providers/Microsoft.DataFactory/factories/${{ env.DEV_FACTORY_NAME }}"
    npm run build export "${{ github.workspace }}" "$FACTORY_ID" "${{ env.ARM_OUTPUT_DIR }}"

Whether you click Publish manually or use npm export in CI/CD, the outcome is the same:

- Full factory deployment
- No control over individual features
- All changes get bundled together

Selective Deployment Layer (Core Design)

We addressed this with a manifest-driven workflow: a manifest defines the deployment scope, and a program identifies all the ADF dependencies required by that manifest. As a developer, I can now control which release is promoted to production, without worrying about releasing other features that are not ready. The manifest controls which pipelines to deploy and which optional categories to include. Below is an example of a manifest file:

JSON
{
  "pipelines": ["pl_ingest_population_selective"],
  "includeTriggers": false,
  "includeIntegrationRuntimes": false,
  "includeAllGlobalParameters": true,
  "includeLinkedServices": true,
  "validateLinkedServicesExist": true,
  "includeManagedVirtualNetwork": false,
  "includeManagedPrivateEndpoints": false
}

Workflow Explanation

Let's now walk through the crux of the selective deployment workflow. I am working on my feature branch for the release directly in ADF Studio. Since ADF Studio is integrated with Git, my development changes are saved to my branch. Here are the steps I can take to promote my change to a higher environment.
1) Validation of ADF on PR

This is an early validation step and a guardrail: if the PR check fails, it is because objects are invalid or misaligned. It is equivalent to the "Validate all" button in the ADF UI. Here is the workflow:

Trigger: Pull requests targeting the branch selective_deployment.
Purpose: Validate that the ADF JSON in the PR is valid in the context of the target factory.
Main steps:

- Checkout
- Set up Node.js 20
- npm install
- Azure login using OIDC (azure/login@v2)
- Validate with ADF Utilities:

YAML
FACTORY_ID="/subscriptions/${AZURE_SUBSCRIPTION_ID}/resourceGroups/${AZURE_RESOURCE_GROUP}/providers/Microsoft.DataFactory/factories/${DEV_FACTORY_NAME}"
npm run build validate "$GITHUB_WORKSPACE" "$FACTORY_ID"

2) Release build + selective deploy to DEV

Workflow file: adf-release-build-selective-deploy.yml

Triggers:

- Push to selective_deployment
- Manual run (workflow_dispatch) with optional manifest input (default: deploy/manifests/release.json)

This workflow has two jobs:

Job A: adf-build (staging + export + sanitize + artifacts)

- Checkout (full history)
- Azure login using OIDC
- Set up Node.js 20
- Install build dependencies inside build/ (npm install in build)
- Stage the selective subset with python scripts/select_adf_subset.py <manifest>. A snippet is shown below; for the complete script, refer to the GitHub repository link given.

Python
# Excerpt from select_adf_subset.py; the complete script is in the repository.
import json
import re
import shutil
import sys
from pathlib import Path
from typing import Dict, Set, Tuple, List
from collections import defaultdict

# Your repo layout has pipeline/, dataset/, linkedService/ at ROOT.
REPO_ROOT = Path(".")
STAGE_ROOT = Path("build/adf_subset")

# One folder per ADF resource category.
RESOURCE_DIRS = {
    "pipeline": REPO_ROOT / "pipeline",
    "dataset": REPO_ROOT / "dataset",
    "linkedService": REPO_ROOT / "linkedService",
    "dataflow": REPO_ROOT / "dataflow",
    "trigger": REPO_ROOT / "trigger",
    "integrationRuntime": REPO_ROOT / "integrationRuntime",
    "credential": REPO_ROOT / "credential",
    "managedVirtualNetwork": REPO_ROOT / "managedVirtualNetwork",
}

# Copy these if present so ADF utilities behave the same on staged subset.
ROOT_FILES_TO_COPY = [
    "publish_config.json",
    "arm-template-parameters-definition.json",
    "arm_template_parameters-definition.json",
    "package.json",
    "package-lock.json",
]

  Produces: build/adf_subset/ (staged tree) and build/adf_subset_report.json (dependency report). The workflow logs show the output of the staging step, plus a debug listing of what select_adf_subset.py generated.
- Export ARM templates from the staged subset via ADF Utilities: npm --prefix build run build -- export "adf_subset" "$FACTORY_ID" "ArmTemplate"
  Produces: build/ArmTemplate/ARMTemplateForFactory.json and build/ArmTemplate/ARMTemplateParametersForFactory.json
- Strip infra-owned resources with scripts/strip_arm_resources.py to produce a safe template: build/ArmTemplate/ARMTemplateForFactory.safe.json

⚠️ Note on Infrastructure Components (Refer to the “Future Work & Next Steps” section for follow-up topics in this series)

The step above intentionally strips infrastructure-dependent components from the generated subset to avoid overwriting existing shared resources such as linked services.
This implementation focuses on developer-owned artifacts (pipelines, datasets, and triggers) and assumes that infrastructure components, such as Integration Runtimes, managed private endpoints, and linked services, are pre-provisioned and managed outside of this deployment workflow.

- Upload artifacts: ARM templates (adf-arm), metadata (adf-release-meta), subset report (adf-subset-report)

Job B: deploy_dev (deploy safe template)

- Download ARM artifact
- Azure login using OIDC
- Ensure az Data Factory extension is installed
- Validate JSON files exist/parse
- Deploy via azure/arm-deploy@v2 (Incremental) to DEV RG/factory:
  Template: ARMTemplateForFactory.safe.json
  Parameters: ARMTemplateParametersForFactory.json + factoryName=<DEV_FACTORY_NAME>

Lessons Learned

Setting up selective deployment in ADF was more than a technical task. It made us rethink our approach to deployments, ownership, and CI/CD design. Here are the main things we learned:

1. The Problem Is Not Tooling; It’s Deployment Granularity

At first, we thought the limitation came from the tools we used, like UI publish or npm export. However, both methods yielded the same result: full factory templates. The real problem was that we couldn’t control the scope of deployments, not how the templates were made.

2. Dependency Awareness Is Critical

Selective deployment only works when every dependency is found and included. We learned that:

- Pipelines often reference multiple datasets and linked services.
- Missing even one dependency results in deployment failure.
- You must automate dependency discovery.

3. “Incremental” Is Often Misunderstood

Incremental deployment is important, but it doesn’t work like a patch. It reapplies the full configuration for all included resources. This means:

- Your generated templates need to be complete for all the artifacts you include.
- If you use partial definitions, deployments can fail.

4. Separation of Concerns Is Key

Not all ADF artifacts are the same.
We began to separate them into different groups:

- Application-owned artifacts: pipelines, datasets, triggers
- Infrastructure-owned artifacts: linked services, managed virtual networks, managed private endpoints, and integration runtimes, among others

This separation proved crucial for safe, scalable deployments.

5. Selective Deployment Adds Complexity, But It’s Worth It

It’s true that implementing this approach brings in additional scripts, manifest management, and CI/CD complexity. But in exchange, we gained precise control over releases, reduced production risk, and faster hotfix deployments.

Future Work and Next Steps

While selective deployment solved a major gap in ADF CI/CD, it also opened up new areas for improvement and standardization.

1. Defining Infrastructure vs Application Ownership

One of the biggest follow-up areas is clearly defining ownership boundaries. In our experience:

- Application teams should own pipelines, datasets, and triggers
- Platform or infrastructure teams should own linked services, managed virtual networks, and managed private endpoints, among other things

Future work can focus on:

- Enforcing this separation in CI/CD
- Preventing accidental deployment of infrastructure components
- Integrating Terraform or platform pipelines for infrastructure provisioning

2. Governance Around Linked Services

Linked services are often shared across multiple pipelines and teams. Future improvements include:

- Centralizing linked service management
- Using Key Vault and Managed Identity consistently
- Preventing direct modifications through application pipelines
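One way to enforce the ownership split in CI/CD is to filter an exported ARM template by resource type before deployment. The sketch below is illustrative and is not the strip_arm_resources.py script from the repository. The function name split_by_ownership is hypothetical, and it assumes that dependsOn entries contain path segments such as '/linkedServices/<name>', so check that against your own exported templates.

```python
# ARM resource types treated as infrastructure-owned in this sketch.
# The split mirrors the ownership groups described above; adjust to your policy.
INFRA_OWNED_TYPES = {
    "Microsoft.DataFactory/factories/linkedServices",
    "Microsoft.DataFactory/factories/integrationRuntimes",
    "Microsoft.DataFactory/factories/managedVirtualNetworks",
    "Microsoft.DataFactory/factories/managedVirtualNetworks/managedPrivateEndpoints",
}

# Path segments used to spot dependsOn references to stripped resources.
# Assumption: dependsOn strings embed the resource path, e.g. '/linkedServices/ls_blob'.
INFRA_SEGMENTS = ("/linkedServices/", "/integrationRuntimes/", "/managedVirtualNetworks/")


def split_by_ownership(template: dict) -> tuple[dict, list[str]]:
    """Return (safe_template, stripped_names) without mutating the input."""
    kept, stripped = [], []
    for resource in template.get("resources", []):
        if resource.get("type") in INFRA_OWNED_TYPES:
            stripped.append(resource.get("name", "<unnamed>"))
            continue
        pruned = dict(resource)
        if "dependsOn" in pruned:
            # A dependsOn entry that points at a resource no longer in the
            # template would fail template validation, so drop those entries.
            pruned["dependsOn"] = [
                dep for dep in pruned["dependsOn"]
                if not any(segment in dep for segment in INFRA_SEGMENTS)
            ]
        kept.append(pruned)
    safe = dict(template)
    safe["resources"] = kept
    return safe, stripped
```

Returning the list of stripped names lets the pipeline log exactly what was held back, or fail the build if an application change unexpectedly touched an infrastructure-owned resource.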

By Sauhard Bhatt

What Cloud Engineers Actually Need to Know About AI Infrastructure

When I decided to move into AI infrastructure, nobody warned me that I had to relearn how to think about compute. I proceeded with the usual steps, such as spinning up VMs, configuring networking, and managing costs. Then came a moment I watched with some horror: I had misconfigured the inter-node networking, and an eight-node GPU cluster ran a training job at just 11% GPU utilization. It was a wake-up call for me. AI workloads aren’t just different in a marketing sense. They’re different where it counts: in the architecture, in how you build and run things.

The ML engineers on that project immediately assumed the model was the problem. They decided to redesign the model and spent a couple of days tweaking the architecture, chasing a ghost. The real issue surfaced only when someone checked the network telemetry: the cluster nodes were using standard Ethernet, not InfiniBand. The model had no issues. The infrastructure configuration was incorrect. After years of working with Azure and a period on AWS before that, I wish someone had given me a cheat sheet before starting that project.

Compute: Breaking Down the Model

Many cloud engineers assume that AI infrastructure simply requires larger VMs: add more cores and more memory, and the workload will run. This approach is insufficient. While right-sizing CPUs remains relevant, it now accounts for only about 20% of considerations. The remaining 80% is driven by GPUs, which operate fundamentally differently from CPUs and significantly impact the infrastructure.

A GPU isn’t just a faster CPU; it's a collection of thousands of smaller cores working together to handle large datasets. If any part of your system, such as storage speed, network bandwidth, or data preprocessing, can't keep up, the GPU sits idle while you keep paying for it. On Azure, idle GPUs cost as much as active ones. Usually, the main limitation in AI infrastructure isn't the GPU itself, but the upstream systems that supply data to it.
When working with Azure, you'll mostly use two main GPU families. The NC-series gives you a single A100 per VM at about $3.60 per hour on demand, making it the go-to choice for fine-tuning and inference tasks. The ND-series has eight A100s connected through NVLink and InfiniBand, which is perfect for distributed training. If your cluster uses regular Ethernet instead of InfiniBand between nodes, inter-GPU bandwidth can drop by 60 to 70 percent, and Azure may not warn you about this. It’s smart to double-check that your cluster is set up with InfiniBand before starting a multi-node run and to make sure your GPU quota is ready ahead of time.

Storage: Where Training Jobs Are Exhausted

When you’re training a language model, expect to chew through the dataset over and over; think of it as laps around a track, not a sprint. If you try to pipe 500GB of text straight from regular Azure Blob Storage, you’ll quickly find yourself staring at a progress bar that barely budges. Each blob tops out at about 60 megabytes per second, but an A100 GPU can eat data for breakfast at several gigabytes per second. There’s a massive mismatch. If you want to keep your GPUs busy (and not just waiting around), you’ll need something beefier. Azure Managed Lustre fits the bill, since it can dish out data to your training jobs at speeds regular storage can’t dream of. I’ll admit, the first time I ran into this, I wasted hours on model tweaks before realizing the bottleneck was staring me in the face the whole time.

Model checkpoints are a cost trap that is often overlooked. A single checkpoint for a 7B parameter model is around 28GB. Saving checkpoints every 30 minutes over 72 hours generates more than 4TB of data. Configure a Blob lifecycle policy before you start to avoid unexpected storage costs.

Networking: Two Problems, One Person Responsible

During training, each GPU shares gradient updates with the others in the cluster via AllReduce.
The efficiency of the cluster is directly determined by the bandwidth and latency of this communication. If this communication is disrupted, GPU utilization drops. Machine Learning teams often attribute this to model architecture issues, such as an excessive number of parameters or an incorrect batch size, but the network is usually the cause. First, assess network performance and address any issues before the job runs to avoid unnecessary model design, as ML engineers may not consider this when monitoring loss curves. The second networking problem is well known among cloud engineers. Many enterprise clients in financial services and healthcare require AI services that avoid the public internet. Azure AI services, such as Azure OpenAI, Azure ML, and Azure AI Search, all support Private Link, and the configuration process is identical to that of other PaaS services. The key consideration is to integrate private endpoint DNS zones with existing private DNS or manage them manually. ML engineers may interpret a generic “connection refused” error caused by an incorrect DNS configuration as an API issue. Both inter-GPU bandwidth and private network isolation — critical infrastructure concerns — typically fall under the same person’s responsibility. The Azure AI Services Stack: Known Infrastructure, Unknown Branding Recent Azure services such as OpenAI Service, Machine Learning, and AKS with GPU node pools might sound new, but for most infrastructure teams, the actual work remains familiar. The phrase “managed service” sometimes suggests that everything is taken care of, but in reality, only the AI model is managed. Everyday responsibilities like network security, permissions, cost tracking, and system monitoring still rest with your team, no matter how polished the portal looks. Azure OpenAI Service works much like other managed API endpoints, supporting private connections, role-based access, managed identities, and API Management for controlling usage rates. 
The main distinction is its use of Provisioned Throughput Units (PTUs) — these reserve GPU resources to guarantee performance. If you see HTTP 429 errors, it’s almost always a sign of resource bottlenecks rather than issues in your code, although the latter is a common assumption. Azure Machine Learning sits on top of other infrastructure stacks, such as Blob Storage, ACR, Key Vault, and compute, which you already manage. The failure mode is unique to Azure ML: the compute cluster lifecycle. Ensure clusters auto-scale to zero when idle. Unfortunately, this is not the default setting. When a bill arrives with huge costs due to a cluster running overnight because of an unset idle timeout, everyone looks to the cloud engineer first. While it’s tempting to go with Azure Container Apps for their apparent simplicity, most real-world inference workloads ultimately end up on AKS with GPU node pools. The reason? Container Apps are easy—that is, until you’re hit with cold start lag during actual user traffic and realize spinning up a GPU container on the fly just isn’t fast enough to meet your SLA. With AKS, you get far more say over things like keeping node pools warm, tuning autoscaling, and controlling scheduling—options that simply aren’t available with Container Apps. Costs: Higher Stakes, Faster Exposure Eight GPUs on an ND-series cluster aren’t cheap — about $27 an hour adds up quickly. A few long training runs and you’re already close to $2,000, and if you’re running a batch of experiments, $20,000 can disappear before anything launches. The price tag often slips by until accounting points it out. When models underperform, it’s easy to blame the architecture, but I’ve learned to glance at GPU usage first. If you’re seeing less than 60% during distributed runs, chances are the bottleneck is in the infrastructure, not the model itself. If you want to slash costs, spot VMs can drop your bill by as much as 90%. The catch? 
Your training jobs must be able to handle abrupt interruptions—so regular checkpointing and clean restarts are a must. If that’s not in place, spot isn’t the way to go—sort it out with your ML team before finance starts asking questions. Reserving GPU resources is a whole different equation than CPUs: GPU supply changes from region to region, and with how quickly AI hardware evolves, locking in a three-year reservation on today’s gear is a real gamble. Security: Same Toolkit, New Attack Surface For AI projects, you still need the basics like private networks, Managed Identity, strong RBAC, and encryption. But now there’s a twist: prompt injection. It’s like the old trick with SQL injection, but for language models. Someone might simply ask a chatbot to show its system prompt. If you haven’t set up protections, it could actually answer. Firewalls won’t help here. Azure Content Safety can block some of these risky requests, but most teams don’t use it until after trouble starts. If you’re in a regulated industry, logging every inference is a must. In finance or healthcare, you need to record inputs, outputs, who did what, and when, so auditors have all the details they need. Decide on your schema and retention policy before going live, because adding it later, after compliance comes calling, is always a headache. The ML engineers on these teams know the models well. But when infrastructure acts up, causing higher costs, slowdowns, or new risks, they're often the last to spot the cause. Closing that gap is the real challenge. For cloud engineers, "architecturally different" isn’t a red flag; it’s a chance to improve.
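The cost figures above are easy to sanity-check with a few lines of arithmetic. The sketch below is only an illustration: the $27 hourly rate, the 90% spot discount, and the 60% utilization threshold come from the text, while the 74-hour run length and the function names are assumptions made up for the example. Real prices vary by region and over time.

```python
def training_cost(hours: float, hourly_rate: float = 27.0, spot_discount: float = 0.0) -> float:
    """Estimated cost in dollars of running one GPU cluster for `hours`."""
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    return hours * hourly_rate * (1.0 - spot_discount)


def likely_infra_bottleneck(gpu_utilization: float, threshold: float = 0.60) -> bool:
    """Flag distributed runs whose GPU utilization falls below the rule-of-thumb threshold."""
    return gpu_utilization < threshold


# Hypothetical example: 74 hours of training across a few long runs.
on_demand = training_cost(74)                  # on-demand pricing
spot = training_cost(74, spot_discount=0.9)    # best-case spot pricing
print(f"on-demand: ${on_demand:,.2f}, spot: ${spot:,.2f}")
print("check the infrastructure first:", likely_infra_bottleneck(0.45))
```

The spot figure is a best case; it only holds if the job checkpoints regularly and restarts cleanly after an eviction.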

By Naveen Kalapala

A Tool Is Not a Platform (And Your Team Knows the Difference)

Most infrastructure teams have a moment where someone says “we should build a platform.” The motivation is real: teams are duplicating work, the current setup is hard to use consistently, and a more structured approach would help. A few months later, the platform is a Terraform module collection, a GitLab CI template, a shared repository of scripts, and a README that several people have tried to keep current.

That is a useful thing. It is not a platform. The distinction is worth being clear about, not to dismiss the work, but because the word “platform” creates expectations. When internal teams hear “we have a platform,” they assume stability, a usable interface, a versioning model, and some mechanism for raising problems when things break. A toolchain with documentation does not deliver those things by default.

What Makes Something a Platform

A platform is defined by its contract, not its technology. The contract describes what the consumer can expect: what they call, what parameters they provide, what outputs they receive, and what stability guarantees apply to that interface. A Terraform module with a published interface is closer to a platform primitive than a pipeline that provisions the same resources through environment variables, undocumented flags, and positional arguments. The module has a contract. The pipeline has a process.

The contract does not have to be formal. It needs three things.

A stable surface. Consumers should be able to call the same interface next month and receive the same type of result. Internal changes to how it works do not break consumers.
A versioning model. When the interface changes, that change is communicated, and consumers are not silently broken. A git tag is enough to start with. Semantic versioning is better.
A feedback path. Consumers can report when the contract is violated or the interface does not behave as documented. Someone is responsible for responding.
A Terraform module with these three properties is a platform primitive. A set of modules with a shared versioning model, a stable registry entry, and a team responsible for maintaining the contract is starting to look like a platform. What Teams Actually Experience The gap between a toolchain and a platform shows up in how teams actually use it. With a toolchain, onboarding a new team means pointing them at the repository and telling them to read the README. Anything not in the README requires asking someone who has been around for a while. Changes to the toolchain break existing consumers silently because there is no versioning model. The team that maintains the toolchain treats every consumer as having kept up with the latest state of the repository. With a platform, onboarding means pointing teams at interface documentation with a working example. Changes go through a version increment. Consuming teams that pin to a version are not broken by changes they did not ask for. Plain Text # Consuming a module with a pinned version module "vm" { source = "registry.example.com/hybridops/vm/proxmox" version = "~> 2.1" name = "web-01" cores = 2 memory = 4096 } This looks like a small detail. For teams consuming infrastructure modules across a growing estate, it is the difference between a managed dependency and a shared folder everyone is afraid to touch. When a Toolchain Is the Right Call Not every infrastructure system needs to be a platform. A toolchain is appropriate when the team is small and holds the full mental model, the surface area is limited, and the rate of change is low enough that everyone stays current without a formal versioning model. When those conditions hold, the overhead of maintaining a platform contract is not justified. The problem is not having a toolchain. The problem is calling it a platform when it is not, and then finding that the expectations it created are not being met. 
Teams told they have a stable platform, then hit with a broken workflow from an unannounced change, lose confidence quickly. That confidence is hard to rebuild. HybridOps has been working in this space: publishing Terraform modules to a registry, versioning releases, and treating module interfaces as contracts. It is not a finished platform. It is a direction, and being explicit about that direction changes how the work gets done. A Simple Test If a consuming team pins to the current version of your toolchain today, will it still work in three months without any changes on their side? If you cannot answer yes with confidence, you have a toolchain, not a platform. Both are useful. Only one creates the kind of trust that makes a growing engineering organisation move faster rather than slower. Knowing which one you have is the first step toward building the right one.
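To make the version-pinning idea concrete, here is a minimal sketch of how a pessimistic constraint such as `~> 2.1` (the one used in the module example earlier) decides which releases a consumer will accept. It follows Terraform's documented rule that only the rightmost listed component may increase. The function name is hypothetical, and the sketch ignores pre-release suffixes and combined constraints, so treat it as an illustration rather than a reimplementation of Terraform's resolver.

```python
def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Return True if `version` satisfies a '~> X.Y' or '~> X.Y.Z' constraint."""
    base = [int(p) for p in constraint.replace("~>", "", 1).strip().split(".")]
    candidate = [int(p) for p in version.strip().split(".")]
    if len(base) < 2:
        raise ValueError("this sketch expects at least MAJOR.MINOR in the constraint")

    # Upper bound: bump the second-to-last component and drop the last one,
    # so '~> 2.1' means >= 2.1 and < 3, and '~> 2.1.4' means >= 2.1.4 and < 2.2.
    upper = base[:-2] + [base[-2] + 1]

    width = max(len(base), len(candidate))

    def pad(parts):
        return parts + [0] * (width - len(parts))

    return pad(base) <= pad(candidate) < pad(upper)


for release in ["2.0.9", "2.1.0", "2.9.3", "3.0.0"]:
    print(release, satisfies_pessimistic(release, "~> 2.1"))
```

A consumer pinned this way picks up compatible 2.x releases automatically and is shielded from a breaking 3.0, which is the behavior the three-month test relies on.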

By Jeleel Muibi

No VIP? No Problem: Pacemaker-Based SAP HANA High Availability Using a Load Balancer Health Check

High availability is a non-negotiable requirement for mission-critical SAP HANA deployments. When a primary database node goes down without an automated failover in place, the business impact is immediate. RHEL Pacemaker has long been the standard cluster manager for SAP HANA High Availability (HA) on Linux; it detects failures, fences misbehaving nodes, promotes secondaries, and orchestrates the full recovery sequence without manual intervention.

The standard Pacemaker playbook for SAP HANA HA, as described in the official documentation, relies on a virtual IP address (VIP) as the single stable network endpoint for all database traffic. Pacemaker keeps that VIP tied to whichever node is currently the active primary. When a failover happens, the VIP moves. Applications reconnect to the same address and reach the new primary without configuration changes.

The problem is that this approach breaks down on many cloud platforms. Hyperscalers and private cloud environments frequently do not support traditional floating VIPs in the way bare-metal or on-premises networking does. The official RHEL Pacemaker documentation covers the VIP setup in detail and stops there. When VIPs are not available, practitioners are left to work out an alternative on their own.

This article defines a production-ready alternative for exactly this scenario. The approach replaces the floating VIP with a network load balancer (NLB) and uses a Pacemaker-managed health check listener to tell the load balancer which node is the active primary at any given time. This article explains the problem, positions it against existing cloud provider approaches, and walks through the implementation step by step.

How Cloud Providers Address This

The challenge of replacing a floating VIP with a load balancer while still routing traffic exclusively to the active HANA primary is not new. Cloud providers have published guidance on how to approach it, and the core pattern is consistent across all of them.
One such approach is to use an internal passthrough Network Load Balancer alongside a socat-based health check listener managed as a Pacemaker resource. The listener opens on a dedicated port in the private range (49152–65535), and the NLB probes that port to determine which backend is the primary. The approach uses the Open Cluster Framework(OCF) 'anything' resource agent to manage the socat process inside Pacemaker. The second approach is to use an Internal Load Balancer with a health probe on port 625XX (where XX is the HANA instance number). A listener on each HANA node responds to the probe, but only the primary has the listener active. In some configurations, HAProxy is used rather than socat as the listener. The implementation discussed in this article adds to this landscape a clean approach using a native systemd service registered directly as a Pacemaker resource instead of the OCF 'anything' agent or HAProxy, and it targets RHEL specifically. The systemd approach keeps the setup self-contained, auditable, and consistent with how most RHEL administrators already manage services. It works on any cloud provider or private cloud environment that supports network load balancers. Architecture Overview The diagram below shows the two-node SAP HANA cluster, the network load balancer, and how the health check listener connects them. The NLB's backend pool includes both HANA nodes on the standard HANA port (3XX15), but the health probe targets a separate port, 62500, that only the active primary exposes. Overall cluster architecture The NLB sees both nodes as members of its backend pool. Because only the primary node has anything listening on port 62500, the NLB marks the secondary as unhealthy for routing purposes and sends all traffic to the primary. When Pacemaker promotes the secondary during a failover, it starts the listener on the new primary as part of the same orchestration sequence. 
The NLB detects the change on its next health check cycle and shifts all traffic accordingly. Failover Sequence The diagram below shows the sequence of events from the moment the primary node fails to the moment applications reconnect through the load balancer. Failover sequence from node failure to reconnection Two timing factors govern the total recovery window. The first is Pacemaker's fencing and promotion sequence, typically 30 to 90 seconds, depending on the STONITH method and HANA replication state. The second is the NLB health check interval, which determines how quickly the load balancer detects the new primary after Pacemaker completes its promotion. For production environments, tuning both values together is worth the effort Pacemaker Resource Model The diagram below maps the Pacemaker resource hierarchy and constraints used in this setup. Understanding the resource model helps clarify why both the colocation and ordering constraints are necessary. The colocation constraint (score=INFINITY) tells Pacemaker that lb_healthcheck must always run on the same node as the promoted HANA primary. If the promoted primary moves, the health check listener moves with it. The ordering constraint ensures the listener does not start until HANA has fully completed its promotion, preventing the load balancer from routing traffic to a node that is still finishing its takeover sequence. 
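The two timing factors described above can be combined into a rough upper-bound estimate of the recovery window. This is a back-of-the-envelope sketch: the 30 to 90 second promotion range comes from the text, while the 5-second probe interval and the `healthy_threshold` parameter (how many consecutive successful probes the load balancer requires before routing traffic) are assumptions that depend on the specific NLB.

```python
def recovery_window(promotion_s: float, probe_interval_s: float, healthy_threshold: int = 2) -> float:
    """Upper-bound estimate in seconds from primary failure to traffic reaching the new primary."""
    if healthy_threshold < 1:
        raise ValueError("healthy_threshold must be at least 1")
    # Worst case, the first probe lands a full interval after the listener starts,
    # and the NLB waits for `healthy_threshold` consecutive successes.
    detection_s = probe_interval_s * healthy_threshold
    return promotion_s + detection_s


best_case = recovery_window(promotion_s=30, probe_interval_s=5)
worst_case = recovery_window(promotion_s=90, probe_interval_s=5)
print(f"estimated recovery window: {best_case:.0f}s to {worst_case:.0f}s")
```

Under these assumptions the Pacemaker side dominates the total, so shortening the probe interval helps only at the margin.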
Prerequisites

The following must be in place before starting the implementation:

Two RHEL virtual servers with access to the Red Hat High Availability Add-On repository
SAP HANA installed on both servers with HANA System Replication configured
Pacemaker installed and configured through section 5.7 of the official Red Hat SAP HANA HA guide; sections 5.8 and 5.9 (virtual IP configuration) are intentionally skipped
A network load balancer provisioned with both HANA nodes in the backend pool, backend port set to 3XX15 (where XX is the HANA instance number)
socat installed on both HANA nodes
Firewall rules permitting TCP traffic on port 62500 from the NLB health check source addresses

socat is available in standard RHEL repositories. Install it with:

sudo dnf install socat -y

Step-by-Step Implementation

Step 1: Create the Systemd Health Check Service

Run the following command on both HANA nodes. It creates a systemd unit file that uses socat to open a TCP listener on port 62500. The listener accepts any connection and returns success immediately; that response is all the load balancer needs.

Shell
cat <<EOF > /etc/systemd/system/lb-healthcheck.service
[Unit]
Description=LB healthcheck listener for active SAP HANA primary
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/bin/socat TCP4-LISTEN:62500,reuseaddr,fork EXEC:/bin/true
Restart=always
RestartSec=2

[Install]
WantedBy=multi-user.target
EOF

Do not enable this service manually. Pacemaker will control its lifecycle entirely.

Step 2: Reload Systemd

After writing the unit file, reload systemd on both nodes so it registers the new service:

Shell
systemctl daemon-reload

Step 3: Prevent the Service From Starting Automatically

Explicitly disable and stop the service. If both nodes have the listener running simultaneously, the load balancer will consider both healthy and will route traffic to either node, which defeats the entire purpose of the setup.
Shell systemctl disable lb-healthcheck systemctl stop lb-healthcheck Step 4: Create the Pacemaker Resource Register the systemd service as a Pacemaker-managed resource. From this point forward, Pacemaker owns the start, stop, and monitoring of the listener. Shell pcs resource create lb_healthcheck \ systemd:lb-healthcheck \ op monitor interval=10s timeout=20s Pacemaker will now monitor the listener every 10 seconds and automatically relocate it during failover events. Step 5: Add the Colocation Constraint This is the constraint that enforces the listener always runs on the same node as the promoted SAP HANA primary. Without it, Pacemaker might place the resource on either node. Shell pcs constraint colocation add lb_healthcheck \ with Promoted cln_SAPHanaCon_P01_HDB01 \ score=INFINITY Replace P01_HDB01 with the actual SID and instance number for the environment. For example: if SID is PRD and instance number is 00, use PRD_HDB00 Step 6: Add the Ordering Constraint The ordering constraint prevents the health check listener from starting until after the HANA promotion is fully complete. Without this, a race condition could cause the load balancer to route traffic to a node that is still mid-promotion. Shell pcs constraint order promote cln_SAPHanaCon_P01_HDB01 \ then start lb_healthcheck Step 7: Validate the Pacemaker Configuration Verify that both constraints are correctly registered in the cluster: Shell pcs constraint config The output should contain both of the following entries: Plain Text Colocation Constraints: Started resource 'lb_healthcheck' with Promoted resource 'cln_SAPHanaCon_P01_HDB01' score=INFINITY Order Constraints: promote resource 'cln_SAPHanaCon_P01_HDB01' then start resource 'lb_healthcheck' Step 8: Verify Listener Placement Confirm that only the active primary node is listening on port 62500. Run this command on each node: Shell ss -lntp | grep 62500 On the primary node, the output should show a LISTEN entry on 0.0.0.0:62500. 
On the secondary node, the command should return nothing. Plain Text # Expected on PRIMARY node: LISTEN 0 5 0.0.0.0:62500 0.0.0.0:* # Expected on SECONDARY node: # (no output) If both nodes show the listener, the colocation constraint is either missing or incorrect. If neither node shows it, check that the HANA clone resource is in the Promoted state with: pcs status Comparison: VIP Approach vs. NLB Health Check Approach The diagram below summarizes the trade-offs between the traditional VIP approach and the NLB health check approach described in this article. Comparison The VIP approach cuts over faster because there is no dependency on an external health check interval. The IP simply moves to the new primary node. It requires the underlying network to support IP address mobility, which cloud environments typically do not. The NLB approach works across any cloud or private cloud environment that supports network load balancers. The trade-off is that traffic cutover depends on the NLB's health check interval in addition to Pacemaker's promotion time. The cloud documentation on major cloud providers acknowledges this trade-off explicitly: using an NLB with a health check listener is their recommended approach for all SAP HANA HA deployments, and they provide the same socat-based pattern using the OCF 'anything' resource agent. The approach documented here achieves the same outcome using a systemd service, which many operators find more familiar and easier to audit. Operational Notes and Tuning A few things are worth keeping in mind when running this setup in production. NLB health check interval: The faster the health check interval, the shorter the window between Pacemaker completing its promotion and the NLB redirecting traffic. A 5-second interval is common in Cloud SAP HA documentation. Setting this too low can cause false positives during normal HANA replication lag. 
STONITH configuration: This solution assumes STONITH (fencing) is configured as part of the base Pacemaker setup. Without STONITH, Pacemaker will not promote the secondary during a primary failure. STONITH ensures the failed node is definitively powered off before promotion proceeds, preventing split-brain.

Port 62500 vs. 625XX convention: Some cloud providers use the convention 625XX (where XX is the instance number) for their SAP HANA health check ports, while others recommend any port in the private range 49152 to 65535. Port 62500, used in this setup, falls within that range and does not conflict with standard HANA ports. Teams following the 625XX convention can substitute it if they prefer consistency across environments.

Testing failover: After setup, the full failover sequence should be tested by killing the primary HANA process (not the OS) and verifying the NLB redirects traffic to the new primary within the expected time window. The pcs status command is the primary tool for watching the Pacemaker side of the transition.

Conclusion

The standard RHEL Pacemaker documentation for SAP HANA HA assumes a virtual IP is available. Not all hyperscalers support a floating VIP. This solution fills that gap cleanly: replace the VIP with a network load balancer hostname, and use a Pacemaker-managed socat listener to tell the load balancer which node is the primary at any given time.

The core pattern, an NLB health probe targeting a Pacemaker-owned listener, is the same one major cloud providers use in their own SAP HA documentation. What this implementation adds is a clean systemd service approach for RHEL, without needing the OCF 'anything' resource agent or additional proxy software.

The eight steps come down to a handful of actions: write a systemd service, reload systemd, keep the service from auto-starting, register it as a Pacemaker resource, add two constraints, and verify the result.
The constraints — one for colocation, one for ordering — are what tie the listener's lifecycle to the HANA primary promotion sequence and make the whole thing work reliably across failovers. For teams running SAP HANA on RHEL in environments where VIPs are not an option, this is a production-ready path forward that relies entirely on standard RHEL tooling.
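When testing failover, it can help to check the listener from another host in the same way the load balancer does, by attempting a plain TCP connection to the health check port. The sketch below is one way to do that. The function names are hypothetical, and it only confirms that something accepts connections on the port, not that HANA itself is healthy.

```python
import socket


def probe(host: str, port: int = 62500, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, as an NLB health probe would see it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def nodes_answering(nodes, port: int = 62500):
    """Return the nodes that currently answer the health check port."""
    return [node for node in nodes if probe(node, port)]
```

Calling `nodes_answering(["hana-node-1", "hana-node-2"])` with your own hostnames (the two shown are placeholders) should return exactly one node in a healthy cluster. Two entries point to a missing colocation constraint; an empty list suggests the HANA clone resource is not yet in the Promoted state.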

By Vidyasagar (Sarath Chandra) Machupalli FBCS

CORE

Implementing Asynchronous Communication Between Microservices Using Kafka and Spring Boot

In a microservices system, tight synchronous coupling turns a small hiccup into a cascading slowdown. Thread pools fill, retries amplify traffic, and suddenly your simple request is blocked on half the fleet. My executive summary: asynchronous messaging with Kafka helps systems keep moving when individual components inevitably slow down or fail. It does this by decoupling producers from consumers, absorbing traffic spikes, and allowing services to evolve without tying their availability directly to one another.

Code Patterns in Spring Boot With Kafka

Spring for Apache Kafka gives me two primitives that feel pleasantly old Spring: KafkaTemplate for sending and @KafkaListener for receiving. That template/listener model is intentionally similar to other Spring integration tech, which keeps application code focused on domain logic instead of raw client plumbing. Below is a compact (but production-shaped) pattern: externalized config via @ConfigurationProperties, a service port for publishing, a REST command endpoint, a consumer with a real error strategy (DLT), and a REST error advice.
Java // === Messaging config (externalized, type-safe) === @ConfigurationProperties(prefix = "messaging.orders") @Validated record OrdersMessagingProps( @NotBlank String topic, @NotBlank String dltTopic ) {} // === DTO (event contract) === public record OrderCreatedEvent(UUID orderId, UUID userId, BigDecimal total, Instant createdAt) {} // === Service port (keeps domain testable, Kafka swappable) === public interface OrderEventPublisher { void publishOrderCreated(OrderCreatedEvent event); } // === Adapter: Kafka producer === @Component class KafkaOrderEventPublisher implements OrderEventPublisher { private final KafkaTemplate<String, OrderCreatedEvent> template; private final OrdersMessagingProps props; KafkaOrderEventPublisher(KafkaTemplate<String, OrderCreatedEvent> template, OrdersMessagingProps props) { this.template = template; this.props = props; } @Override public void publishOrderCreated(OrderCreatedEvent event) { // Keying by orderId keeps per-order ordering and drives partitioning decisions. template.send(props.topic(), event.orderId().toString(), event); } } // === REST command API (synchronous edge, async core) === @RestController @RequestMapping("/v1/orders") class OrdersController { private final OrderService orderService; // domain port OrdersController(OrderService orderService) { this.orderService = orderService; } @PostMapping public ResponseEntity<Map<String, Object>> create(@Valid @RequestBody CreateOrderRequest req) { UUID orderId = orderService.create(req.userId(), req.total()); // persists + publishes event return ResponseEntity.accepted().body(Map.of("orderId", orderId, "status", "ACCEPTED")); } record CreateOrderRequest(@NotNull UUID userId, @NotNull @Positive BigDecimal total) {} } // === Domain service port (implementation can use outbox, transactions, etc.) 
=== public interface OrderService { UUID create(UUID userId, BigDecimal total); } // === Consumer: downstream service reacts to events === @Component class BillingListener { @KafkaListener(topics = "${messaging.orders.topic}", groupId = "${spring.kafka.consumer.group-id}") void onOrderCreated(OrderCreatedEvent event) { // Idempotency belongs here: process-by-key + store processed eventId/orderId to avoid duplicates. // Do work (charge card, create invoice, etc.) } } // === Kafka consumer error handling: retries + DLT === @Configuration class KafkaErrorHandlingConfig { @Bean DefaultErrorHandler defaultErrorHandler(KafkaTemplate<Object, Object> template, OrdersMessagingProps props) { var recoverer = new DeadLetterPublishingRecoverer(template, (rec, ex) -> new TopicPartition(props.dltTopic(), rec.partition())); // Backoff and retry policy are configurable; keep it finite to avoid poison-pill loops. return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3)); } } // === REST error handling (ProblemDetail) === @RestControllerAdvice class ApiErrors { @ExceptionHandler(IllegalArgumentException.class) @ResponseStatus(HttpStatus.BAD_REQUEST) ProblemDetail badRequest(IllegalArgumentException ex) { var pd = ProblemDetail.forStatusAndDetail(HttpStatus.BAD_REQUEST, ex.getMessage()); pd.setTitle("Invalid request"); return pd; } } A few been-burned-before notes on the code above. Spring Kafka’s reference docs are explicit that KafkaTemplate is the convenience wrapper for producing, and DefaultErrorHandler + DeadLetterPublishingRecoverer is a first-class way to route failed records to dead-letter topics after retries. If we want non-blocking retries, Spring Kafka also provides @RetryableTopic, which orchestrates retry topics and a DLT automatically useful when transient failures are common and you want predictable retry delay semantics. 
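The blocking retry-then-dead-letter behavior configured above is easier to reason about with a small simulation. The Python sketch below is not Spring Kafka's implementation; it only mirrors the semantics of FixedBackOff(1000L, 3) as I understand them, where 3 is the number of retries after the first failure, so a poison record is attempted four times before it is routed to the dead-letter topic. All names here are hypothetical.

```python
import time
from typing import Any, Callable, List


def consume_with_retries(
    record: Any,
    handler: Callable[[Any], None],
    dead_letters: List[Any],
    max_retries: int = 3,
    backoff_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run handler on record with a finite retry budget; return the number of attempts made."""
    attempts = 0
    while True:
        attempts += 1
        try:
            handler(record)
            return attempts
        except Exception:
            if attempts > max_retries:
                # Retry budget exhausted: park the record instead of looping forever.
                dead_letters.append(record)
                return attempts
            sleep(backoff_s)


def always_fails(_record: Any) -> None:
    raise RuntimeError("simulated poison pill")


dlt: List[Any] = []
attempts = consume_with_retries({"orderId": "o-1"}, always_fails, dlt, sleep=lambda s: None)
print(attempts, dlt)
```

Note that the consumer is blocked for the whole retry sequence in this model, which is the cost that non-blocking retry topics are designed to avoid.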
Containers and Local Dev With Docker Compose When I’m chasing down event flow bugs, I like local environments that feel like the old days: one command, deterministic startup order, and no mystery dependencies. Docker Compose is still the quickest way to stand up Kafka alongside your services, and Confluent publishes straightforward Docker-based tutorials and compose examples for running Kafka locally. For the service image itself, multi-stage builds are the modern classic compile in a builder stage, and copy the artifact into a slimmer runtime stage. Docker documents multi-stage builds as a way to reduce the final image contents and keep build dependencies out of production. Dockerfile # Multi-stage Dockerfile for a Spring Boot service (orders-service) FROM eclipse-temurin:21-jdk AS build WORKDIR /workspace COPY mvnw pom.xml ./ COPY .mvn .mvn RUN ./mvnw -q -DskipTests dependency:go-offline COPY src src RUN ./mvnw -q -DskipTests package FROM eclipse-temurin:21-jre WORKDIR /app COPY --from=build /workspace/target/*.jar app.jar EXPOSE 8080 ENTRYPOINT ["java","-jar","/app/app.jar"] And here’s a Compose file that wires up Kafka and Schema Registry, plus an example Spring Boot service. The exact image choices are illustrative. Your production choices are unspecified and should reflect your standards and security posture. 
YAML
# compose.yaml (local/dev)
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Each listener needs its own port: 29092 inside the Compose network, 9092 from the host.
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  schema-registry:
    image: confluentinc/cp-schema-registry:7.6.0
    depends_on: [kafka]
    ports: ["8081:8081"]
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: PLAINTEXT://kafka:29092
  orders:
    build: ./orders-service
    depends_on: [kafka]
    ports: ["8080:8080"]
    environment:
      SPRING_KAFKA_BOOTSTRAP_SERVERS: kafka:29092
      MESSAGING_ORDERS_TOPIC: orders.events
      MESSAGING_ORDERS_DLTTOPIC: orders.events.dlt
      SCHEMA_REGISTRY_URL: http://schema-registry:8081

Deploying on Kubernetes or AWS

On AWS, the Kafka decision is usually managed or self-managed. If you choose Amazon MSK, the cluster lives in your VPC: you pick subnets across distinct Availability Zones and connect clients using the cluster’s bootstrap brokers. That’s the networking baseline, and it’s not optional. MSK is VPC-first by design.

For authentication/authorization, MSK supports IAM access control. AWS documents the client configuration for IAM mechanisms. In EKS, I typically pair MSK IAM with IRSA so pods can obtain AWS credentials the AWS way, while ECS services would use task roles instead. Both patterns are documented by AWS; which one applies depends on where your workloads run.

Kubernetes service discovery is usually the easy part. Services and Pods get DNS names so workloads can call each other by name rather than IP. Kafka itself is reached via bootstrap broker endpoints or via internal Services, but either way, you want the strings in externalized config, not hardcoded.
Here’s a minimal Kubernetes Deployment/Service for a Kafka client service. Values like region, account IDs, and MSK endpoints are unspecified placeholders. YAML apiVersion: apps/v1 kind: Deployment metadata: name: orders namespace: apps spec: replicas: 2 selector: matchLabels: { app: orders } template: metadata: labels: { app: orders } spec: serviceAccountName: orders-sa # IRSA-bound (role ARN unspecified) containers: - name: orders image: <UNSPECIFIED_AWS_ACCOUNT_ID>.dkr.ecr.<UNSPECIFIED_REGION>.amazonaws.com/orders:<TAG> ports: [{ containerPort: 8080 }] env: - name: SPRING_KAFKA_BOOTSTRAP_SERVERS value: "<UNSPECIFIED_MSK_BOOTSTRAP_BROKERS>" - name: MESSAGING_ORDERS_TOPIC value: "orders.events" - name: MESSAGING_ORDERS_DLTTOPIC value: "orders.events.dlt" readinessProbe: httpGet: { path: /actuator/health/readiness, port: 8080 } initialDelaySeconds: 10 --- apiVersion: v1 kind: Service metadata: name: orders namespace: apps spec: selector: { app: orders } ports: - port: 80 targetPort: 8080 Operationally, MSK exposes metrics into CloudWatch (AWS/Kafka), and broker logs can be delivered to CloudWatch Logs (or S3/Firehose). That combination gives you the classic visibility loop: throughput, lag, under-replicated partitions, and error logs without running your own monitoring plane. For distributed tracing in async flows, OpenTelemetry is my default vocabulary now. Spring Boot supports OpenTelemetry export via OTLP, and OpenTelemetry defines Kafka semantic conventions so your producer/consumer spans and attributes stay consistent across tools. CI/CD and the Hard-Earned Field Notes For CI/CD, I keep it boring: build once, push an immutable image, deploy via a declarative mechanism. AWS Prescriptive Guidance provides a clear GitHub Actions pattern for building Docker images and pushing to Amazon ECR, which is a solid baseline when your region/account is unspecified until configured. 
YAML
# .github/workflows/orders.yml
name: orders
on:
  push:
    branches: ["main"]
jobs:
  build_push_deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    env:
      # Job-level so every step sees the same tag; shell variables do not carry across steps.
      IMAGE: <UNSPECIFIED_AWS_ACCOUNT_ID>.dkr.ecr.<UNSPECIFIED_REGION>.amazonaws.com/orders:${{ github.sha }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: "21"
      - name: Build & test
        run: ./mvnw -q test package
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::<UNSPECIFIED_AWS_ACCOUNT_ID>:role/<UNSPECIFIED_GHA_ROLE>
          aws-region: <UNSPECIFIED_REGION>
      - name: Login to ECR
        run: |
          aws ecr get-login-password --region <UNSPECIFIED_REGION> \
            | docker login --username AWS --password-stdin <UNSPECIFIED_AWS_ACCOUNT_ID>.dkr.ecr.<UNSPECIFIED_REGION>.amazonaws.com
      - name: Build & push image
        run: |
          docker build -t "$IMAGE" ./orders-service
          docker push "$IMAGE"
      - name: Deploy to EKS (example)
        run: |
          aws eks update-kubeconfig --name <UNSPECIFIED_EKS_CLUSTER> --region <UNSPECIFIED_REGION>
          kubectl -n apps set image deploy/orders orders="$IMAGE"

Now, the part I wish someone had handed me in 2016: Kafka gives you strong tools, but it does not remove distributed-systems truths. You still need safeguards on the consumer side: idempotent processing, disciplined schema management, and clearly defined retry and dead-letter topic behavior.

Kafka’s documentation is careful about the limits of “exactly once” guarantees. Idempotent producers and transactions can strengthen delivery semantics, but achieving true end-to-end exactly-once behavior, especially when external side effects are involved, still depends on deliberate system design.

For schema governance, Kafka itself doesn’t ship a schema registry, but acknowledges third-party registries; in practice, Confluent Schema Registry and Apicurio Registry are common choices.
Both store schemas out-of-band, so messages carry only a schema identifier, and both support evolvable contracts across Avro/JSON Schema/Protobuf depending on your ecosystem. Conclusion and Best Practices If you take one lesson from my legacy brain into modern event-driven systems, let it be this: asynchrony is a reliability feature, not a performance trick. Kafka’s durable log and consumer group model decouples uptime and absorbs spikes, but you only get the real benefit when you treat schemas as contracts, consumers as idempotent processors, and failure handling as first-class application behavior. On AWS, the operational baseline is non-negotiable. MSK lives in your VPC across AZ subnets, clients connect via bootstrap brokers, IAM auth is configured explicitly, and observability lives in CloudWatch. Do those fundamentals early, and Kafka stops feeling like a mysterious black box and starts feeling like the dependable workhorse it was built to be.
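To make the "consumers as idempotent processors" point concrete, here is a minimal sketch in plain Python with no Kafka client involved. The event shape, the `event_id` field, and the in-memory set standing in for a durable deduplication store are all assumptions for illustration; a real service would more likely record processed IDs in a database, in the same transaction as the side effect.

```python
from dataclasses import dataclass, field


@dataclass
class OrderEvent:
    event_id: str       # hypothetical unique ID assigned by the producer
    order_id: str
    amount_cents: int


@dataclass
class IdempotentProcessor:
    # In-memory stand-ins for durable storage (assumption for this sketch)
    processed_ids: set = field(default_factory=set)
    totals: dict = field(default_factory=dict)

    def handle(self, event: OrderEvent) -> bool:
        """Apply the event at most once; return False for a duplicate delivery."""
        if event.event_id in self.processed_ids:
            return False
        self.totals[event.order_id] = self.totals.get(event.order_id, 0) + event.amount_cents
        self.processed_ids.add(event.event_id)
        return True


processor = IdempotentProcessor()
events = [
    OrderEvent("evt-1", "order-42", 1500),
    OrderEvent("evt-2", "order-42", 500),
    OrderEvent("evt-1", "order-42", 1500),  # simulated redelivery after a retry
]
results = [processor.handle(e) for e in events]
print(results, processor.totals)
```

The redelivered event is recognized by its ID and skipped, so the running total reflects each event once no matter how many times the broker delivers it.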

By Mallikharjuna Manepalli

I Built a VS Code Extension to Debug Azure AI Foundry Agents Without Leaving My Editor

The Problem

Azure AI Foundry has a genuinely great portal. You can see your agent runs, the tools it calls, the messages it sends and receives, and even a breakdown of token usage, all in a clean UI. But here's what actually happens when you're building an agent locally:

1. Write some code, trigger a run
2. Switch to the browser, open the Foundry portal
3. Navigate to your project → your agent → Traces tab
4. Find the right run
5. Click through to see what happened
6. Switch back to VS Code to make a fix
7. Repeat

That context switch sounds minor. But when you're iterating fast (tweaking a system prompt, adjusting tool call logic, debugging why an agent handed off to the wrong sub-agent), it adds up. You're constantly pulling your attention out of your editor, into the browser, and back again. What I wanted was simple: see the trace right where I'm working.

What Foundry Trace Inspector Does

The extension connects to your Azure AI Foundry project and gives you three views for every agent run, all inside a VS Code panel.

Trajectories: The Full Span Tree

A Gantt-style collapsible tree showing the full execution: Session → Invoke Agent → Chat turns → Tool calls. Every span shows duration, token counts, and cost. Click any span to open a detail drawer with the model, status, token breakdown, and raw input/output.

- Duration: per-span timing bars, so you can see exactly how long each step took.
- Tokens: input vs. output token breakdown per span.

This is the view I use most during debugging. At a glance, I can see: Did the tool call happen? How long did it take? What did the LLM actually receive as input?

User View: Readable Conversation Replay

A chat-bubble timeline of the full conversation: user messages and assistant replies rendered the way a human reads them, with the agent name and model on each assistant turn.
Each assistant bubble has a "View Trace" button that jumps directly to the corresponding response in the sidebar, so you can go from "something looked off in this reply" to the raw span in one click.

Token and Cost Chart

A stacked bar chart (input vs. output tokens per LLM turn) lets you instantly spot which turns are burning the most tokens, which is useful when you're trying to understand why a multi-turn conversation is getting expensive. It also shows a per-span cost breakdown for the input and output tokens consumed.

How It Works Under the Hood

Azure AI Foundry agents use the OpenAI Responses API internally. Every agent reply produces a resp_... response ID that's visible in the Foundry portal's Traces tab. The extension fetches those responses directly via the same API and reconstructs the full conversation timeline locally.

When a session spans multiple turns, each response links to the previous one via previous_response_id. Load any response in the chain and the extension walks the chain automatically, so you don't need to manually track down every ID. Conversation IDs (conv_...) are discovered automatically from your saved responses, so once you track one response, the whole conversation surfaces.

There is no intermediate server. The extension makes API calls only to the Azure endpoint you configure. Your API key is stored in VS Code's encrypted SecretStorage; it never touches settings.json and is never sent anywhere other than that endpoint.

Setting It Up

You need two things:

- An Azure AI Foundry project endpoint URL (found in the Foundry portal under your project → Overview)
- Either an API key or Azure CLI auth (az login) via DefaultAzureCredential

Once configured, grab a conv_... conversation ID from the portal's Traces tab, paste it into the sidebar, and the extension fetches all responses in that conversation automatically.
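The chain walk described under "How It Works Under the Hood" can be sketched roughly as follows. This is not the extension's actual code: `walk_chain`, the `fetch` callback, and the in-memory `FAKE_STORE` are hypothetical stand-ins, and only the `id` and `previous_response_id` fields come from the description above. A real version would replace `fetch` with an authenticated call to the configured endpoint.

```python
from typing import Callable, Optional

# Hypothetical in-memory stand-in for the responses endpoint (three-turn session)
FAKE_STORE = {
    "resp_c": {"id": "resp_c", "previous_response_id": "resp_b"},
    "resp_b": {"id": "resp_b", "previous_response_id": "resp_a"},
    "resp_a": {"id": "resp_a", "previous_response_id": None},
}


def walk_chain(start_id: str, fetch: Callable[[str], dict]) -> list:
    """Follow previous_response_id links back to the first turn.

    Returns response IDs ordered oldest first. The `seen` set guards
    against a malformed chain that loops back on itself.
    """
    chain = []
    seen = set()
    current: Optional[str] = start_id
    while current is not None and current not in seen:
        seen.add(current)
        response = fetch(current)
        chain.append(response["id"])
        current = response.get("previous_response_id")
    chain.reverse()
    return chain


ordered = walk_chain("resp_c", FAKE_STORE.__getitem__)
print(ordered)
```

Starting from the newest response is enough to recover the whole session, which matches the "load any response and the rest surfaces" behavior described above, at least for the turns that precede it.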
What's Next

A few things I want to add in v0.2:

- Auto-discovery of recent runs: instead of pasting IDs manually, list recent conversations directly from the panel
- Side-by-side diff: compare two runs of the same agent to see what changed between them
- Export to Markdown: generate a readable trace report you can paste into a PR or incident note

Further Reading

- What is Foundry Agent Service?: official overview of the service this extension connects to
- Use the Azure OpenAI Responses API: the underlying API the extension fetches trace data from
- Microsoft Foundry Pricing: understand what your agents actually cost to run
- VS Code Webview API: how the timeline panel is built
- VS Code Extension API: full reference if you want to contribute or build on top of this

By Jubin Abhishek Soni

Automating Power Automate: How to Ensure Cloud Flows Are Active After Every Pipeline Deployment

You've spent hours — maybe days — building and testing a Dynamics 365 Power Platform solution. Your Azure DevOps pipeline runs clean. The managed solution imports successfully into the target environment. All green. Then the business calls. Nothing is working. The automations aren't firing. You log into Power Automate in the target environment and find the same scene every time: every single cloud flow is turned off. Not broken. Not errored. Just off. And every connection reference is sitting there unresolved, pointing at nothing, waiting for someone to manually wire it up. If your solution has 5 flows, that's annoying. If your solution has 50 or 100 flows, that's a half-day of manual work — clicking into each flow, assigning the connection, saving, turning it on, and moving to the next one. In a team doing frequent releases across multiple environments (Test, UAT, Production), this compounds quickly. It turns what should be a 10-minute deployment into an hours-long chore, introduces human error, and makes your pipeline feel like it only does half the job. This is one of the most common pain points in Power Platform DevOps, and it's almost never solved cleanly out of the box. This article explains exactly why it happens and how to fix it so that flows are on, connections are wired, and the environment is fully operational the moment the pipeline finishes. Why This Happens Understanding the root cause is important because there are actually three separate things that go wrong, and you need to address all three. 1. Flows Are Exported in Whatever State They Were In When a Power Platform solution is exported from a source environment, every cloud flow is embedded in the solution package in its current state. If a flow was turned off in the source environment at the time of export — even briefly, for testing or debugging — it ships in that state. When the managed solution is imported into the target environment, the flow arrives and stays off. 
There is no automatic activation step built into the standard import process. 2. Connection References and Actual Connections Are Different Things This is the conceptual point that trips up most teams new to Power Platform ALM. A connection is the actual authenticated link to a service — a specific Dataverse instance, an Outlook mailbox, a SharePoint site. Connections are environment-specific, created manually or via admin tools, and they live outside any solution. They should never be part of a solution package. A connection reference is a pointer. It's a solution component that says "this flow uses a connection of type X." The connection reference lives inside the solution, travels with it across environments, and is what the flow binds to at runtime. The connection reference itself has no credentials — it just points to whichever actual connection in the environment is assigned to it. The correct setup is: In the source environment (DEV): The actual connections exist and are assigned to the connection references. The solution contains only the connection references, not the connections themselves. In the target environment (Test, UAT, Production): The actual connections are pre-created by an administrator and given appropriate access. The service principal used by the pipeline to deploy the solution must have read/write access to these connections. When the solution is imported, the deployment settings file maps each connection reference in the solution to the correct pre-existing connection in that environment. If this mapping is not done correctly, flows that depend on unresolved connection references will remain in a draft state after import, regardless of any other settings. 3. The Standard Import Task Does Not Activate Flows Power Platform Build Tools for Azure DevOps includes an ActivatePlugins flag on the import task. 
Despite what the name implies, this activates Dataverse plugins and custom workflow activities only — it has no effect on Power Automate cloud flows. There is no built-in flag on the standard import task that activates cloud flows. This means that even a perfectly configured import, with all connection references resolved and all tokens substituted, will still leave flows in a deactivated state unless you add an explicit activation step. Prerequisites: What Must Be in Place Before the Pipeline Runs Before the pipeline can solve this problem end-to-end, two things must be true in every target environment. First, the actual connections must already exist. For every service your flows connect to — Dataverse, Outlook, SharePoint, Teams, or any other connector — an administrator must have already created a connection in the target environment. These connections are not part of the solution and should never be included in the solution export. They are environmental infrastructure, created once and maintained independently of deployments. Second, the service principal must have access. The Azure Active Directory app registration used as the service principal for the pipeline (the account that authenticates the import) must be granted access to read and write in the target environment. This includes having sufficient Dataverse security roles and, where applicable, being designated as an owner or co-owner of the connections so that the deployment settings file can map connection references to those connections during import. Once these are in place, the pipeline can take over the rest automatically. The Pipeline Approach The pipeline is split into two phases: a build phase that runs against the source environment and packages the solution, and a release phase that deploys the packaged solution to each target environment. Build Phase The build phase exports the managed solution from the source environment and generates a DeploymentSettings.json file. 
This file is the key to automating connection reference mapping. It is generated by the PAC CLI from the solution ZIP and contains a structured list of every connection reference and environment variable in the solution. Out of the box, the generated file has empty ConnectionId fields. The build pipeline post-processes this file by replacing those empty fields with placeholder tokens in the format @@token_name@@. For example, a connection reference with the logical name shared_commondataserviceforapps becomes @@shared_commondataserviceforapps@@ in the deployment settings file. The file is then published as part of the build artifact. The critical point is that connection reference logical names often include a random trailing suffix added by the platform (e.g., shared_commondataserviceforapps_8ca1f). The build script normalizes these by stripping the suffix, so the token is deterministic and consistent across builds. Release Phase The release pipeline picks up the build artifact for each environment stage and runs the following sequence: Step 1: Replace Tokens A token replacement task reads DeploymentSettings.json and substitutes each @@token@@ with the corresponding pipeline variable for that stage. The pipeline variables for each stage hold the actual connection IDs of the pre-existing connections in that environment. For example, the Test stage has a variable shared_commondataserviceforapps with the value of the Dataverse connection ID in the Test environment. After this step, the deployment settings file is fully resolved with no remaining placeholders. Step 2: Import Solution The Power Platform Import Solution task imports the managed solution ZIP using the resolved DeploymentSettings.json. This wires up every connection reference to its corresponding connection in the environment automatically, with no manual intervention. Step 3: Activate Flows This is the step that closes the gap. 
A PowerShell task runs after the import and queries the Dataverse API for all cloud flows in the solution that are not currently active. It then activates each one programmatically.

The Activation Script

This PowerShell script uses the Dataverse Web API, authenticated with the same service principal credentials used by the rest of the pipeline. It looks up the workflow components of the target solution, keeps only Modern Flow entities (category eq 5), and activates any that are in a stopped or draft state.

PowerShell

# Activate all Power Automate cloud flows in the solution post-import
$tenantId       = "$(TenantId)"
$clientId       = "$(CRMClientId)"
$clientSecret   = "$(CRMClientSecret)"
$environmentUrl = "$(CRMEnvironmentUrl)"
$solutionName   = "$(CRMSolutionName)"

# Obtain an OAuth token for the Dataverse API.
# The token endpoint URL is blank in the source; set it to your tenant's OAuth 2.0 token endpoint.
$tokenUrl  = ""
$tokenBody = @{
    grant_type    = "client_credentials"
    client_id     = $clientId
    client_secret = $clientSecret
    resource      = $environmentUrl
}
$tokenResponse = Invoke-RestMethod -Method Post -Uri $tokenUrl -Body $tokenBody
$token = $tokenResponse.access_token

$headers = @{
    "Authorization"    = "Bearer $token"
    "OData-MaxVersion" = "4.0"
    "OData-Version"    = "4.0"
    "Content-Type"     = "application/json"
}
$apiBase = "$environmentUrl/api/data/v9.2"

# OData filters cannot contain SQL-style subqueries, so resolve the solution ID first
$solutionUrl = "$apiBase/solutions?`$filter=uniquename eq '$solutionName'&`$select=solutionid"
$solution = (Invoke-RestMethod -Uri $solutionUrl -Headers $headers).value | Select-Object -First 1
if (-not $solution) {
    Write-Error "Solution '$solutionName' was not found in $environmentUrl."
    exit 1
}

# Collect the workflow components of the solution (componenttype 29 = Workflow)
$componentsUrl = "$apiBase/solutioncomponents?`$filter=_solutionid_value eq $($solution.solutionid) and componenttype eq 29&`$select=objectid"
$workflowIds = @((Invoke-RestMethod -Uri $componentsUrl -Headers $headers).value | ForEach-Object { $_.objectid })

# Query for cloud flows (category = 5) that are not Active (statecode != 1), then keep those in this solution
$queryUrl = "$apiBase/workflows?`$filter=category eq 5 and statecode ne 1&`$select=workflowid,name,statecode,statuscode"
$flows = @((Invoke-RestMethod -Uri $queryUrl -Headers $headers).value | Where-Object { $workflowIds -contains $_.workflowid })

if ($flows.Count -eq 0) {
    Write-Host "All flows are already active. Nothing to do."
} else {
    Write-Host "Found $($flows.Count) flow(s) to activate."
    foreach ($flow in $flows) {
        $patchUrl = "$apiBase/workflows($($flow.workflowid))"
        # statecode 1 / statuscode 2 = Activated
        $payload = @{ statecode = 1; statuscode = 2 } | ConvertTo-Json
        Invoke-RestMethod -Method Patch -Uri $patchUrl -Headers $headers -Body $payload
        Write-Host "Activated: $($flow.name)"
    }
    Write-Host "Done. $($flows.Count) flow(s) activated successfully."
}

Note on category eq 5: Power Platform stores multiple automation types in the same workflow entity. Category 0 is classic workflows, category 4 is business process flows, and category 5 is Modern Flow (Power Automate cloud flows). The filter ensures only cloud flows are touched.

Guarding Against Unresolved Connections

A common failure mode is a new connection reference being added to the solution in DEV, the build generating a new @@token@@ for it, but the corresponding pipeline variable not being added to the release stage yet. The import will succeed, but the flow that depends on that connection will remain inactive, and the activation script will fail to activate it because the connection reference is still unresolved. To catch this early, add a validation step before the import that checks for any remaining @@token@@ placeholders in the deployment settings file and fails the pipeline immediately if any are found:

PowerShell

$settingsPath = "$(System.DefaultWorkingDirectory)/$(Build.DefinitionName)/$(Build.BuildNumber)/DeploymentSettings.json"
$content = Get-Content $settingsPath -Raw

# Any @@token@@ still present means a pipeline variable is missing for this stage
$unresolved = @([regex]::Matches($content, '@@[^@]+@@') | Select-Object -ExpandProperty Value)

if ($unresolved.Count -gt 0) {
    # One combined message, so nothing is lost if the first error stops the task
    Write-Error ("Unresolved connection tokens found in DeploymentSettings.json:`n$($unresolved -join "`n")`n" +
                 "Add the missing pipeline variables for this stage and re-run.")
    exit 1
}
Write-Host "All connection tokens resolved. Proceeding with import."

Failing fast here is far better than a silent partial deployment where some flows activate and others don't.
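For readers who want to see the whole token round trip in one place, here is a small Python sketch of the three steps described above: tokenizing in the build phase, substituting per-stage variables in the release phase, and detecting leftovers. It is an illustration rather than the pipeline's real scripts. The `ConnectionReferences` / `LogicalName` / `ConnectionId` shape is what I understand the generated settings file to look like, the five-character hexadecimal suffix pattern is an assumption based on the example above, and the second connection reference and the connection ID value are made up.

```python
import json
import re

SUFFIX = re.compile(r"_[0-9a-f]{5}$")  # assumed shape of the platform-added suffix
TOKEN = re.compile(r"@@[^@]+@@")


def tokenize(settings: dict) -> dict:
    """Build phase: write a deterministic @@token@@ into every ConnectionId."""
    for ref in settings["ConnectionReferences"]:
        name = SUFFIX.sub("", ref["LogicalName"])
        ref["ConnectionId"] = f"@@{name}@@"
    return settings


def resolve(text: str, variables: dict) -> str:
    """Release phase: replace each token that has a matching stage variable."""
    return TOKEN.sub(lambda m: variables.get(m.group(0).strip("@"), m.group(0)), text)


def unresolved(text: str) -> list:
    """Validation: list every token that survived substitution."""
    return TOKEN.findall(text)


settings = {
    "EnvironmentVariables": [],
    "ConnectionReferences": [
        {"LogicalName": "shared_commondataserviceforapps_8ca1f", "ConnectionId": ""},
        {"LogicalName": "shared_office365_1b2c3", "ConnectionId": ""},  # hypothetical
    ],
}

tokenized = json.dumps(tokenize(settings))
# Hypothetical Test-stage variables: the Office 365 variable has not been added yet
stage_variables = {"shared_commondataserviceforapps": "conn-test-0001"}
resolved = resolve(tokenized, stage_variables)
leftovers = unresolved(resolved)
print(leftovers)
```

With the Office 365 variable missing, the leftover list is non-empty, which is exactly the condition the validation step turns into a failed pipeline run.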
The Result Once this is in place, the deployment experience changes completely. A pipeline run that previously required a human to log into each target environment, open Power Automate, navigate to each flow, assign connections, and manually toggle flows on — a process that scales linearly with the number of flows — becomes fully automated. For a solution with 100 cloud flows deploying across three environments, that might be 300 individual manual actions eliminated per release cycle. The environment is fully operational the moment the pipeline is completed. No follow-up tickets. No forgotten flows. No production incidents because someone missed one. The key insight is that the platform gives you all the pieces — connection references for portability, deployment settings for environment-specific mapping, and the Dataverse API for programmatic activation — but it does not wire them together for you automatically. Once you do, your Power Platform deployments become as reliable and hands-off as any other enterprise application deployment.

By karthik nallani chakravartula

The Latency Tax That’s Hidden in Cloud-Native Systems (and the Hard Lessons I Learned to Minimize It)

Let’s be real, shall we? Do you remember the early days of the cloud-native promise? We dove in headfirst, breaking monolithic applications apart into microservices and deploying them to the cloud in all sorts of containers. We had unlocked the secret of scaling and resiliency, it seemed. And we had! Or had we?

I will not forget the first time I faced a truly perplexing performance issue (these are my own lessons learned, and I got more than a few things wrong before finding the right way). Our services ran fast on their own. Our code was pristine. Well, sort of. Yet our users were complaining about how sluggish everything felt. Our dashboards showed a sea of green, but something smelled really bad. After several days of intense investigation, we figured it out. It was death by a thousand cuts, with not a single bug to be found. Our solidly constructed architecture carried an invisible performance cost, and it was giving us a hard time. We were suffering from the latency tax.

This is not a story about broken systems. The tax is the invisible friction built into our modern distributed systems; it is what you pay for the privilege of going to the cloud. I want to talk to you about this tax today. What is it? Where does it hide? And how can you architect your systems to pay less of it?

What Is the "Latency Tax"?

Let’s get to the heart of the matter: latency is not just the time it takes for your service to execute a database query. In the cloud-native world, it is the sum total of every single handshake, hop, and translation that a request makes as it works its way through your ecosystem. Take a simple user request, e.g., “load my profile.” In a monolith, this could be as few as one or two hops.
In our shiny new microservices world, it could look something like this, conceptually:

Plain Text

You (The Client)
  ↓ (10 ms)
API Gateway
  ↓ (3 ms)
Service Mesh Sidecar
  ↓ (25 ms)
User Profile Service
  ↓ (15 ms)
Database
  ↓ (100 ms)
External Email Service
  ↓
And back again...

Can you see that? Even in a “healthy” system, there is a chain of delays like that. On a single request, each one is a few milliseconds that nobody notices. But at millions of requests, with queuing and retries stacked on top, they turn into seconds of delay, broken SLAs, and frustrated users. That’s the tax. And the tax man always collects.

Where Is That Tax Hidden? Let's Audit the System

So where do these hidden costs come from? Let's investigate the biggest performance offenders. I promise, once you know what to look for, you'll see them everywhere.

The Network Hop: It’s All About Geography

Every time one service talks to another, that is a network round trip. It seems instantaneous, but physics is a cruel mistress. A call from a service in us-east-1 to a database in eu-west-1 travels thousands of miles. You can't beat the speed of light.

My favorite fix: Co-locate your services! Get the parts that talk to each other as close together as you can, ideally in the same availability zone. For service-to-service communication internal to your system, ask yourself: "Does this really have to go through the public internet?"

The Serialization Slog: JSON Is Not Free, People

We love JSON because it is human-readable. Your servers? Not so much. Parsing and re-parsing JSON is costly in terms of compute. Now imagine a single request payload that gets serialized and deserialized at the gateway, then again at the service mesh, and again at your microservice. You are paying a parsing tax at every border.

My go-to fix: For internal communication, have your services talk to each other over a binary protocol such as gRPC with Protocol Buffers. The difference is stark. Let me give you a quick comparison.
A simple REST/JSON payload might look something like this:

JSON

{
  "userId": 123456,
  "userName": "Jane Doe",
  "email": "[email protected]"
}

The same data, defined as a Protocol Buffers message for gRPC, is much more efficient on the wire:

ProtoBuf

message User {
  int32 user_id = 1;
  string user_name = 2;
  string email = 3;
}

The binary form is smaller and far faster to encode and decode. We noticed a 60%–70% reduction in latency after this change in our internal services. It is transformative.

The Cold Start Chill: The Serverless Paradox

Serverless is great for cost efficiency. But that first request to a new function instance? It has to wake up, which can take hundreds of milliseconds. That’s a huge spike in your P99 latency.

My go-to fix: For latency-sensitive paths, use provisioned concurrency. It keeps a number of instances warmed up and ready to go. For functions that are less latency-sensitive, a simple warmer cron job will keep them from getting completely cold.

The Observability Overhead: When Watching Costs You

This one hurt. We brought on all of the monitoring tools available: distributed tracing, custom metrics, verbose logging. Our observability was excellent, but we saw latency increase by almost 10%. Every log line adds a bit of overhead, and it adds up fast.

My go-to fix: Be smart and lean. Use sampling. There is no need to trace every request. Ship your logs asynchronously, and batch your metric updates. Ask yourself if you really need to collect that metric, and if so, whether you need it right now.

When 1 + 1 = 3: The Multiplicative Effect of Microservices

Here’s the shift in mental model that changed everything for me. We tend to think of latency as linear. But when you have a distributed system with fan-out, it’s multiplicative. Imagine that Service A has to call Services B, C, and D in parallel to satisfy a request. What happens if Service B itself has to call E and F?
Now, a delay in any one of those calls does not just add to the overall latency; it can block the entire orchestration. A 99% reliable service sounds great, but if you have ten of them chained together, your overall reliability drops to (0.99)^10, or about 90%. Now apply the same reasoning to latency. Scary, yes?

How to counter: This is where patterns like the Aggregator (an API composition layer) and Circuit Breakers become important. The Aggregator pulls a number of small calls together so the client does not have to make each of them itself. The circuit breaker ensures that a slow dependency won’t take your entire system down. It is the same idea as bulkheads on a ship: contain the leak.

Accelerating Systems: A Playbook for the Fast

Good. Now that you know the problems, how do you get to good? How do you create systems that are fast by design?

1. Data Locality Policies

The compute should be close to the data. If you have a Lambda function talking to DynamoDB, make sure it’s in the same AWS region. Better yet, for resources whose placement you control, keep them in the same availability zone. Every unnecessary mile adds latency: light in fiber covers only about 125 miles (200 km) per millisecond, and real network routes are rarely straight lines.

2. Cache, Cache, and More Cache

I’ve become passionate about caching. And I’m not just talking about caching API responses. Go further:

- Authentication tokens: Validate a JWT once and cache the result for a few seconds.
- Database connection pools: Reuse the connections. Never open a new DB connection per request.
- Static config: For example, if your service reads its configuration from S3 at startup, cache it in memory.

A 5ms saving on a call that is made 10 times in each request saves you 50ms. That is huge!

3. Fail Fast: Timeout Fast

This is just as much a cultural change as it is a technical one. Set aggressive, sane timeouts on all external calls. If a dependency hasn't responded in 500ms, it probably isn't going to respond. You shouldn't wait for the full 30-second default timeout.
Use a circuit breaker to fail fast and return a fallback (even if it is a degraded experience). A fast "sorry" is better than a slow "maybe."

4. Go Asynchronous Wherever Possible

Not every operation requires immediate feedback. What about "order shipped" or "welcome" emails, or data aggregation for reports? Decouple these flows using messaging systems (SQS, RabbitMQ) or event streams (Kafka, Kinesis). This keeps the main user-facing flows fast and also helps to make the overall system more resilient.

The Most Important Metric You Are Probably Ignoring

If you take only one thing from this piece, let it be this: stop looking at average latency! The average is a lie that hides your worst user experience. What you need to care about are the outliers: the 95th (P95) and 99th (P99) percentiles. Let me give you an actual example from my past:

- P50 (median) latency: 120ms – "Looks great!"
- P95 latency: 650ms – "Uh oh."
- P99 latency: 1500ms – "We have a problem."

That P99 group, the 1% of your users seeing latencies of 1.5 seconds or worse, is having a terrible experience and is highly likely to churn. You need distributed tracing (like Jaeger or AWS X-Ray) to understand why those specific requests are slow.

The Tax Reduction Cheat Sheet

Layer | The Hidden Tax | Refund Instructions
API Gateway | Routing & Auth Cost | Skip for Internal Traffic
Networking | Interregion Hops | Co-locate Services
Serialization | JSON Costs | Use gRPC/Protobuf
Security | TLS Handshake Time | Reuse Sessions & Conns
Serverless | Cold Starts | Provision Concurrency
Observability | Logging & Tracing Cost | Sample
Database | Slow Queries & Hotspots | Cache Aggressively & Paginate

It Is a Design Problem, Not a Bug

Getting to low-latency cloud-native systems is not about finding a single magical Go function that can be written better. It is a fundamental shift in how we look at designs. Instead of just writing fast code, we must write low-friction architectures.
Every additional service, every additional sidecar, and every gateway is a trade-off. The advantage it brings must be weighed against the latency it adds. The trick is to keep making sure that every millisecond of latency you introduce is paid back with a disproportionately large gain in resilience, scale, or other functionality. So the next time you are designing a system, I want you to ask yourself this question: “For every millisecond of latency imposed, what is the advantage that the user is going to gain?” If you can't answer that question, it probably means that a fresh start is needed. The taxman is always there to collect. But with good design, we can make sure we only pay for what we actually need rather than for everything we happened to bolt on.

Frequently Asked Questions

Q1. Should I just go back to a monolith to avoid this?

Answer: Not necessarily! Monoliths have their own scaling and deployment problems. The goal is not to avoid microservices, but to use them more intelligently. If you have lots of small services and discover they are giving you more pain than gain, consider a modular monolith or larger, better-defined “macro” services instead.

Q2. Is gRPC always better than REST?

Answer: For internal service-to-service communication, almost always. REST/JSON still has its place for outward-facing APIs, though, as it is universally accepted and easy to debug. You can live in a hybrid mode.

Q3. How much observability is enough?

Answer: This is a fine balancing act. You need enough observability to diagnose production issues rapidly, but not so much that performance is impaired. Start with strong metrics and error logging, and once they are giving you useful data, add sampled distributed traces for the more complicated workflows. Never let the urge to collect everything set the goal here; let your specific needs govern it.

Q4.
Our P99 is high, but we don’t know where to start! What is the first thing to do?

Answer: Implement distributed tracing. This is not negotiable for modern systems. It will give you a visual picture of the complete lifecycle of a slow request and show exactly which service or network call is the bottleneck. You cannot fix what you cannot see.
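The arithmetic behind three of the claims in this piece (the hop-by-hop budget, the reliability of ten chained services, and the gap between average and tail latency) is easy to check for yourself. The sketch below uses the hop timings from the conceptual diagram and a made-up sample of 100 request latencies chosen to reproduce the P50/P95/P99 figures quoted above; the sample is hypothetical, not production data, and the nearest-rank percentile is just one of several common definitions.

```python
import math
import statistics

# One-way hop timings from the conceptual request path (milliseconds)
hops_ms = {
    "client -> gateway": 10,
    "gateway -> sidecar": 3,
    "sidecar -> profile service": 25,
    "profile service -> database": 15,
    "database -> email service": 100,
}
one_way_ms = sum(hops_ms.values())

# Ten services in a chain, each 99% reliable
chained_reliability = 0.99 ** 10


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample of 100 requests: mostly fast, with a slow tail
latencies_ms = [120] * 93 + [650] * 5 + [1500] * 2

mean_ms = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"one-way hop budget: {one_way_ms} ms")
print(f"chained reliability: {chained_reliability:.3f}")
print(f"mean={mean_ms} ms, P50={p50} ms, P95={p95} ms, P99={p99} ms")
```

In this sample the mean stays under 200 ms while the slowest requests take 1.5 seconds, which is the sense in which the average hides the worst user experience.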

By Bharath Kumar Reddy Janumpally

Why Infrastructure Efficiency Is Becoming the New Cloud Profitability Metric

Infrastructure efficiency is rapidly becoming one of the most important factors determining profitability for cloud providers, managed service providers, and SaaS companies. For years, infrastructure growth followed a simple formula: add more servers, more storage, and more capacity whenever demand increased. That model worked when hardware prices consistently declined and inefficiencies could be absorbed through growth. Those conditions no longer exist.

Today, providers face rising costs for memory, enterprise SSDs, GPUs, power, cooling, and colocation, while customers continue to expect lower pricing, better performance, stronger SLAs, and faster service delivery. Several industry shifts have fundamentally changed infrastructure economics. Changes in virtualization licensing models have increased costs for many organizations. AI adoption has driven demand for GPUs, high-capacity memory, and high-performance storage. Power and colocation costs continue to rise globally, while sovereign cloud initiatives are creating demand for regional infrastructure that must compete economically with hyperscale cloud providers. The challenge is clear: infrastructure costs are rising faster than revenue.

What Does a Workload Really Cost?

Infrastructure efficiency ultimately comes down to a simple question: what does it cost to deliver a workload? Customers do not buy servers, storage systems, or software licenses. They buy virtual machines, Kubernetes clusters, databases, AI environments, SaaS applications, and business services. The true cost of delivering those workloads includes much more than infrastructure hardware:

- Software licensing
- Power and cooling
- Colocation
- Network connectivity
- Storage
- Capacity buffers
- Staffing and operations
- Support and SLA commitments

The providers that achieve the lowest cost per workload while maintaining performance and service quality gain a significant competitive advantage.
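As a rough illustration of the arithmetic behind a cost-per-workload figure, the sketch below sums a set of monthly cost categories and divides by the number of workloads delivered. Every number here is hypothetical and chosen only to keep the math easy to follow; the categories mirror the list above, with hardware amortization added as an assumed line item.

```python
# All figures are hypothetical, in USD per month
monthly_costs = {
    "hardware_amortization": 40_000,
    "software_licensing": 18_000,
    "power_and_cooling": 12_000,
    "colocation": 10_000,
    "network_connectivity": 5_000,
    "storage": 8_000,
    "capacity_buffers": 7_000,
    "staffing_and_operations": 16_000,
    "support_and_sla": 4_000,
}
workloads_delivered = 1_200  # hypothetical count of VMs, clusters, databases, etc.

total_cost = sum(monthly_costs.values())
cost_per_workload = total_cost / workloads_delivered

# Share of each category, to show how much sits outside the hardware line
shares = {name: cost / total_cost for name, cost in monthly_costs.items()}
non_hardware_share = 1 - shares["hardware_amortization"]

print(f"total: ${total_cost:,}  per workload: ${cost_per_workload:.2f}")
print(f"non-hardware share of cost: {non_hardware_share:.0%}")
```

In this made-up example, two-thirds of the cost of each workload comes from something other than hardware, which is why a hardware-only view can be misleading.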
As infrastructure costs continue to increase, "cost per workload delivered" is becoming a useful framework for evaluating efficiency. Unlike traditional metrics focused solely on hardware utilization or licensing costs, this approach considers the complete economics of delivering customer-facing services.

Beyond Infrastructure Utilization

Infrastructure efficiency is not measured only by CPU, memory, or storage utilization. Operational metrics often have an equally significant impact on the cost of delivering workloads. Examples include administrator-to-server ratio, administrator-to-VM ratio, workload deployment times, incident resolution times, and the number of infrastructure platforms that must be maintained.

Cost alone is also a misleading metric. A workload delivered at lower cost may also deliver lower performance, higher contention, or slower support response times. A virtual machine with two vCPUs does not necessarily provide the same amount of usable compute across platforms. CPU oversubscription ratios, noisy-neighbor effects, storage latency, network performance, and support commitments all influence the actual customer experience. The relevant metric is not simply cost per workload, but cost per workload delivered at a defined SLA.

Architectural Choices and Efficiency

Infrastructure architecture plays a major role in determining workload economics. Traditional infrastructure environments often combine separate virtualization, storage, networking, monitoring, backup, and orchestration platforms. While this approach offers flexibility, it can also increase operational complexity, encourage overprovisioning, and create management overhead.

As a result, many organizations are moving toward more integrated infrastructure models, including hyperconverged infrastructure (HCI) and software-defined platforms that consolidate multiple functions into a unified operational framework. The goal is not merely consolidation.
The real objective is to reduce operational overhead, improve resource utilization, simplify scaling, and lower long-term total cost of ownership.

This becomes particularly important for sovereign cloud initiatives. Unlike hyperscalers that benefit from massive global scale, regional cloud providers often need to achieve competitive economics within a specific country or market while maintaining local data residency, compliance, and operational control. In these environments, maximizing infrastructure efficiency is often critical to long-term profitability.

Infrastructure Efficiency Metrics Worth Tracking

Organizations evaluating infrastructure efficiency should look beyond traditional utilization metrics and monitor indicators that directly affect workload economics, including:

- Cost per virtual machine
- Cost per container
- Cost per Kubernetes cluster
- Cost per AI workload
- Storage efficiency ratios
- Power consumption per workload
- Administrator-to-server ratio
- Workload deployment times
- Mean time to resolution (MTTR)
- Resource utilization across compute and storage environments

These metrics provide a more accurate view of infrastructure performance than hardware utilization alone.

Why AI Changes the Equation

The emergence of AI workloads has made infrastructure efficiency even more important. GPU resources are expensive, but GPUs alone do not determine the economics of AI infrastructure. Storage performance, networking efficiency, workload orchestration, and operational processes all directly impact GPU utilization and overall service profitability. In many environments, the challenge is no longer acquiring GPUs. It is ensuring that the surrounding infrastructure can keep them fully utilized. As GPU, storage, and power costs continue to rise, organizations are increasingly focused on maximizing the value extracted from every infrastructure resource.
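The utilization point can be illustrated with simple arithmetic. The sketch below assumes a hypothetical fully loaded hourly cost per GPU and shows how the cost of each useful GPU-hour grows as utilization falls. The function name and the figures are assumptions chosen for illustration.

```python
def effective_cost_per_gpu_hour(hourly_cost: float, utilization: float) -> float:
    """Cost of one useful GPU-hour when the GPU is busy only `utilization` of the time.

    `hourly_cost` is the fully loaded cost of keeping the GPU available for an hour
    (hypothetical input); `utilization` is the busy fraction, in the range (0, 1].
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_cost / utilization


# Illustrative numbers only: the same GPU at two utilization levels.
for busy_fraction in (0.40, 0.80):
    cost = effective_cost_per_gpu_hour(hourly_cost=4.00, utilization=busy_fraction)
    print(f"utilization {busy_fraction:.0%}: {cost:.2f} per useful GPU-hour")
```

Under these assumed numbers, doubling utilization halves the cost of each useful GPU-hour without buying any additional hardware.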
AI infrastructure economics are becoming less about acquiring the largest amount of hardware and more about achieving the highest utilization and operational efficiency from existing investments.

Measuring Infrastructure Economics

One of the challenges with infrastructure efficiency is that it often remains invisible until it is measured. Many organizations focus on software licensing when evaluating infrastructure costs, but licensing is only one part of the equation. Utilization rates, storage efficiency, operational overhead, power consumption, hardware refresh cycles, staffing requirements, and SLA commitments often have a much greater impact on long-term economics.

This is why Total Cost of Ownership (TCO) modeling is becoming increasingly important. Effective infrastructure evaluations should account for:

- Software costs
- Hardware acquisition
- Energy consumption
- Colocation expenses
- Storage efficiency
- Staffing requirements
- Operational complexity
- Support and maintenance costs

Organizations that perform these broader analyses often discover that the greatest opportunities for savings come not from individual licensing decisions but from improving overall workload economics.

Conclusion

The next phase of cloud infrastructure optimization is unlikely to be driven by capacity growth alone. As infrastructure costs continue to rise and customer expectations continue to increase, providers must focus on delivering more workloads with fewer resources while maintaining performance and service quality. In that environment, infrastructure efficiency becomes more than a technical objective. It becomes a business metric. The organizations that can achieve the lowest cost per workload delivered at a defined service level will be best positioned to protect margins, remain competitive, and build sustainable cloud and AI services for the future.

By Tetiana Fydorenchyk

The Latest Cloud Architecture Topics