DZone Spotlight

Thursday, April 30 View All Articles »

Agent Skills Explained for Developers

By Pavan Belagatti

CORE

Agent skills are suddenly everywhere in the AI engineering world, and for good reason. They solve a very real problem: AI agents may be smart, but they still know nothing about your organization unless you explicitly teach them. They do not automatically understand your internal workflows, your service catalog, your production readiness rules, or the exact steps needed to fix recurring issues. That is where agent skills come in. They give your AI agent reusable knowledge, structured instructions, and workflow-specific context so it can do meaningful work instead of acting like a generic chatbot with tool access. If you have been hearing about skills.md files, MCP servers, Claude, Copilot, and custom agent workflows, this is the missing mental model. Once you get it, the whole ecosystem makes a lot more sense. Why Agent Skills Are Getting So Much Attention One quick way to understand whether a concept matters is to look at search interest. The term agent skills has been climbing fast, especially in recent months. That is usually a sign that people are not just curious, they are actively trying to use something in real projects. And it makes sense. This is not a niche concept only for AI researchers. Developers, platform teams, engineering managers, and AI engineers can all benefit from it because agent skills increase both capability and efficiency for AI agents. A lot of the early buzz is tied to Claude because Anthropic introduced the concept as an open standard. But the idea is bigger than one model or one company. The important part is that a skill can travel across platforms, which makes it much more useful than a one-off prompt hidden in one tool. How We Got Here: From Function Calling to MCP to Agent Skills To really understand agent skills, it helps to place them in the broader evolution of AI agents interacting with the outside world. 1. Function Calling The first big step was function calling, also known as tool calling. This was when large language models started invoking external tools through a predefined JSON schema. A classic example is something like get weather data for a city. That was useful, but it had clear limitations: Manual wiring everywhere. Every function had to be described and connected by hand.Error handling was your job. If something failed, the system did not really know how to recover intelligently.Scaling was painful. Every new capability increased developer overhead. So, function calling gave models access to tools, but not much autonomy or reusable workflow intelligence. 2. Model Context Protocol (MCP) Then came the Model Context Protocol, or MCP. This made it much easier to connect AI agents to external tools and data sources through a standard protocol. The easiest way to think about MCP is as a USB-like standard for AI systems. Instead of custom integrations for every tool, you get a cleaner, more interoperable plug-and-play model. That is why so many companies are now building MCP servers for their own systems and workflows. MCP was a major leap because it standardized access. But access alone is not enough. 3. Agent Skills This is where agent skills become important. If MCP gives your AI agent access to external tools and data, agent skills teach the agent what to do with those tools and data. That is the core idea. Instead of giving an agent only tool access, you package repeatable workflows, domain knowledge, trigger conditions, and repair playbooks into reusable skill files. The agent can then reason through a task in a more structured and specialized way. Each stage in this evolution shifts more agency from the developer to the system: Function calling gave the model tool access.MCP standardized access to tools and data.Agent skills gave the model reusable capability and workflow intelligence. That is why this feels like a truly agentic progression. What Agent Skills Actually Are Agent skills are folders of instructions and supporting files that package a repeatable workflow, specialized knowledge, or a new capability for your AI agent. On the surface, that might sound like saved prompts. But they are more than that. A good skill does not just store text. It defines: When the skill should activateWhat the agent should do step by stepWhat reference data the agent should useWhat remediation playbooks or actions it can follow So instead of copy-pasting a giant prompt every time you want an agent to do something specialized, you write that capability once and reuse it across sessions and tools. This is exactly what makes agent skills powerful. They turn a general-purpose model into something much closer to a reliable specialist. Example: Custom Deployment Skill Here’s an example of a custom skill for deploying services in your organization: Plain Text { "identifier": "deploy-to-production", "title": "Deploy to Production", "properties": { "description": "Guide for deploying services to production. Use when users ask to deploy, release, or promote a service to production.", "instructions": "# Deploy to Production\n\nFollow these steps to deploy a service to production:\n\n## Step 1: Verify prerequisites\n\n- Check that all tests pass.\n- Verify the service has a production-readiness scorecard score above 80%.\n- Confirm the service owner has approved the deployment.\n\n## Step 2: Run the deployment\n\nExecute the deployment action for the target service and environment.\n\n**Example input:**\n- Service: `payment-service`\n- Environment: `production`\n\n**Expected output:**\n- Deployment initiated successfully.\n- Action run ID returned for tracking.\n\n## Step 3: Verify deployment\n\n- Check the action run status.\n- Verify the service is healthy in production.\n- Monitor for any alerts in the first 15 minutes.\n\n## Common edge cases\n\n- If tests are failing, do not proceed with deployment.\n- If scorecard score is below threshold, recommend remediation steps first.\n- If deployment fails, check logs and suggest rollback if needed.", "references": [ { "path": "references/deployment-runbook.md", "content": "# Deployment Runbook\n\n## Pre-deployment checklist\n\n- [ ] All CI checks pass\n- [ ] Code review approved\n- [ ] QA sign-off received\n\n## Rollback procedure\n\nIf deployment fails:\n1. Revert to previous version\n2. Notify on-call team\n3. Create incident ticket" }, { "path": "references/common-errors.md", "content": "# Common Deployment Errors\n\n## ImagePullBackOff\nCause: Container registry authentication failed.\nFix: Verify registry credentials.\n\n## CrashLoopBackOff\nCause: Application fails to start.\nFix: Check application logs and configuration." } ], "assets": [ { "path": "assets/deployment-config.yaml", "content": "apiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: {{ service_name }\nspec:\n replicas: 3\n strategy:\n type: RollingUpdate" } ] } } The Open Standard Behind Agent Skills Agent Skills were originally created by Anthropic and released as an open standard on December 18, 2025, along with the specification and SDK. The standard is now governed as a cross-platform specification at agentskills.io. The practical implication is huge. A skill created for Claude is not trapped inside Claude. The same skill can work across multiple AI platforms that adopt the standard, including tools like OpenAI Codex, Gemini CLI, GitHub Copilot, Cursor, VS Code, and others. That portability is what makes this more than another product feature. It is infrastructure for reusable agent behavior. Why LLMs Need Agent Skills in the First Place LLMs are great at general conversation, brainstorming, and broad reasoning. But when workflows become complex, they often become inconsistent. They forget details, miss edge cases, or answer too generically because they do not have the right context. This becomes painfully obvious in cases like: Analyzing internal service healthUnderstanding organization-specific scorecardsApplying a company’s engineering rulesGenerating precise remediation stepsWorking across tools like GitHub, issue trackers, and internal platforms Agent skills help bridge that gap. They move the model from passive chat behavior to active, specialized execution grounded in your real systems and workflows. A Practical Example: Building an Agent Skill With Port.io To make this concrete, consider a real workflow built around Port.io. Port is an agentic internal developer platform that helps teams automate engineering workflows. It acts as a central place where developers can see services, ownership, scorecards, readiness, and other operational data without bouncing between a dozen different tools. In this example, Port’s MCP server is connected, so the AI agent can access live data from a Port account. Once connected, the agent can pull information such as: Services in the catalogBlueprints in the organizationProduction readiness statesScorecard pass/fail data That gives the agent raw access. Then, agent skills provide the behavior and context needed to make that access useful. The Three-File Structure of This Agent Skill The example skill is built around a production readiness workflow and uses three main files. 1. skills.md This is the brain and trigger mechanism of the skill. It includes: The skill nameDescriptionMetadata like author and versionActivation keywordsInstructions for how the agent should behave In this case, the skill is focused on Port readiness. The description includes keywords such as scorecard, level B, and branch protection, so the agent knows when to activate the skill. It also defines the workflow for diagnosing failures, understanding readiness levels, generating PR descriptions, and suggesting fixes. 2. references/scorecard-state.md This file contains the factual reference data. It acts like a snapshot of the actual Port catalog, including the current state of services and scorecard rules. In the example, it includes data for six services and their pass/fail status against readiness rules. This matters because it stops the agent from answering in vague terms. Instead of saying, “You may need better branch policies,” it can say, “This specific service is failing because branch protection is missing and no recent PR activity exists.” 3. assets/fix-checklist.md This file is the remediation playbook. It gives the agent a step-by-step checklist for fixing failures, such as: Assigning the correct teamEnabling branch protectionSetting code ownersEnsuring recent PR freshness So if the reference file tells the agent what is wrong, the checklist tells it how to fix it. What This Skill Enables the Agent to Do Once these files are in place and the Port MCP server is connected, the AI agent becomes dramatically more useful. It can answer questions like: What services are in my Port catalog?What blueprints exist in my organization?Why is the travel service failing its scorecard?Which service is closest to reaching level B?Write a PR description for Agentic AI explaining the readiness impact.Assign Agentic AI to the AI team. And importantly, it can answer these without forcing you to paste all the context into every new conversation. That is the practical magic of agent skills. Context is packaged once, then reused repeatedly. Understanding Port Readiness in This Example The skill in this setup revolves around production readiness in Port. Port readiness is basically a grading system that tells you how production-ready a service is. The levels include things like A, B, C, and F, depending on how many scorecard rules are satisfied. In the example workflow, several services are currently at level C. The agent can inspect the rules, explain why a service is still at level C, and tell you what must be done to move it up to level B. Typical requirements for moving from level C to level B include: Assigning a teamEnabling GitHub branch protectionPushing a recent PR Because the skill has both the scorecard state and the remediation checklist, it can map those rules directly into actionable next steps. How the Interaction Feels in Practice After connecting the MCP server and loading the skill into a coding environment like GitHub Copilot agent mode in VS Code, you can work conversationally. You can ask: Why is the prompt engineering service failing?What team is assigned to this service?How can all my services reach level B?Can you push a simple PR to this service? The agent then checks the skill instructions, pulls the relevant facts from the reference file, uses the checklist for remediation guidance, and responds in a way that is specific to your setup. In the example, the agent can even update team assignments in the scorecard state and suggest exact actions needed to improve readiness. This is a big shift from normal chatbot usage. Instead of asking broad questions and getting broad answers, you are interacting with an agent that understands your environment and your operational rules. Why This Is More Powerful Than Prompts Alone A long prompt can tell an agent a lot of things once. But it is still fragile. Prompts are easy to lose, hard to standardize, and difficult to reuse cleanly across teams and platforms. They also tend to degrade over time as workflows evolve. Agent skills solve that by separating responsibilities: The skill file defines behavior and triggersThe reference file provides facts and current stateThe checklist file provides action plans That structure makes the whole system more maintainable, shareable, and predictable. It also makes it easier to build agents that do not just know tools, but know how your organization actually works. The Bigger Takeaway The important idea here is not just Port, Claude, or one specific tutorial setup. The bigger takeaway is that agent skills are a reusable layer of organizational intelligence for AI agents. You can imagine applying the same pattern to many other internal workflows: Incident triageRelease readinessSecurity policy checksOnboarding flowsDocumentation enforcementInfrastructure review As long as the agent has access to the right tools and data through something like MCP, skills can teach it how to reason and act within that domain. What Makes Agent Skills So Compelling Right Now There are three reasons agent skills feel especially important right now. AI agents are everywhere, but most are still generic.MCP gives agents access, but not domain behavior.Teams need reusable workflows, not prompt improvisation every time. That combination creates the perfect environment for skills to become a foundational pattern. If the first wave of AI was about generating text, and the second wave was about calling tools, this next wave is about packaging expertise so agents can repeatedly perform meaningful work. Final Thoughts Agent skills are one of the clearest signs that AI tooling is maturing from demos into operational systems. They let you encode workflows once, connect them to real systems, and reuse them across platforms. In practical terms, that means your AI agent can stop acting like an outsider and start behaving like a teammate who understands your stack, your rules, and your goals. That is the real leap here. MCP gives your agent the keys. Agent skills teach it how to drive. If you want to explore this approach hands-on, the Port-based production readiness example is a great model: connect your data source, define the skill behavior in skills.md, add factual reference state, add a remediation checklist, and then let the agent work against your real environment. Once you see that flow in action, it becomes obvious why agent skills are getting so much attention. More

Observability on the Edge With OTel and FluentBit

By Graziano Casto

CORE

When we design observability pipelines for modern cloud environments, we implicitly rely on a set of luxurious guarantees: limitless bandwidth, highly available networks, practically infinite storage, and abundant computing power. But when you move these workloads to the edge, think of a maritime vessel navigating the mid-Atlantic or a remote wind turbine, those guarantees vanish. Edge environments are constrained by intermittent connectivity, severe limits on CPU and RAM, and a lack of persistent storage guarantees. You simply cannot run a full, traditional observability stack locally, nor can you stream everything to the cloud without exhausting limited satellite bandwidth. The engineering challenge becomes clear: how do we build a pipeline that reliably captures traces, metrics, and logs, survives unpredictable network outages, and perfectly correlates signals without saturating edge constraints? A highly compelling, production-realistic solution to this problem was showcased for KubeCon EU 2026, demonstrating a fully correlated observability pipeline built for constrained edge environments using OpenTelemetry and Fluent Bit. You can explore the complete implementation in the graz-dev/observability-on-edge repository. This article dives deep into the architecture, the technical trade-offs, and the specific configurations required to make observability work reliably when the network itself is your biggest enemy. Taming the Bandwidth With Tail-Based Sampling The fundamental problem with distributed tracing is the sheer volume of data it generates. A single HTTP request traversing various middleware and downstream services can easily produce up to 50 individual spans. At a relatively modest load of ten requests per second, you are suddenly dealing with hundreds of gigabytes of trace data over a year. In a cloud environment, you might simply scale up your storage. On a maritime vessel connected via an expensive, low-bandwidth satellite link, sending all of this is economically and technically impossible. To solve this, we must aggressively sample the data, dropping the noise and keeping only what is actionable. The most naive approach is head-based sampling, where the system makes a keep-or-drop decision at the very first span of a trace. While head-based sampling adds almost no computational latency, it is entirely blind to the outcome of the request. If you decide to drop a trace at its inception, and that request subsequently fails or experiences a massive latency spike, that crucial diagnostic information is lost forever. In edge deployments where errors might be rare but highly critical, this is unacceptable. The solution is tail-based sampling. By buffering all spans of a trace in memory within the OpenTelemetry Collector, the pipeline waits for the trace to fully complete before making a decision based on the final outcome. In this implementation, the tail-sampling policy is strictly configured to keep 100% of traces that contain an error, and 100% of traces where the duration exceeds 200 milliseconds. Normal, fast, successful traces are discarded entirely. The observed result is a massive reduction in bandwidth overhead, dropping roughly 80% of all spans before they are ever exported over the network. To make this bandwidth reduction truly effective, we must apply the same philosophy to our logs. If we filter traces but send every single log line, we defeat the purpose. For log processing, Fluent Bit runs as a DaemonSet on the edge nodes, tailing container logs. Rather than using Fluent Bit's native grep filters, which struggle with complex multi-field conditional logic, a custom Lua filter is injected into the pipeline. This Lua script precisely mirrors the OpenTelemetry tail-sampling criteria, evaluating each log record and keeping only those with an error level or a duration exceeding 200 milliseconds. By performing this logic at the absolute edge, before the log data even leaves the node, the pipeline drops approximately 86% of log volume at the source, preventing unnecessary network I/O. Persistent Queuing for Intermittent Connectivity Aggressive sampling solves the bandwidth issue during steady-state operations, but what happens when the satellite link inevitably fails? If the OpenTelemetry Collector cannot reach the central hub, it will quickly exhaust its in-memory retry buffers, and all telemetry accumulated during the outage will be permanently lost. To survive these disconnections, the pipeline implements a file-backed persistent queue utilizing the file_storage extension within the OpenTelemetry Collector. This provides a bbolt (BoltDB) key-value store directly on the edge node's local disk. When the exporter's sending queue is configured to use this file storage, outgoing telemetry batches are serialized and written safely to the crash-safe bbolt database before dispatch. These items remain safely on disk until the collector receives a successful acknowledgment from the remote endpoints. Configuring this queue correctly requires understanding a critical nuance of the OpenTelemetry Collector's internal architecture regarding consumers. By default, the exporter uses four concurrent consumer goroutines to claim batches from the queue. Because the processing pipeline generally produces batches relatively slowly at edge traffic volumes, these four consumers will claim batches almost instantly, holding them in their in-memory retry buffers rather than leaving them in the bbolt queue. Consequently, your queue depth metrics will deceptively report zero even during an active network outage, blinding you to the growing backlog. By deliberately setting the num_consumers configuration to exactly one, only a single batch is ever held in flight in memory. All subsequent batches safely queue in bbolt, allowing the metric to accurately reflect the growing backlog during an outage. The Reconnection and the Time Travel Problem When the satellite link is eventually restored, the real chaos begins. The OpenTelemetry Collector detects the restored connection and immediately begins draining the accumulated bbolt queue at maximum network speed. However, the data stored in this queue possesses timestamps from minutes or hours in the past. We are essentially attempting to push historical, out-of-order data into our central observability backends. Different backends handle this "time travel" problem differently. Jaeger, our distributed trace storage, handles it gracefully by design. Its storage model is append-only, possessing no concept of out-of-order rejection. Traces originating from the failure window simply appear in the user interface precisely where they belong chronologically. Loki, handling our logs, is much stricter. By default, Loki expects log entries for a given stream to arrive in roughly chronological order, and it will forcefully reject significantly older timestamps with HTTP 400 errors. If left unconfigured, the OpenTelemetry Collector would receive these errors continuously upon queue drain, leading to the permanent loss of all logs generated during the outage. To prevent this disaster, we must explicitly configure unordered_writes: true in the Loki settings. This crucial parameter disables the strict per-stream ordering requirement, allowing the massive burst of queued, historical log entries to be ingested successfully. Metrics ingestion presents an even harsher reality. In this architecture, metrics are exported to Prometheus using the prometheusremotewrite exporter. Unlike logs and traces, the OpenTelemetry Collector library supporting this exporter lacks support for the file-backed bbolt queue, leaving the metrics queue strictly in-memory. Furthermore, when the link restores, any old metrics held in memory are sent to Prometheus, but Prometheus natively rejects out-of-order samples. While there are alternative protocols like OTLP HTTP for metrics, utilizing OTLP for Prometheus ingestion results in aggressive HTTP 400 rejections for out-of-order data. This causes the exporter to retry indefinitely, permanently blocking the queue and grinding the entire metrics pipeline to a halt. It is crucial to note that this specific limitation — the inability to use file-backed queues for metrics — is a known library constraint tied directly to the OpenTelemetry Collector versions (v0.95 and v0.96) utilized in this repository. Because these specific builds do not support sending_queue.storage: file_storage for the Prometheus remote-write exporter, the architecture is forced to keep the metrics queue in RAM. The architectural decision here is a deliberate and calculated engineering trade-off: by using the prometheusremotewrite exporter targeting the remote-write endpoint, Prometheus silently skips the out-of-order samples by returning an HTTP 204 status with zero written. The pipeline queue drains cleanly and unblocks, but the metric data generated during the actual outage window is intentionally sacrificed. At the edge, maintaining pipeline integrity is often prioritized over absolute metric continuity. Achieving Deterministic Signal Correlation An observability stack is only as valuable as its ability to correlate signals. A latency spike on a dashboard must lead seamlessly to the exact distributed trace, which must seamlessly transition to the specific log lines emitted during that exact request. In this edge architecture, achieving perfect correlation is not reliant on best-effort timestamp matching; it is structurally guaranteed. The process begins in the application code, which extracts the OpenTelemetry trace_id and span_idfrom the context of every incoming HTTP request and structurally injects them into every log line via a JSON logger. Because the Fluent Bit Lua filter and the OpenTelemetry tail-sampling processor utilize the exact same logic, we achieve a deterministic alignment. Every trace that survives the sampler will have a corresponding log line surviving the Lua filter, and conversely, no log line will exist without a parent trace. There are no orphaned traces and no orphaned logs. To link our high-level metrics directly to these traces, the architecture employs the OpenTelemetry spanmetrics connector. This connector reads the sampled spans and generates Prometheus histogram metrics regarding request rates and latencies. Crucially, it attaches an exemplar to each histogram bucket, a sparse metadata annotation carrying the specific trace_id that contributed to that latency measurement. The placement of this connector is paramount. In the configuration pipeline, spanmetrics is wired strictly after the tail sampling processor. Because it runs post-sampling, every single trace_id it embeds into a metric exemplar is absolutely guaranteed to have survived the sampling process and exist in Jaeger. When operators view their Grafana dashboards, they see diamond markers on their latency graphs representing these exemplars; clicking a marker reliably drops them directly into the exact failing trace with zero dead links. Performance Testing To prove this architecture works under realistic conditions, the project doesn't just send a few manual requests. Instead, it employs the k6 Operator, a Kubernetes-native load test runner, to generate a continuous, high-volume telemetry stream from within the cluster hub. The load generator utilizes a custom TestRun resource that spins up 500 virtual users over a 40-minute period. This performance test follows a deliberate ramp-up profile: it scales from zero to 100 virtual users in the first 30 seconds, climbs to 250 in the next 30 seconds, reaches 500 at the one-minute mark, and sustains that peak for a 40-minute steady state. At its peak, this setup generates approximately 2,500 spans per second. The traffic is intelligently distributed across four distinct API endpoints simulating a vessel's systems, each with specific latency and error profiles, perfectly exercising the tail-sampling and Lua filtering logic. While the k6 load test validates the pipeline's throughput, the underlying reality is that rigorous performance testing was essential for a much more critical goal: drastically minimizing the collector's resource footprint. In constrained edge environments, every megabyte of RAM and CPU cycle consumed by observability is stolen directly from the primary application workloads. Through an extensive, automated performance tuning campaign, we analyzed the complex interactions between the Go runtime and the collector's internal processors. The findings revealed that optimizing an edge node requires surgically tuning both the OTel configuration and the underlying Go environment rather than simply guessing at limits. By methodically testing various permutations, we discovered the exact "sweet spot" to maximize performance while shrinking the footprint. The most impactful findings from these tests led to highly specific internal calibrations. For example, the memory_limiter processor was precisely tuned to enforce a soft limit of 320 MiB and a hard limit of 400 MiB. This was paired with a batch processor rigorously configured to accumulate exactly 512 spans or wait for a maximum 5-second timeout. Furthermore, these tests demonstrated that throttling the exporter queue to a single consumer (num_consumers: 1) was critical. It not only provided accurate backpressure metrics during a simulated satellite outage but structurally prevented the Go runtime's garbage collector from thrashing when the connection was restored and massive historical queues suddenly drained. The results of this optimization campaign are striking. Stripped of the bloated default components via the OpenTelemetry Collector Builder, the resulting 30 MB binary operates seamlessly under intense pressure. It continuously processes thousands of spans per second while consuming merely 1% to 5% of a single CPU core and hovering predictably between 80 MiB and 150 MiB of active memory. This definitively proves that with proper performance testing and exact Go runtime calibrations, you do not have to choose between rich telemetry and edge node stability. Validating Resilience Through Network Chaos Proving that resilience actually works requires simulating harsh physical realities within a controlled Kubernetes environment. Relying on high-level Kubernetes NetworkPolicies is insufficient for this testing, as they do not provide the surgical, instantaneous, and reversible IP-layer control needed to simulate an abrupt satellite drop. Instead, the project utilizes a privileged DaemonSet running netshoot, a network debugging container. Operating in the host network namespace, this pod can directly manipulate the edge node's kernel routing rules using iptables. A dedicated chaos script surgically inserts DROP rules into the FORWARD chain, specifically targeting the outbound ports for Jaeger, Loki, and Prometheus. A critical detail in this simulation is the behavior of the Linux kernel's connection tracking framework, conntrack. Modern kernels maintain state for established TCP connections, allowing them to bypass newly inserted DROP rules. If you apply an iptables drop rule without further action, existing gRPC connections between the collector and the hub will simply continue to flow unaffected. The chaos script explicitly executes a conntrack flush command targeting the collector's IP address. This violently terminates the established states, forcing the client to initiate a new TCP handshake, which is immediately blocked by the new rules. This accurately triggers the failure state: the OpenTelemetry exporter begins failing, batches begin piling up in the bbolt database, and the queue depth metrics steadily climb. Removing the rules simulates link restoration, triggering the massive, satisfying spike in export throughput as the resilient queue drains historical data into the backends. Conclusion Observability at the edge forces engineering teams to abandon the comfortable defaults of cloud-native computing. We cannot afford to transmit every metric, log line, and trace span. By combining aggressive tail-based and source-based sampling, highly localized persistent queuing, out-of-order gap-filling configurations, and meticulous correlation through exemplars, it is completely possible to maintain deep, actionable visibility into remote, constrained environments. The implementation presented in the graz-dev/observability-on-edge repository serves as an example of these techniques. It proves that with strict resource management and a deep understanding of network behavior, robust edge observability is not just a theoretical concept, but a highly achievable engineering reality. More

Trend Report

Security by Design

Security teams are dealing with faster release cycles, increased automation across CI/CD pipelines, a widening attack surface, and new risks introduced by AI-assisted development. As organizations ship more code and rely heavily on open-source and third-party services, security can no longer live at the end of the pipeline. It must shift to a model that is enforced continuously — built into architectures, workflows, and day-to-day decisions — with controls that scale across teams and systems rather than relying on one-off reviews.This report examines how teams are responding to that shift, from AI-powered threat detection to identity-first and zero-trust models for supply chain hardening, quantum-safe encryption, and SBOM adoption and strategies. It also explores how organizations are automating governance across build and deployment systems, and what changes when AI agents begin participating directly in DevSecOps workflows. Leaders and practitioners alike will gain a grounded view of what is working today, what is emerging next, and what security-first software delivery looks like in practice in 2026.

Refcard #388

Threat Modeling Core Practices

By Apostolos Giannakidis

CORE

Refcard #401

Getting Started With Agentic AI

By Lahiru Fernando

The LLM Selection War Story: Part 1 - Why Your Model Selection Process is Fundamentally Broken

Here's a confession that'll probably get me kicked out of the AI engineering community: I spent three months selecting an LLM based on benchmark scores, built an entire production system around it, and watched it fail spectacularly in ways no benchmark predicted. The model scored 94% on reasoning tasks. It couldn't handle a simple user asking "wait, what did I just say?" without losing its mind. Let me tell you why everything you think you know about choosing an LLM is probably wrong, and more importantly, what metrics actually matter when your system is bleeding money because your chosen model decided to hallucinate pricing information to paying customers. The Benchmark Theatre: A Production Horror Story December 2023. I'm sitting in a conference room with our management, presenting my carefully researched comparison of GPT-4, Claude 2, and Gemini. Beautiful slides. Color-coded charts. GPT-4: 92% on reasoning benchmarks. Claude: 89%. Gemini: 87%. Decision made in 15 minutes. We went with GPT-4 because, obviously, 92% > 89%. Fast forward two weeks into production. Our customer support chatbot, powered by our shiny 92%-scoring model, started doing something... weird. It would answer the first three questions perfectly. Question four? Suddenly it forgot the customer's name. Question five? It contradicted its answer from question two. Question six? It started making up features our product didn't have. The Reality Check: That 3% difference in benchmark scores? Meaningless. The model's inability to maintain context coherence over a 10-turn conversation? Not measured by any benchmark we evaluated. We discovered this the hard way when a customer tweeted a screenshot of our chatbot confidently claiming we offered a "Premium Diamond Tier" subscription. We've never had a Premium Diamond Tier. The tweet got 15,000 retweets. Our VP was not amused. The Metrics That Actually Matter (And Nobody Talks About) After our Premium Diamond Tier incident, I did what any reasonable engineer would do: I stopped trusting benchmarks entirely and started measuring what was actually breaking in production. Over the next six weeks, we instrumented everything. Every conversation turn. Every context window. Every tool call. Every weird behavior. What emerged was a completely different picture of model performance. Here are the three metrics that became our North Star, and why you've probably never heard of them: 1. Mean Time To Weird Behavior (MTTWB) This is my favorite metric because it sounds ridiculous but predicts production failures better than any benchmark. MTTWB measures how many conversation turns pass before the model does something that makes users go "wait, what?" For our GPT-4 deployment, MTTWB was 4.7 turns. Sounds decent until you realize that 68% of our customer support conversations lasted 8+ turns. We were essentially guaranteed weirdness in two-thirds of interactions. When we tested Claude 2.1 (which scored 3% lower on benchmarks), MTTWB was 12.3 turns. In production terms, this meant 82% of conversations completed without weird behavior. That 3% benchmark difference? Represented a 300% improvement in conversation reliability. Here's what "weird behavior" actually looks like in production: Forgetting the user's name mid-conversation (happened 847 times in month one)Contradicting previous statements without acknowledging the changeHallucinating product features, pricing, or capabilitiesSuddenly switching to a different language or toneLosing track of what problem the user was trying to solve The kicker? None of these behaviors show up in single-turn benchmark tests. They're emergent properties of multi-turn conversations with real context management challenges. 2. Context Rot Rate (CRR) This one took us forever to even identify as a problem. Context Rot Rate measures how quickly a model's understanding of the conversation context degrades as the context window fills up. We discovered this when analyzing failed conversations. Early in the conversation (turns 1-3), models were brilliant. Accuracy was 94%+. By turn 8, with the context window at 60% capacity, accuracy dropped to 76%. By turn 12, with the window at 85% capacity, accuracy was 61%. But here's where it gets interesting: this degradation wasn't linear, and it varied wildly by model. GPT-4 showed a sharp drop-off after 50% context utilization. Claude maintained accuracy much longer, degrading gracefully. Gemini fell off a cliff at 40% utilization. In production terms, this meant: GPT-4: Had to reset context every 6-7 turns, frustrating users who felt like they were constantly re-explaining themselvesClaude 2.1: Could maintain coherent conversations for 12-15 turns before needing a context resetGemini: Basically unusable for our multi-turn support conversations The benchmark scores that showed GPT-4 as "better"? They didn't measure any of this because they didn't stress the context window with realistic conversation loads. 3. Tool Call Consistency (TCC) This metric nearly broke my brain when we first identified it. Tool Call Consistency measures how reliably a model follows tool-calling patterns across a conversation. Our chatbot had access to six tools: check_order_status, update_shipping_address, process_refund, escalate_to_human, search_knowledge_base, and create_support_ticket. Simple enough, right? Wrong. Here's what actually happened in production: See that "Same Tool Recall Rate"? That measures whether the model remembers that it already used a tool earlier in the conversation. GPT-4 scored highest on initial tool calls but forgot its own actions 42% of the time in longer conversations. Real example from our logs: Turn 2: Model calls check_order_status("12345") - works perfectlyTurn 5: User asks "what was the status again?" - Model calls check_order_status("12345") again instead of referencing the earlier resultTurn 7: User asks for an update - Model calls check_order_status("12345") a third time This pattern cost us thousands in unnecessary API calls and made conversations feel robotic and repetitive. Users noticed. Our CSAT scores dropped 12 points in the first month. The Hidden Cost: Poor Tool Call Consistency didn't just annoy users — it tripled our operational costs. We were making 3x the necessary API calls because the model kept forgetting it had already fetched information. Why Benchmarks Get This So Wrong After six months of production data, I finally understood why benchmark scores are fundamentally misleading for production LLM selection. It's not that benchmarks are useless — they measure something. It's that what they measure has almost no correlation with production success. Here's the brutal truth: benchmarks are designed to be passable by models, not to predict real-world failure modes. They test atomic capabilities (can you answer this question correctly?) rather than emergent behaviors (can you maintain context coherence across 15 turns while managing three concurrent tool calls?). Think about it like this: a driving test measures whether you can parallel park and use turn signals. It doesn't measure whether you'll stay calm when your GPS fails during rush hour in an unfamiliar city while your kids are screaming in the back seat. The atomic skills matter, but the emergent behavior under stress is what actually determines success. Our production data showed effectively zero correlation (R² = 0.12, p > 0.05) between benchmark scores and production success metrics. A model scoring 92% on benchmarks wasn't more likely to succeed in production than one scoring 89%. But a model with an MTTWB of 12 turns was 3.4x more likely to succeed than one with an MTTWB of 4 turns (R² = 0.87, p < 0.001). The Selection Framework Nobody Uses (But Should) Here's what I wish someone had told me before we deployed our first production LLM: ignore the benchmarks until you've measured what actually matters. We ended up developing a three-phase testing process that predicted production success with scary accuracy: Phase 1: Stress Test Multi-Turn Conversations (Week 1-2) Run 1,000+ synthetic conversations of 15+ turns eachDeliberately introduce context complexity (multiple topics, user corrections, tangents)Measure MTTWB, context rot rate, and tool call consistencyModels that can't survive this don't make it to phase 2 Phase 2: Shadow Production Traffic (Week 3-4) Run candidates in parallel with current production systemCompare outputs but don't serve to users yetLook for edge cases, unexpected failures, and cost patternsThis is where GPT-4 revealed its context management issues Phase 3: Limited Production Rollout (Week 5-6) 5% of traffic to new model, 95% to existingMeasure CSAT, completion rates, escalation ratesWatch for issues that only appear with real user behaviorClaude 2.1 passed this with flying colors; GPT-4 did not Total time investment: 6 weeks. Money saved by not deploying the wrong model: approximately $180,000 in unnecessary API calls and context resets, plus another $250,000 in lost customer satisfaction and support escalations. The Bottom Line: We spent three months on benchmark-based selection and chose the wrong model. We spent six weeks on production-realistic testing and chose the right one. The correlation? Perfect. What This Means For Your Selection Process If you're choosing an LLM right now based on benchmark scores, stop. Just stop. You're optimizing for the wrong thing. It's like choosing a car based solely on its top speed when you're going to use it for daily commuting in city traffic. Here's what you should do instead: Define your conversation patterns first: Average conversation length? Context complexity? Tool usage patterns?Measure what matters: MTTWB, CRR, and TCC for your specific use caseTest in production-like conditions: Synthetic conversations with realistic complexityShadow test before committing: Run candidates against real traffic before going liveMonitor continuously: Production behavior changes; your metrics should too In Part 2, I'll walk through comprehensive testing framework for detecting the six critical failure patterns that destroy production LLM systems. You might be surprised by what you find. For now, if you take away one thing from this article, let it be this: a 92% benchmark score tells you the model passed a test. An MTTWB of 4.7 turns tells you it's going to fail in production. Trust the metric that predicts actual failure, not the one that measures artificial success.

By Dinesh Elumalai

CORE

The Pod Prometheus Never Saw: Kubernetes' Sampling Blind Spot

The Fix That Doesn't Fix It Reducing your Prometheus scrape interval from 15 seconds to 5 seconds does not fix the sampling blind spot. It moves it. Any pod whose entire lifetime falls within one 5-second scrape gap is still structurally invisible — not because of misconfiguration, not because of missing rules, but because poll-based collection has an irreducible sampling gap that no interval setting eliminates. This article explains exactly why that is, what it costs in production, and what actually fixes it. What Is the H5 Evidence Horizon? Kubernetes evidence horizons are deterministic points after which specific diagnostic context becomes permanently unrecoverable. H5 — the scrape-interval sampling blind spot — is the only horizon that prevents observability data from being created in the first place. Unlike H1 (LastTerminationState rotation at ~90 seconds) or H2 (scheduler event pruning at 1 hour), H5 has no timer and no API call. It fires silently for every pod whose entire lifetime falls within one Prometheus scrape gap. The full evidence horizon taxonomy is documented at opscart.com/kubernetes-evidence-horizons-h2-h3-h4-h5/. Why Poll-Based Observability Has an Irreducible Blind Spot Prometheus collects metrics by sending HTTP requests to targets at a fixed interval. The default scrape interval in kube-prometheus-stack is 15 seconds. Every 15 seconds, Prometheus asks the world: "What is your current state?" This model works exceptionally well for persistent, long-running workloads. A deployment that has been running for hours will be scraped hundreds of times. Its CPU trends, memory patterns, and request rates are captured with high fidelity. It fails completely for ephemeral workloads — and Kubernetes generates ephemeral workloads by design. The math is straightforward. Given a scrape interval S and a pod lifetime L: If L > S: the pod will be scraped at least once, generating at least one data pointIf L < S: the pod may generate zero data points — not because of any failure in Prometheus, but because it never existed between two consecutive scrape cycles This is not a probability statement. It is deterministic. A pod with a 6-second lifetime and a 15-second scrape interval will generate exactly zero Prometheus data points if its entire lifetime falls within one scrape gap. There is no configuration change that fixes this for that specific pod in that specific gap. The only way to eliminate the blind spot entirely is to move from a poll-based model to an event-driven model. And this is precisely the architectural distinction that most observability discussions miss. The Ghost Pod Experiment To validate this claim empirically, I ran a controlled experiment on a 3-node Minikube cluster (Kubernetes 1.31, Apple M-series hardware). Setup: Pod memory limit: 64MiPod memory allocation: 128Mi (guaranteed OOMKill)Prometheus scrape interval: 15s (kube-prometheus-stack default)Pod name: ghost-pod, namespace: oma-sampling What happened: The pod started, allocated memory beyond its limit, and was OOMKilled by the kernel at T+5s. Total observed pod lifetime: 6 seconds. Prometheus result: SQL # Query executed the morning after the experiment $ promql: container_cpu_usage_seconds_total{pod="ghost-pod"} {} # empty — 0 data points $ promql: kube_pod_container_status_last_terminated_reason{pod="ghost-pod"} {} # empty — 0 data points $ kubectl get pod ghost-pod -n oma-sampling Error from server (NotFound): pods "ghost-pod" not found Zero data points. No alert. No record. From Prometheus's perspective, ghost-pod never existed. Event-driven result: An OMA (Operational Memory Architecture) collector subscribed to the Kubernetes watch API captured the following at the moment of occurrence: SQL OOMKill P001 captured at T+5s pod: ghost-pod namespace: oma-sampling exit_code: 137 memory_limit: 64Mi node: opscart-m03 timestamp: 2026-04-18T23:38:06Z The causal evidence — exit code, resource limits, node placement — captured at occurrence. No scrape gap. No sampling window. The watch API delivers every pod state transition at the moment it fires, regardless of timing. Poll-based vs event-driven architecture: a pod with a 6-second lifetime falls entirely within one 15-second Prometheus scrape gap, generating zero data points. An event-driven collector subscribed to the Kubernetes watch API captures the OOMKill at occurrence — no sampling gap exists by architecture. "Just Reduce the Scrape Interval" This is the most common response when engineers first encounter the H5 blind spot. It deserves a direct answer. Reducing the scrape interval from 15s to 5s does not eliminate the blind spot. It shifts the threshold from 15 seconds to 5 seconds. Any pod whose lifetime falls within one 5-second scrape gap is still structurally invisible. Consider the real-world distributions: CrashLoopBackOff with OOMKill on startup: A pod that allocates memory before its first checkpoint can OOMKill in under 1 second. No scrape interval short of continuous polling catches this. Init container failures: Init containers that fail immediately may have lifetimes measured in milliseconds. These are architecturally invisible to any poll-based system, regardless of scrape interval. Batch job bursts: Short-lived Job pods in a batch processing cluster can complete their entire lifecycle — start, run, succeed, or fail — within a single scrape gap at any reasonable interval. Reducing the scrape interval also has real costs: Storage: Prometheus metric storage grows proportionally with scrape frequency. Moving from 15s to 5s triples your time-series storage requirements.Cardinality: More frequent scrapes of high-cardinality metrics (per-pod, per-container) increase label cardinality and query latency.Target load: Every scrape is an HTTP request to your metrics endpoints. High scrape frequencies create measurable load on instrumented services. You are paying a real cost to shift the threshold — not to eliminate it. For workloads with sub-second or sub-5-second lifetimes, no scrape interval is fast enough. Why the Watch API Is Structurally Different The Kubernetes watch API is not a faster poll. It is a fundamentally different delivery mechanism. When you run kubectl get pods --watch, you are not asking Kubernetes "what is the current pod state every N seconds." You are opening a long-lived HTTP connection to the API server and subscribing to a stream of state change events. Every time a pod transitions — from Pending to Running, from Running to Terminated, from any state to OOMKilled — the API server pushes that transition to every active watcher. The delivery is at-occurrence. There is no polling interval. There is no sampling gap. If a pod OOMKills at T=17.3 seconds, the watch API delivers that event at T=17.3 seconds — not at the next scrape boundary. This means the H5 blind spot does not exist for event-driven collectors by architecture. A pod with a 6-second lifetime generates exactly one OOMKill transition event. That event is delivered to every watcher at the moment it fires. The watcher captures it. Done. The practical implication: event-driven collection provides complete coverage of pod lifecycle events regardless of pod lifetime, without any configuration tuning. What Sampling Blind-Spot Costs in Production The blind spot has three concrete operational consequences. Undetected crash loops. A pod in CrashLoopBackOff with a very short failure cycle can OOMKill dozens of times per hour without generating a single Prometheus alert. The restart counter increments in kubectl get pods output, but if nobody is looking at that specific pod, the pattern goes undetected. By the time an engineer investigates, the pod may have crashed hundreds of times with no metric record of any individual failure. Incomplete capacity planning. Short-lived batch pods that OOMKill during processing spikes are invisible to Prometheus-based capacity analysis. Your memory utilization reports show only long-running pods. The actual peak memory demand — which caused the batch pod OOMKills — never appears in your capacity data. Silent compliance gaps. In pharmaceutical and financial production environments with audit requirements, unrecorded container failures are a compliance problem. An auditor asking "what failed in this namespace between 2 AM and 4 AM on this date" deserves a complete answer. A Prometheus query that returns empty results for pods that actually OOMKilled is not a complete answer. The Structural Fix The H5 blind spot cannot be patched within a poll-based architecture. The fix is additive: complement Prometheus with an event-driven collector that subscribes to the Kubernetes watch API. This does not mean replacing Prometheus. Prometheus remains the right tool for what it does — metric aggregation, trend analysis, alerting on long-running workloads. The event-driven collector handles what Prometheus cannot: discrete lifecycle events for pods of any duration. The implementation I've validated uses a Go-based collector subscribing to CoreV1().Pods(namespace).Watch(). On each Modified event, the collector inspects ContainerStatus for OOMKill signals and captures the full forensic context synchronously — before the pod restarts and overwrites LastTerminationState. Go // Simplified watch loop watcher, _ := clientset.CoreV1().Pods(namespace).Watch( ctx, metav1.ListOptions{}) for event := range watcher.ResultChan() { pod := event.Object.(*corev1.Pod) for _, cs := range pod.Status.ContainerStatuses { if cs.LastTerminationState.Terminated != nil { reason := cs.LastTerminationState.Terminated.Reason if reason == "OOMKilled" { captureOOMKillEvidence(pod, cs) } } } The watch API delivers the event at occurrence. The capture is synchronous. No polling gap. No sampling threshold. Ghost pods are no longer invisible. Full implementation with reproducible Minikube scenarios is at github.com/opscart/k8s-causal-memory. H5 in Context: The Evidence Horizon Taxonomy H5 is one of five evidence destruction mechanisms I've identified and formalized as an evidence horizon taxonomy. The full taxonomy: HorizonTriggerWhat's lostH1Pod restart (~90s)OOMKill forensics, limits, ConfigMapsH2Event TTL (1hr/1000)Scheduler placement rationaleH3Debug session exitkubectl debug exit code, durationH4Kubelet restartIn-memory operational stateH5Scrape intervalSub-interval pod lifetimes H5 is unique in the taxonomy: H1 through H4 destroy the Kubernetes API state that previously existed. The scrape-interval blind spot prevents observability data from being created in the first place. It is the only horizon that requires no destruction event — the evidence simply never reaches any persistent store. The full taxonomy with empirical validation across Minikube and AKS 1.32.10 is documented in the canonical OpsCart article: Beyond the 90-Second Gap and in the research preprint at Zenodo DOI: 10.5281/zenodo.19685352. Conclusion The H5 blind spot is not a Prometheus bug. It is not a configuration problem. It is an irreducible consequence of poll-based collection applied to a platform that generates arbitrarily short-lived workloads. Kubernetes is designed to self-heal faster than humans can observe. A pod that OOMKills in 6 seconds and restarts in 2 is working exactly as designed. Prometheus, also working exactly as designed, sees nothing. The architectural answer is equally straightforward: subscribe to the Kubernetes watch API. Receive events at occurrence. No scrape interval. No sampling gap. No ghost pods. Every pod that crashes in your cluster deserves a record. The watch API ensures it gets one. Resources: github.com/opscart/k8s-causal-memory — open-source implementation with reproducible H5 scenarioBeyond the 90-Second Gap — full evidence horizon taxonomy (OpsCart canonical)Research preprint — 30-run statistical analysis, AKS 1.32.10 validation

By Shamsher Khan

CORE

AWS Bedrock: The Future of Enterprise AI

Generative AI has moved from experimental prototypes to production‑grade systems in a remarkably short time. Yet for most engineering teams, the challenge isn’t building a model — it’s deploying AI responsibly inside an enterprise environment. Issues like data privacy, model governance, cost control, and integration with existing systems often overshadow the excitement of large language models. AWS Bedrock is Amazon’s answer to this problem. Rather than offering a single model or framework, Bedrock provides a managed platform where enterprises can access multiple foundation models, build retrieval‑augmented generation (RAG) pipelines, orchestrate agents, and deploy AI features without exposing sensitive data or managing infrastructure. In many ways, Bedrock represents a shift in how organizations will adopt AI over the next decade. This article explores why Bedrock is gaining momentum, how it fits into modern architectures, and why it has the potential to become the backbone of enterprise AI. 1. A Unified Platform for Foundation Models One of Bedrock’s most compelling features is its multi‑model strategy. Instead of locking developers into a single model family, Bedrock provides access to models from: Amazon (Titan)Anthropic (Claude)Meta (Llama)Cohere (Command)Stability AI (Stable Diffusion)Mistral AI (Mistral, Mixtral) This model‑agnostic approach matters because no single model is best for every workload. Enterprises often need: A reasoning‑heavy model for agentsA compact model for low‑latency tasksA vision‑capable model for document processingA multilingual model for global applications Bedrock abstracts away the complexity of switching models, allowing teams to upgrade or experiment without rewriting pipelines. 2. Enterprise‑Grade Security and Data Isolation Most organizations hesitate to adopt generative AI because of data privacy concerns. Bedrock addresses this directly: Customer data is not used to train foundation modelsAll traffic can be restricted to private VPC endpointsKMS encryption protects data in transit and at restCloudTrail provides full auditabilityIAM policies control access at a granular level For regulated industries — finance, healthcare, insurance, government — these guarantees are essential. Bedrock’s security posture is one of the main reasons enterprises are adopting it faster than open‑source or public API alternatives. 3. Retrieval‑Augmented Generation (RAG) as a First‑Class Citizen Most enterprise AI applications rely on RAG rather than fine‑tuning. Bedrock integrates tightly with: Amazon OpenSearchAmazon AuroraAmazon DynamoDBAmazon S3Amazon Kendra Developers can build RAG pipelines using Bedrock’s built‑in Knowledge Bases, which handle: Document ingestionChunkingEmbedding generationVector storageRetrieval orchestration This reduces the complexity of building production‑grade RAG systems, which traditionally require stitching together multiple open‑source components. 4. Bedrock Agents: The Next Step in Automation Agents are one of Bedrock’s most innovative features. They allow developers to create autonomous workflows powered by LLMs that can: Call APIsExecute business logicRetrieve data from enterprise systemsMaintain context across stepsHandle multi‑turn interactions Instead of writing custom orchestration code, developers define: The agent’s instructionsThe tools it can useThe data sources it can access Bedrock handles the reasoning, planning, and execution. 5. Integration With Existing AWS Ecosystems Bedrock fits naturally into the AWS stack. It integrates with: LambdaStep FunctionsAPI GatewaySageMakerCloudWatchIAM This makes Bedrock a drop‑in component for existing architectures rather than a standalone system. 6. Cost Control and Predictable Pricing Bedrock addresses cost concerns through: Token‑based pricingProvisioned throughput for predictable workloadsModel‑specific cost tiersNo GPU management Teams can scale usage without worrying about GPU clusters or autoscaling. 7. Architecture Diagrams (Text Descriptions) High‑Level Bedrock Architecture Text Description: A three‑layer diagram: 1. Client Layer Web appMobile appInternal tools 2. Application Layer API GatewayLambdaStep FunctionsBedrock Agents 3. Data & AI Layer Bedrock Foundation ModelsKnowledge Bases (OpenSearch / DynamoDB)S3 Data LakeCloudWatch Logging Arrows show requests flowing from client → API Gateway → Lambda → Bedrock → Knowledge Base → back to client. RAG Pipeline on AWS Text Description: A left‑to‑right flow: S3 Bucket (raw documents)Knowledge Base (chunking + embeddings)Vector Store (OpenSearch or DynamoDB)RetrieverBedrock Model (Claude / Titan)Response to Application Bedrock Agent Workflow Text Description: A loop diagram: User Query →Bedrock Agent →Tool Invocation (Lambda / API) →External System →Response →Agent Reasoning →Final Answer 8. Code Examples Below are realistic examples you can include. Example 1: Calling Bedrock From AWS Lambda (Python) Python import boto3 import json client = boto3.client("bedrock-runtime") def lambda_handler(event, context): prompt = event.get("prompt", "Hello from Lambda!") response = client.invoke_model( modelId="anthropic.claude-3-sonnet", body=json.dumps({ "messages": [{"role": "user", "content": prompt}], "max_tokens": 300 }) ) result = json.loads(response["body"].read()) return {"answer": result["content"][0]["text"]} Example 2: Simple RAG Query Using Bedrock + OpenSearch Python from opensearchpy import OpenSearch import boto3 import json bedrock = boto3.client("bedrock-runtime") os_client = OpenSearch(hosts=["https://my-domain"]) def rag_query(question): # 1. Retrieve relevant chunks results = os_client.search( index="kb-index", body={"query": {"match": {"text": question}} ) context = "\n".join([hit["_source"]["text"] for hit in results["hits"]["hits"]]) # 2. Send to Bedrock response = bedrock.invoke_model( modelId="anthropic.claude-3-sonnet", body=json.dumps({ "messages": [ {"role": "system", "content": "Use the provided context."}, {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"} ], "max_tokens": 300 }) ) return json.loads(response["body"].read())["content"][0]["text"] Example 3: Bedrock Agent Tool Definition (JSON) JSON { "agentName": "OrderAssistant", "instruction": "Help users check order status.", "tools": [ { "toolName": "OrderAPI", "description": "Fetch order details", "schema": { "type": "object", "properties": { "orderId": { "type": "string" } }, "required": ["orderId"] } } ] } Example 4: Lambda Tool for Bedrock Agent Python def lambda_handler(event, context): order_id = event["orderId"] # Simulated lookup return { "orderId": order_id, "status": "Shipped", "expectedDelivery": "2026-01-10" } Conclusion AWS Bedrock is more than a model hosting service — it’s a strategic platform designed for the realities of enterprise AI. By combining security, multi‑model flexibility, RAG tooling, agent orchestration, and deep AWS integration, Bedrock gives engineering teams a practical path to building AI‑powered applications without compromising governance or maintainability. As organizations move from prototypes to production, Bedrock is positioned to become one of the most important components in the enterprise AI stack. Its design reflects a simple truth: the future of AI isn’t just about models — it’s about building systems that enterprises can trust.

By Subrahmanyam Katta

What AI Systems Taught Us About the Limits of Chaos Engineering

In the early days of Chaos Monkey, breaking things at random was almost a badge of honor. Kill a service. Drop a node. Add latency. Watch what happens. That model made sense when most systems were relatively deterministic, and the primary question was simple: Will the application survive if a component disappears? But AI infrastructure has changed the problem. In environments built on LLM pipelines, vector stores, retrieval systems, inference gateways, and automated control loops, random failure injection is no longer enough. In some cases, it is not even the right test. Breaking a node is easy. Breaking a system’s ability to preserve its intended behavior under stress is much harder and much more relevant. That is why chaos engineering needs a new layer: intent. As AI systems become more autonomous, resilience can no longer be measured only by uptime. We also need to know whether the system continues to behave correctly when critical assumptions fail. That requires moving from random chaos to intent-based chaos engineering: a methodology where architects define what “healthy” means, then deliberately challenge the system’s ability to maintain that state under realistic failure conditions. The difference is simple. Random chaos asks, “What breaks if I inject failure?” Intent-based chaos asks, “Can this system still preserve the outcome it was designed to deliver?” That shift matters more in AI infrastructure than almost anywhere else. The Problem With Random Chaos in AI Systems Traditional chaos experiments are infrastructure-centric. Engineers kill pods, introduce network loss, or terminate processes to verify that failover mechanisms work. These are useful tests, but they often miss the kinds of failures that matter most in AI-heavy systems. A generative AI stack can remain “up” while still being operationally broken. A retrieval layer might respond within SLA, yet return a degraded context. A model gateway may remain available while silently increasing hallucination risk because upstream embeddings have drifted. An inference service may autoscale correctly while downstream rate limiting causes user-facing timeouts. None of these show up cleanly in the old chaos model. In AI-driven infrastructure, the most dangerous failures are often not binary. They are semantic, degradational, and behavioral. This is where intent becomes essential. If the purpose of a retrieval pipeline is to preserve context relevance under load, then resilience testing should validate that outcome. If the purpose of an AI operations system is to maintain stable incident triage during telemetry spikes, then chaos experiments should target that objective — not just randomly break a component and hope the results are meaningful. Defining the Intent Layer Intent is the operational expression of business logic. It translates human expectations into machine-verifiable conditions. For a distributed AI service, intent might look like this: Retrieval latency must remain below 300msContext recall must stay above an acceptable thresholdInference failover must not degrade policy enforcementCritical monitoring signals must remain explainable during incident conditions This matters because AI systems are rarely judged only by infrastructure availability. They are judged by whether they preserve correctness, quality, and trustworthiness under stress. Intent-based chaos engineering starts by making those expectations explicit. Instead of saying, “Let’s kill 20% of the cluster,” the question becomes: What system behavior are we trying to preserve?Which conditions threaten that behavior?How do we validate whether the system remained aligned to intent? That makes the experiment far more useful, especially in production-adjacent environments where blind failure injection can create more noise than insight. From State to Intent Most observability systems are good at reporting the state. They can tell you CPU usage, request latency, pod restarts, error counts, queue depth, or database saturation. What they often cannot tell you directly is whether the system is still fulfilling its intended purpose. Intent-based chaos requires a feedback loop between state and intent. A simplified view looks like this: Plain Text [Business Objective] | v [Intent Specification] | v [Observed System State] ---> [State vs. Intent Evaluation] | | | v | [Intent Preserved?] | / \ | Yes No | / \ v v v [Continue Operations] [Record Stability] [Trigger Remediation] This model changes the role of chaos engineering. Instead of being a destructive test harness, it becomes a controlled system for measuring whether the platform can keep delivering the outcomes the business actually depends on. Predictive Stress Injection, Not Random Breakage The next step is stress injection. In a traditional chaos framework, the experiment might be: Terminate a service instanceIntroduce packet lossDegrade a dependencyCreate a network partition In intent-based chaos, the experiment is chosen because it challenges a known operational dependency tied to the target behavior. For example, in an AI retrieval system, you may not care whether a single shard fails in isolation. You care whether shard degradation causes context recall to fall below an acceptable level during peak load. That is a more meaningful experiment. This is also where AI becomes useful. Telemetry and incident history can reveal recurring system patterns: Vector index imbalance before latency spikesCache churn before retrieval degradationRetry storms after inference gateway saturationObservability blind spots during backpressure events Instead of injecting arbitrary failure, engineers can simulate the stress signatures that actually precede operational instability. That is a very different kind of chaos engineering — one grounded in observed behavior rather than randomness. Intent Logic in Practice At a high level, the logic looks like this: YAML INTENT_SPEC: "Vector_Search_Reliability" EXPECTED_BEHAVIOR: latency_p99: < 400ms context_recall: > 0.92 CHAOS_EXPERIMENT: "Index_Partition_Failure" INJECTION: Drop 30% of Index_Shards INTENT_VALIDATION: IF context_recall < 0.80: TRIGGER: "Autonomous_Index_Rebuild" STATUS: "Intent_Preserved" ELSE: STATUS: "System_Fragile" The important thing here is not the syntax. It is the shift in philosophy. The experiment is not evaluating whether the infrastructure stayed alive. It is evaluating whether the system continued to preserve the outcome it was designed to protect. That is the level at which AI systems need to be tested. Autonomous Remediation Needs a North Star Intent also makes autonomous remediation more reliable. In many modern platforms, remediation is already automated to some degree. Systems restart services, scale resources, fail over traffic, or reroute requests when predefined thresholds are crossed. But automated recovery is only as good as the logic guiding it. Without intent, remediation is reactive. It responds to symptoms. With intent, remediation becomes directional. It knows what outcome it is trying to preserve. This is especially important in AI-driven infrastructure, where the “correct” response is not always obvious. If a retrieval system degrades, should the platform rebuild an index, switch to a fallback store, reduce concurrency, or tighten context filters? The answer depends on the operational intent of the service. Intent becomes the system’s North Star. That is what makes self-healing architecture more than just automation. It gives the platform a decision framework. Why This Is Safer for Production One of the biggest objections to chaos engineering in enterprise settings is safety. That concern is fair. Random failure injection in production can be hard to justify, especially in systems that support regulated workloads, customer-facing AI experiences, or security-sensitive operations. Intent-based chaos is safer because it is narrower and more accountable. It does not ask teams to break things blindly. It asks them to define acceptable operating boundaries, simulate realistic threats to those boundaries, and verify whether the platform can recover without violating core expectations. In that sense, intent-based chaos is closer to structured resilience validation than traditional disruption testing. It is a more mature model for environments where uptime alone is no longer the right measure of health. The Next Stage of Chaos Engineering Chaos engineering was originally about teaching distributed systems to survive failure. That mission has not changed. What has changed is the nature of the systems. AI infrastructure is adaptive, stateful, and deeply dependent on the quality of its intermediate behaviors. If we continue to test it with purely random failure models, we will miss the failures that matter most. The future of resilience engineering is not just about causing disruption. It is about preserving intent. That means defining what good behavior looks like, identifying the realistic stressors that threaten it, and building platforms that can detect, validate, and recover against those conditions automatically. Random chaos was a useful first chapter. For AI-driven infrastructure, the next chapter is intentional resilience.

By Sayali Patil

Demystifying Intelligent Integration: AI and ML in Hybrid Clouds

The article explores the transformative impact of AI and ML in hybrid cloud environments, challenging traditional cloud solutions. Key topics include the role of edge AI in industries like manufacturing and autonomous vehicles, the innovative use of federated learning to address data sovereignty, and the cross-industry potential of AI-driven integration, particularly in agriculture. It highlights the importance of explainable AI for transparency and compliance, especially in highly regulated sectors like healthcare. The author shares personal insights on integration challenges and the effectiveness of tools like Kubernetes and Docker, while also looking at future prospects with quantum computing and 5G. A Personal Journey into the Clouds Three years ago, while sipping chai in Kolkata, I was deep in thought about the limitations we faced with traditional cloud solutions. The realization hit me — the future does not lie in conventional cloud setups but in the dynamic and flexible world of hybrid clouds, powered by AI and ML. My journey in this domain, particularly with Mulesoft and Anypoint Platform, has been illuminating, full of challenges, and yes, quite a few late-night debugging sessions. Today, as an Associate Consultant deeply entrenched in the intricacies of hybrid cloud environments, I'm excited to share how AI and ML are not just buzzwords but catalysts for revolutionary change. 1. Edge AI: Bringing Intelligence to the Periphery I remember at a client meeting, we discussed integrating edge AI to enhance a manufacturing unit’s operations. Processing data closer to the source — at the edge — not only reduced latency but significantly boosted real-time decision-making. The manufacturing sector isn’t the only playground for this; autonomous vehicles, with their demand for immediate data processing, are also key beneficiaries. Imagine an autonomous car, miles away from a central server, decidin' the best route on-the-fly using real-time traffic data. Edge AI enables such scenarios by decentralizin' the data processing power, a trend I've observed increasingly during my time with Farmers Insurance. 2. A Contrarian Take on Data Sovereignty During a project involving a healthcare application, I was on the front lines of navigating data residency laws. Conventional wisdom preaches strict data localization — keepin' data within national borders. However, I've found flexibility through federated learning. By anonymizing datasets and distributing learning tasks, we maintained compliance while pushin' boundaries in innovation. This approach, although occasionally questioned, provided insights that traditional data handling could not, particularly in sensitive sectors like finance. 3. AI-Driven Integration: Beyond IT into Agri-Tech Agriculture might seem worlds apart from the tech world, but AI integration in hybrid clouds is closing that gap at an astonishing pace. I recall a pilot project where predictive models, fueled by AI, transformed supply chain efficiency for crop yields. We leveraged historical data and real-time environmental inputs to forecast supply needs, thus reducing waste and enhancing productivity. This cross-industry application emphasized to me the versatility of AI-driven integration, extending far beyond just software domains. 4. XAI: The Transparent Cloud In one of the more challenging phases of my projects, I confronted a client's demand for transparency in AI-driven decisions. Explainable AI (XAI) came to our rescue. Integrating XAI into hybrid cloud environments demystifies AI’s decision-making process, providing not just answers but explanations. In healthcare, where every decision can be life-altering, this transparency is not just beneficial but essential. Our deployment with XAI ensured compliance and built trust — a key takeaway for any regulated industry. 5. Navigating the Current Market Dynamics Let's be real: integrating AI/ML with hybrid clouds isn't a walk in the park. Many organizations face integration challenges, from disparate data formats to latency woes. I’ve often found myself in meetings where the main concern was ensuring seamless data flow between on-prem and cloud resources. Tools like Kubernetes and Docker have been invaluable, facilitating container orchestration that streamlines AI model deployment, despite these hurdles. My advice? Start small, pilot your integrations before scaling up — a lesson learned from a complex integration scenario with a major insurance provider. 6. Future-Proofing with Quantum Computing and 5G As if AI and ML weren't exciting enough, quantum computing and 5G are set to propel hybrid cloud capabilities to new heights. The idea of utilizing real-time language translation or predictive maintenance within IoT ecosystems isn't just science fiction — it's right around the corner. I’ve dabbled a bit with quantum concepts, and though the learning curve is steep, the potential to disrupt traditional models and create new market leaders is immense. Concrete Examples and Case Studies One standout project involved integrating AI models to optimize a logistics network. The challenge was ensuring consistent performance across both on-premises and cloud environments. Despite initial hiccups with data latency and format mismatches, using the Mulesoft Anypoint Platform, we created a unified, seamless system. This integration not only boosted operational efficiency but also significantly reduced costs — a win-win! Personal Insights and Lessons Learned Navigating these waters, my most significant realization is that technology alone isn’t a panacea. It's about strategy, understanding client needs, and knowing when to pivot. Adopting a contrarian view on data residency, for example, opened doors once considered locked. In this ever-evolving landscape, being adaptable is key. Actionable Takeaways Embrace Federated Learning: It’s a game-changer for data sovereignty concerns.Start with XAI: Build trust by allowing stakeholders to see the decision logic.Pilot with Edge AI: Especially in sectors needing real-time processing, like automotive or healthcare.Stay Ahead with Quantum Computing: Begin understanding its implications for future integrations. Conclusion: Architecting the Future-Ready Systems As we architect future-ready systems, blending AI and ML with hybrid cloud environments, the key is to remain curious and open to learning. My stints with various projects, from insurance giants to a farmer's forecast, reinforce the fact that the future is hybrid — and intelligent. While challenges abound, the rewards are manifold for those willing to embrace this dynamic landscape with a little bit of grit and a whole lot of innovation.

By Abhijit Roy

The DevOps Security Paradox: Why Faster Delivery Often Creates More Risk

A few years ago, I was part of a large enterprise transformation program where the leadership team proudly announced that they had successfully implemented DevOps across hundreds of applications. Deployments were faster.Release cycles dropped from months to days.Developers were happy. But within six months, the security team discovered something alarming. Misconfigured cloud storage.Exposed internal APIs.Containers running with root privileges.Unpatched base images being deployed daily. Ironically, the same DevOps practices that accelerated innovation had also accelerated risk. This is the DevOps Security Paradox. The faster organizations move, the easier it becomes for security gaps to slip into production. The Velocity vs Security Conflict Traditional software delivery worked like a relay race. Developers wrote the code. Operations deployed it. Security reviewed it near the end. DevOps changed that model entirely. Instead of a relay race, delivery became a high-speed continuous conveyor belt. Code moves through: Source controlCI pipelinesContainer buildsInfrastructure provisioningProduction deployment Sometimes this entire journey happens in minutes. The problem is that security processes did not evolve at the same speed. Many organizations still rely on: Manual reviewsSecurity gates late in the pipelinePeriodic compliance audits By the time issues are discovered, the code is already running in production. The Hidden Security Gaps in Modern DevOps In my experience working with cloud and DevOps teams, most security issues come from a few recurring patterns. 1. Infrastructure as Code Without Guardrails Infrastructure as Code (IaC) is powerful. Teams can provision entire environments with a few lines of code. But this also means developers can accidentally deploy insecure infrastructure at scale. Common issues include: Public S3 bucketsSecurity groups open to the internetDatabases without encryptionMissing network segmentation Because IaC is automated, one mistake can replicate across hundreds of environments instantly. 2. Container Security Is Often Ignored Containers made application packaging simple, but they also introduced new attack surfaces. Many container images in production today still include: Outdated base imagesHundreds of unnecessary packagesCritical vulnerabilities Developers often pull images from public registries without verification. A single vulnerable dependency can quietly introduce risk into the entire platform. 3. CI/CD Pipelines Become a Security Blind Spot CI/CD pipelines now have enormous power. They can: Access source codeBuild artifactsPush imagesDeploy to productionAccess cloud credentials Yet pipelines are rarely treated as high-value targets. Common risks include: Hardcoded secretsOver-privileged IAM rolesLack of pipeline integrity verificationUntrusted third-party actions A compromised pipeline can become the fastest route to compromise production systems. 4. Identity and Access Sprawl Cloud environments grow quickly. What starts with a few roles and service accounts soon becomes hundreds. Without strong identity governance, teams end up with: Overly permissive IAM rolesLong-lived credentialsUnused service accountsCross-account trust misconfigurations Identity is now the primary attack vector in cloud environments, yet it remains one of the least governed areas. Why Security Teams Struggle to Keep Up The reality is that most security teams were never designed for the pace of DevOps. Traditional security approaches rely heavily on: Ticket-based reviewsStatic compliance checklistsQuarterly audits But modern cloud environments change daily. A Kubernetes cluster may create or destroy hundreds of resources every hour. Manual reviews simply cannot scale. Security must evolve from manual inspection to automated enforcement. The DevSecOps Shift The solution is not slowing down DevOps. The solution is making security move at the same speed as DevOps. This is where DevSecOps becomes critical. Instead of adding security at the end, it becomes embedded throughout the delivery lifecycle. Key practices include: Policy as Code Security rules should be enforced automatically. Tools like Open Policy Agent or Kyverno allow teams to define policies such as: Containers cannot run as rootRequired resource limits must be definedPublic cloud resources must be restrictedEncryption must be enabled These policies run automatically during CI pipelines or Kubernetes deployments. Automated Security Scanning Every pipeline should automatically scan for: Container vulnerabilitiesIaC misconfigurationsDependency risksSecret leaks Developers receive immediate feedback before code reaches production. Secure CI/CD Design CI pipelines themselves must follow security best practices: Short-lived credentialsIsolated runnersSigned artifactsVerified dependencies Pipelines should be treated as critical infrastructure, not just build tools. Continuous Cloud Posture Monitoring Even with preventive controls, misconfigurations still happen. Continuous monitoring tools help detect issues such as: Public resourcesIAM privilege escalation risksCompliance violationsDrift from security baselines Security becomes an ongoing process rather than a periodic audit. Culture Matters More Than Tools One of the biggest lessons I’ve learned after two decades in the industry is this: Security failures rarely happen because tools are missing.They happen because security is treated as someone else's responsibility.When developers view security as a blocker, they find ways to bypass it. But when security is built into the developer workflow, it becomes part of normal engineering. Successful DevSecOps cultures usually follow three principles: Security feedback must be immediateSecurity controls must be automatedSecurity must empower developers, not slow them down The Future of Secure DevOps Over the next few years, we will see security becoming deeply integrated into engineering platforms. Some trends are already emerging: Secure Software Supply ChainsSigned container artifactsZero Trust cloud architecturesPolicy-driven infrastructureAI-assisted security detection Organizations that succeed will not treat security as a checkpoint. They will treat it as an automated system woven into the fabric of their delivery platforms. Final Thoughts DevOps changed how we build and deliver software. But it also changed how attackers find opportunities. Speed without security creates fragile systems. The organizations that thrive will be those that learn to balance velocity with resilience. DevOps helped us move faster. DevSecOps ensures we move fast without breaking trust. Stay Connected If you found this article useful and want more insights on Cloud, DevOps, and Security engineering, feel free to follow and connect.

By Jaswinder Kumar

Seeing the Whole System: Why OpenTelemetry Is Ending the Era of Fragmented Visibility

The incident had been running for forty-seven minutes when I watched the on-call engineer open his sixth browser tab. Grafana for the infrastructure metrics. Splunk for the application logs. A separate Jaeger instance — legacy, running on a server that was itself poorly monitored — for traces from the API layer. A custom dashboard someone had built in Kibana eighteen months earlier for the payment service, which used a different logging format than everything else. And a Datadog trial that a team had spun up six weeks prior for a new microservice, not yet integrated with anything. He wasn't incompetent. He was experienced, methodical, and clearly doing his best under pressure. The problem was that the answer — a cascade that had started when a downstream dependency began timing out under load, causing queue depth to grow on a service that nobody had instrumented with queue metrics — was distributed across four systems that had no awareness of each other. He had to hold the context in his head. Manually. While an incident was live. They found the root cause at minute sixty-one. The customer-facing impact had lasted forty-four of those minutes. The postmortem identified the observability fragmentation as a contributing factor, listed it under "areas for improvement," and moved on to the next agenda item. I've watched variations of that scene in a half-dozen organizations over the past two years. The tooling changes. The services change. The outcome — an engineer assembling context manually from disconnected systems while something is actively broken — remains depressingly consistent. The Silo Problem Nobody Talks About Honestly Here is the honest history of how most engineering organizations arrived at their current monitoring stack: incrementally, by accident, without design. A team needs metrics. They stand up Prometheus. Another team is doing distributed tracing and chooses Jaeger because a consultant recommended it in 2021. The security team wants log aggregation and procures an ELK deployment. A new service gets built by an engineer who prefers Datadog and expenses a trial. An acquired company brings its own observability tooling in the merger. Nobody made a bad decision in isolation. The aggregate result is four or five disconnected systems, each with partial visibility into the environment, none of which speak to each other. The cost of this architecture isn't obvious until an incident. In steady state, the fragmentation is an inconvenience — a bit of extra work to check multiple dashboards, some duplicated alerting logic, occasional inconsistencies between what different systems report. Engineers adapt. Runbooks get written that specify which tab to open first. Then something goes wrong in a way that crosses system boundaries — which, in a microservices environment, is basically every interesting incident — and the cost becomes immediate and concrete. The trace context doesn't propagate from the service instrumented with one agent to the service instrumented with another. The log timestamp in one system doesn't align with the metric spike in the other, and you spend eight minutes ruling out whether the difference is a timezone issue or a real sequence. The tool that would answer your question doesn't have the data because that service was never instrumented for it. The question OpenTelemetry is answering — slowly, imperfectly, but at a scale that suggests genuine momentum — is whether the industry can agree on a common foundation for telemetry that makes this fragmentation a choice rather than an inevitability. What OpenTelemetry Actually Is, Stripped of the Hype The CNCF project's ambitions are larger than its name implies. OpenTelemetry isn't primarily a tool. It's a specification, a set of APIs, a collection of SDKs across most major languages, and a Collector — a standalone service that receives, processes, and routes telemetry — that together constitute a vendor-neutral foundation for how applications produce and transmit observability data. The practical significance of "vendor-neutral" is easy to understate. Before OpenTelemetry reached maturity — and it only really reached meaningful production stability in its core components sometime in 2023 — instrumenting an application for observability meant tying yourself to a specific vendor's agent or SDK. Switch from Datadog to Honeycomb, or from Jaeger to a commercial backend, and you were re-instrumenting. Not just reconfiguring — actually touching code, removing one library, adding another, retesting. With OpenTelemetry, the instrumentation in application code emits to a standard protocol: OTLP, the OpenTelemetry Protocol. The Collector receives that data and routes it wherever you configure. Change your backend, change the Collector configuration. The application code doesn't know and doesn't care. This portability is real and I've watched organizations use it in practice. A fintech company in São Paulo that I spent time with in mid-2025 had been running Jaeger for distributed tracing. Their compliance team needed traces available in a system their auditors could access with enterprise-level controls — specifically, a commercial vendor's platform. Because their instrumentation was already OTel-native, the migration was a Collector configuration change and a two-day integration project. The engineers were visibly surprised it went that smoothly. Their previous vendor migration, before OTel, had taken three months. The Adoption Numbers and What They Mean EMA Research published figures in 2025 that I found genuinely striking for a project that was still cutting release candidates as recently as 2022: nearly half of organizations surveyed reported active OpenTelemetry usage in production, with another quarter indicating planned adoption. Grafana's observability survey from the same period showed Prometheus running at 67% adoption — its established position — while OpenTelemetry had closed to 41%, an extraordinary trajectory for a project that was pre-1.0 on most signals until 2023. What explains that velocity? Partly the backend consolidation play — organizations that have already committed to multiple observability vendors simultaneously see real value in a neutral collection layer. Partly the engineering community's attraction to open standards over proprietary lock-in, which has only intensified as vendor pricing for high-cardinality metrics and traces has become a genuine budget line item. And partly, I think, the slow accumulation of platform engineering investment described above — teams that are already thinking about their infrastructure as a product are more likely to make deliberate observability decisions rather than accumulating tools reactively. The 84% of OTel adopters reporting meaningful cost reductions figures that surface in EMA's research are worth treating carefully — vendor-adjacent surveys have obvious incentive structures — but the cost argument has a structural logic independent of any survey. When you centralize telemetry collection through a Collector with sampling and filtering capabilities, you gain control over what you're actually sending to backends. A common pattern I've seen in large-scale deployments: teams instrument comprehensively at the source, then configure tail-based sampling at the Collector level to send perhaps 10 to 15 percent of traces to expensive storage backends while retaining 100 percent of errored or slow traces. The result is complete visibility into what's actually going wrong, at a fraction of the ingestion cost of sending everything everywhere. The Collector as the Linchpin Of all OpenTelemetry's components, the Collector is the one I've watched teams misunderstand most consistently — both underinvesting in it and overcomplicating it. The underinvestment failure mode: treat the Collector as a pass-through and configure it to forward everything to a single backend without filtering, sampling, or enrichment. This works. It also eliminates most of the architectural benefit of centralizing collection in the first place. A Collector that simply relays raw telemetry is better than per-service direct export to a vendor — at least your configuration is centralized — but it's not capturing the value of having a processing layer in the pipeline. The overcomplication failure mode: attempt to route telemetry to five backends simultaneously from day one, with complex processor chains, multiple sampling strategies, and attribute transformations that nobody fully understands six months later. I've seen this create Collector configurations that are harder to reason about than the systems they're observing, maintained by one engineer who has become the de facto owner of something that should be team-legible infrastructure. The teams that do this well — and the pattern is consistent enough that I've started calling it out explicitly in conversations — start with one receiver, one processor, one or two exporters, and a clear ownership model. They expand the pipeline deliberately, treating each new processor or export target as a discrete decision with documented rationale. Their Collector configuration is in Git. Changes go through review. The observability pipeline is itself observable: they watch the Collector's own health metrics for export latencies and drop rates. An SRE manager at a US-based SaaS company described this to me in September 2025 with unusual clarity: "We treat the Collector like a service. It has an owner, it has SLOs, it has an on-call rotation. When we first deployed it we treated it like infrastructure — just set it up and forgot about it. That lasted until it became a single point of failure for our entire telemetry path during an incident and we had no visibility into why." The Correlation Problem That OTel Mostly Solves The deepest value of a unified telemetry standard isn't the cost savings or the backend portability. It's correlation — the ability to move from a metric anomaly to the trace that explains it to the log line that identifies the specific operation. Before unified context propagation, this was manual. You saw a latency spike in your metrics, pulled up your tracing tool, searched by time window and service name, found the relevant traces — maybe — then looked for correlated logs by timestamp, hoping the clocks were synchronized and the log levels were informative enough to be useful. For an experienced engineer who knew all the systems, this might take five minutes. For someone less familiar with the environment, or dealing with an unfamiliar failure mode, it could take much longer. OpenTelemetry's trace context propagation — the traceparent header that flows through HTTP calls between services, automatically attached by OTel SDKs — makes correlation mechanical. A single trace ID links the request path across every service it touched. If your logs are also emitting that trace ID — which OTel log instrumentation handles — you can navigate from a slow span in a trace directly to the log lines produced during that span, in the same system, with a single click in any backend that supports the correlation. I watched a junior engineer at a retailer do a root-cause analysis last November that, by the on-call lead's estimate, would have taken forty minutes before their OTel migration. It took nine. She had been on the team for three months. She'd never seen the failure mode before. The trace context gave her a path through the system that she could follow without needing to know in advance which service to look at next. That's the promise that the observability conversation has been making for five years. OpenTelemetry is the first time I've watched it delivered consistently enough, in enough organizations, to stop treating it as aspirational. What Remains Hard Honesty requires acknowledging the parts that haven't gotten easier. Auto-instrumentation — OTel's mechanism for capturing telemetry from common libraries without code changes — is excellent for standard HTTP calls, database queries, and gRPC. It's considerably less useful for anything proprietary or unusual: custom message queue implementations, legacy protocols, in-house frameworks built before any of this existed. Teams with significant legacy surface area still face manual instrumentation work that is unglamorous and time-consuming. Log integration is the signal that has lagged furthest. Traces and metrics in OTel are mature and stable. The logging specification and its SDK implementations have been catching up, and the situation is meaningfully better in early 2026 than it was eighteen months ago, but organizations with established logging pipelines face real migration complexity if they want fully correlated logs under OTel. The teams I've seen navigate this most smoothly have done it incrementally: add trace and span IDs to existing log output first, then migrate the collection path when the operational picture is clearer. And the Collector's operational complexity is real. It's not prohibitive, but it's not invisible either. A production Collector deployment handling high-cardinality telemetry from dozens of services is infrastructure that requires capacity planning, failure mode analysis, and ongoing operational attention. Teams that assume the Collector is a set-and-forget component inevitably discover otherwise. The Visibility You Don't Have Is the One That Matters I've thought about that engineer in the war room often over the past year. Six browser tabs, forty-seven minutes, an incident that was answerable by the data that existed — it just existed in the wrong places. The case for unified observability isn't primarily theoretical. It's the accumulated cost of every incident that ran longer than it needed to because context was scattered, every postmortem that identified observability gaps as a contributing factor and then filed that finding away, every junior engineer who couldn't navigate an unfamiliar system under pressure because there was no coherent thread to follow. OpenTelemetry doesn't eliminate incidents. It doesn't make systems less complex. What it does — when it's implemented thoughtfully, with a Collector that's treated as real infrastructure and instrumentation that covers the services that actually matter — is make the complexity legible. One data model. One propagation standard. One collection pipeline. Backends that can be swapped without touching application code. For an industry that has been drowning in its own telemetry for the better part of a decade, that's not nothing. The author covers cloud infrastructure, reliability engineering, and distributed systems for enterprise technology organizations. They have reported from engineering teams across North America, Europe, and South America over fifteen years.

By Igboanugo David Ugochukwu

CORE

Quality Assurance in AI-Driven Business Evolution

Why do most intelligent systems fail when they hit production? It's rarely because of a weak algorithm. Instead, it's usually a testing framework stuck in a bygone era. If you're still running "Expected vs. Actual" spreadsheets for non-deterministic models, you're trying to measure a cloud with a ruler. The reality is that traditional quality checks create a false sense of security. This leads to failures in live environments. You've got to stop testing for a single "correct" answer. It's time to start testing for the boundaries of acceptable behavior. The Foundation of Modern AI Quality AI Quality Assurance is the systematic verification of probabilistic systems to ensure they remain reliable, ethical, and performant as they evolve. Unlike legacy software, these systems change based on the data they ingest. This makes static testing essentially useless. The shift toward AI TRiSM (Trust, Risk, and Security Management) is the core of this new environment. It moves beyond simple bug hunting to focus on the long-term integrity of your tech stack. By analyzing how models interact with fluctuating data, you'll ensure your modernization stays safe by eliminating faulty data outputs and biased model behavior. You're no longer just checking lines of code. You're auditing the entire lifecycle of the decision-making process. This requires a shift in how we think about the health of a system. The AIMS Framework: ISO/IEC 42001 The ISO/IEC 42001 - AI Management System (AIMS) is the primary international standard for governing these projects (ISO/IEC 42001:2023). It's a roadmap for managing risks and opportunities. When you implement an AIMS, you're not just testing a product. You're institutionalizing a quality culture that spans from data acquisition to model retirement. It provides the structure needed to scale without losing control. NIST AI Risk Management Pillars To maintain high standards, you should deploy the NIST AI Risk Management Framework (AI RMF) (NIST AI RMF 1.0, 2023). This framework uses functional pillars: Govern: Embed risk management into the daily developer workflow, so it's not an afterthought.Map: Categorize the AI context to identify specific risks before they happen.Measure: Use quantitative and qualitative methods to assess if the system is actually trustworthy.Manage: Prioritize and respond to risks based on how they impact the business and the end-user. Why Metamorphic Testing is the New Standard Metamorphic testing is a technique that validates the relationship between multiple inputs and outputs rather than verifying a single, static result. Traditional testing fails AI because you often lack a "ground truth," which experts call the Oracle Problem. If an AI predicts a mortgage rate, you can't manually recalculate every single permutation. It's too complex for a spreadsheet. So, how do we know if the logic holds up? Instead, we use metamorphic relations. For example, if you increase a user's credit score in a test case, the AI's predicted interest rate should logically decrease or stay the same. If the rate increases, you've hit a metamorphic violation. This approach verifies non-deterministic systems where the "correct" answer is a range, not a single point. This is now the standard for verifying modern AI-led shifts. Technical Implementation: Metamorphic Relation (MR) Plain Text # Pseudo-code for a Metamorphic Relation in Credit Scoring def test_metamorphic_credit_logic(model, base_input): # Relation: Higher Credit Score -> Lower or Equal Interest Rate output_1 = model.predict(base_input) modified_input = base_input.copy() modified_input['credit_score'] += 50 output_2 = model.predict(modified_input) assert output_2 <= output_1, f"MR Violation: Rate increased from {output_1} to {output_2}" Testing for Bias and Fairness ISO/IEC TR 29119-11 provides a checklist for bias testing. In AI-driven evolution, quality equals equity. If your system's biased, it's not high quality — it's a liability. You should use tools like AI Fairness 360 to perform regular fairness audits. These ensure your AI project does not inadvertently exclude demographic groups due to flawed training data. It's about protecting both the user and the brand. Performance Under Data Loads Neural networks require heavy stress testing against messy or incomplete data. In the real world, data is rarely clean. Fault-tolerant systems must be designed to fail gracefully rather than crashing or providing irrelevant outputs. You must verify that the model does not provide a high-confidence, incorrect answer when it encounters out-of-distribution (OOD) data. If the AI doesn't know the answer, it should be able to say so. The Strategic Shift to Data-Centric QA Data-Centric QA is the process of verifying training and testing datasets to ensure model output remains consistent with real-world drift. In the past, QA teams focused on the UI and backend logic. In AI-led shifts, the data is the logic. Data Lineage and Drift If data drifts — meaning real-world data diverts from what was used in training — performance will degrade. It's not a matter of if, but when. Modern QA teams monitor Data Drift using statistical tests like Kolmogorov-Smirnov (KS) or Population Stability Index (PSI). You've got to ensure your data pipeline is as resilient as your code pipeline. If the foundation moves, the house will fall. The Role of Agentic QA Engineers The Agentic QA Engineer is a new expert tier in the workforce. They focus on autonomous "AI Agents" that execute multi-step workflows. Testing an agent is a different process entirely. It requires simulating complex environments where the agent makes sequential decisions. Your job is to ensure the agent doesn't hallucinate a step or take unethical shortcuts to reach a goal. It's about supervising the decision-making path. Action Steps for Implementing AI Quality Assurance Conduct a Gap Analysis: Use the NIST AI RMF to find where your current tests fail to cover probabilistic outcomes.Implement an AIMS: Adopt ISO/IEC 42001 to establish clear accountability across your teams.Deploy Metamorphic Testing: Define relationships between inputs for your most critical models. This helps catch bugs that assertion-based testing misses.Setup Data Observability: Integrate monitors for data drift and lineage to prevent model decay before it hits the user.Train for Adversarial Prompting: Educate your QA team on Adversarial Prompting. Check the OWASP LLM Top 10 to test the strength of the system against prompt injection.Adopt Visual AI: Integrate tools into your frontend regression suites. This eliminates brittle tests that break on minor UI updates.Establish Human-in-the-Loop (HITL): Create a process for human experts to review edge cases flagged by the AI. This ensures ethical compliance and improves precision over time. Conclusion: Quality as the Engine of Transformation Quality Assurance in AI-Driven Business Evolution is not a final hurdle. It's the engine that makes the whole shift possible. By adopting ISO/IEC 42001 and metamorphic testing, you move from hoping it works to knowing it's reliable. Transitioning from code-centric to data-centric quality is the only way to manage the complexity of intelligent systems. Don't just test for pass or fail — test for trust. Your digital future depends on it.

By Parimal Kumar

Cybersecurity with a Digital Twin: Why Real-Time Data Streaming Matters

Cyberattacks on critical infrastructure and manufacturing systems are growing in scale and sophistication. Industrial control systems, connected devices, and cloud services expand the attack surface far beyond traditional IT networks. Ransomware can stop production lines, and manipulated sensor data can destabilize energy grids. Defending against these threats requires more than static reports and delayed log analysis. Organizations need real-time visibility, continuous monitoring, and actionable intelligence. This is where a digital twin and data streaming come together: digital twins provide the model of the system, while a Data Streaming Platform ensures that the model is accurate and up to date. The combination enables proactive detection, faster response, and greater resilience. The Expanding Cybersecurity Challenge Cybersecurity is becoming more complex in every industry. It is not only about protecting IT networks anymore. Industrial control systems, IoT devices, and connected supply chains are all potential entry points for attackers. Ransomware can shut down factories, and a manipulated sensor reading can disrupt energy supply. Traditional approaches rely heavily on batch data. While many logs are collected on a continuous basis or in micro-batches, systems struggle to act on them as quickly. Reports are generated every few hours. Many organizations also still operate with legacy systems that are not connected or digital at all, making visibility even harder. This delay leaves organizations blind to fast-moving threats. By the time the data is examined, the damage is already be done. Supply Chain Attacks Supply chains are now a top target for attackers. Instead of breaking into a well-guarded core system, they exploit smaller vendors with weaker defenses. A single compromised update or tampered data feed can ripple through thousands of businesses. The complexity of today’s global supply networks makes these attacks hard to detect. With batch-based monitoring, signs of compromise often appear too late, giving threats hours or days to spread unnoticed. This delayed visibility turns the supply chain into one of the most dangerous entry points for cyberattacks. Digital Twin as a Cybersecurity Tool A digital twin is a virtual model of a real-world system. It reflects the current state of assets, networks, or operations. In a cybersecurity context, this creates an environment where organizations can: Simulate potential attacks and test defense strategies.Detect unusual patterns compared to normal system behavior.Analyze the impact of changes before rolling them out. But a digital twin is only as good as the data feeding it. If the data is outdated, the twin is not a reliable representation of reality. Cybersecurity demands live information, not yesterday’s snapshot. The Role of a Data Streaming Platform in Cybersecurity with a Digital Twin A Data Streaming Platform (DSP) provides the backbone for digital twins in cybersecurity. It enables organizations to: Ingest diverse data in real time: Collect logs, sensor readings, transactions, and alerts from different environments — cloud, edge, and on-premises.Process data in motion: Apply filtering, transformation, and enrichment directly on the stream. For example, match a login event with a user directory to check if the access is suspicious.Detect anomalies at scale: Use stream processing engines like Apache Flink to identify unusual patterns. For instance, hundreds of failed login attempts from a single IP can trigger an alert within milliseconds.Provide governance and lineage: Ensure that sensitive data is secured, access is controlled, and the entire flow is auditable. This is key for compliance and forensic analysis after an incident. A key advantage is that a Data Streaming Platform is hybrid by design. It can run at the edge to process data close to machines, on premises to integrate with legacy and sensitive systems, and in the cloud to scale analytics and connect with modern AI services. This flexibility ensures that cybersecurity and digital twins can be deployed consistently across distributed environments without sacrificing speed, scalability, or governance. Learn more about Apache Kafka cluster deployment strategies. For a deeper exploration of these data streaming concepts, see my dedicated blog series about data streaming for cybersecurity. It covers how Kafka supports situational awareness, strengthens threat intelligence, enables digital forensics, secures air-gapped and zero trust environments, and modernizes SIEM and SOAR platforms. Together, these patterns show how data in motion forms the backbone of a proactive and resilient cybersecurity strategy. Kafka and Flink as the Open Source Backbone for Cybersecurity at Scale Apache Kafka and Apache Flink form the foundation for streaming cybersecurity architectures. Kafka provides a scalable and fault-tolerant event backbone, capable of ingesting millions of messages per second from logs, sensors, firewalls, and cloud services. Once data is available in Kafka topics, it can be shared across many consumers in real time without duplication. Flink complements Kafka by enabling advanced stream processing. It allows continuous analysis of data in motion, such as correlation of login attempts across systems or stateful detection of abnormal traffic flows over time. Instead of relying on batch jobs that check logs hours later, Flink operators evaluate security patterns as events arrive. This combination of Kafka as the durable, distributed event hub and Flink as the real-time processing engine is central to modern security operations platforms, SIEMs, and SOAR systems. It is the shift from static analysis to live situational awareness. With Kafka and Flink, a digital twin can mirror networks, devices, and processes in real time, detect deviations from expected behavior, and support proactive defense against cyberattacks. The result is a shift from static analysis to live situational awareness and actionable insights. Kafka Event Log as Digital Twin with Ordering, Durability, and Replay A digital twin is only useful if it reflects reality in the right order. Kafka’s event log delivers this with ordering, durability, and replay. Event Log as a Live Digital Twin Kafka’s append-only commit log creates a living record of every event in exact order. This is critical in cybersecurity, where sequence shows cause and effect, not just data points. In network traffic, ordered events reveal brute-force attacks by showing retries in order. Industrial command logs show whether shutdowns were legitimate or malicious. Ordered login attempts expose credential-stuffing. Without this timeline, patterns vanish, and analysts lose context. This is a major advantage of Kafka compared to other cyber data pipelines. Tools like Logstash or Cribl can move data to a SIEM, SOAR, or storage system, but they lack Kafka’s durable, fault-tolerant log. When nodes fail, these tools can lose data. Many cannot replay data at all, or they replay it out of order. Replay and Long-Term Forensics Kafka enables reliable event replay for forensics, simulation, and audits. Natively integrated into long-term storage such as Apache Iceberg or cloud object stores, it supports both real-time defense and deep historical analysis. Its fault-tolerant log preserves ordered event data, allowing teams to reconstruct attacks, validate detections, and train AI models on complete histories. This continuous access to accurate event streams turns the digital twin into a trusted source of truth. The result is stronger compliance, fewer blind spots, and faster recovery. Kafka ensures that security data is not only captured but can always be replayed and verified as it truly happened. Diskless Kafka: Separating Compute and Storage Diskless Kafka removes local broker storage and streams event data directly into object storage such as Amazon S3. Brokers become lightweight control planes that handle only metadata and protocol traffic. This separation of compute and storage reduces infrastructure costs, simplifies scaling, and maintains full Kafka API compatibility. The architecture fits cybersecurity and observability use cases especially well. These workloads often require large-scale near real-time analytics, auditing, and compliance rather than ultra-low latency. Security and operations teams benefit from the ability to retain massive event histories in cheap, durable storage while keeping compute elastic and cost-efficient. Modern data streaming services like WarpStream (BYOC) and Confluent Freight (Serverless) follow this diskless design. They deliver Kafka-compatible platforms that provide the same event log semantics but with cloud-native scalability and lower operational overhead. For observability and security pipelines that must balance cost, durability, and replay capability, diskless Kafka architectures offer a powerful alternative to traditional broker storage. Confluent Sigma: Streaming Security with Domain-Specific Language (DSL) and AI/ML for Anomaly Detection Confluent Sigma is an open-source implementation that brings these concepts closer to practitioners. It combines stream processing with Kafka Streams for data-in-motion processing with an open DSL for the expression of patterns. The power of Sigma is that enables free exchange of known threat patterns rapidly across the community. With Sigma, security analysts can define detection rules using familiar constructs, while Kafka Streams executes them at scale across live event data. For example, a Sigma rule might detect unusual authentication patterns, enrich them with user metadata, and flag them for investigation. SOC Prime is a leading commercial entity behind Sigma. They have built a commercial offering on top of the Confluent Sigma project, adding machine learning that classifies events deviating from normal system behavior. This architecture is designed to be both powerful and accessible. Analysts define rules in Sigma; Kafka Streams (in this example implementation) or Apache Flink (recommended especially for stateful workloads and/or scalable cloud services) ensure continuous evaluation; machine learning identifies subtle anomalies that rules alone may miss. The result is a flexible framework for building cybersecurity applications that are deeply integrated into a Data Streaming Platform. Example: Real-Time Insights for Energy Grids and Smart Meters Energy companies often operate across millions of smart meters and substations. Attackers may try to inject false readings to disrupt billing or even destabilize grid control. With batch data, these attacks might remain hidden for days before anyone notices abnormal consumption patterns. A Data Streaming Platform changes this picture. Every meter reading is ingested in real time and fed into Kafka topics. Flink applications process the stream to identify anomalies, such as sudden spikes in consumption across a region or suspicious commands sent to multiple meters at once. The digital twin of the grid reflects this live state, providing operators with instant visibility. Integration with operational technology (OT) systems is essential. Leading vendors such as OSIsoft PI System (now AVEVA PI), GE Digital Historian, or Honeywell PHD collect time-series data from sensors and control systems. Connectors bring this data into Kafka so it can be correlated with IT signals. On the IT side, tools like Splunk, Cribl, Elastic, or cloud-native services from AWS, Azure, and Google Cloud consume the enriched stream for further analytics, dashboarding, and alerting. This combination of OT and IT data provides a holistic security view that spans both physical assets and digital infrastructure. Example: Connected Intelligence in Smart Factories A modern factory may operate thousands of IoT sensors, controllers, and machines connected via industrial protocols such as OPC-UA, Modbus, or MQTT. These devices continuously generate data on vibration, temperature, throughput, and quality. Each signal is a potential early indicator of an attack or malfunction. A Data Streaming Platform integrates this data flow into a central backbone. Kafka provides the scalable ingestion layer, while Flink enables real-time correlation of machine states. The digital twin of the factory is constantly updated to reflect current conditions. If an unusual command sequence appears, for example, a stop request issued simultaneously to several critical machines, streaming analytics can compare the event against normal operating behavior and flag it as suspicious. Again, data streaming does not operate in isolation. Historian systems like AVEVA PI or GE Digital remain critical for long-term storage and process optimization. These can be connected to Kafka so historical and live data are analyzed together. On the IT side, integration with SIEM platforms such as Splunk or IBM QRadar, or with cloud-native monitoring services, allows security teams to combine plant-floor intelligence with enterprise-level threat detection. By bridging OT and IT in real time, data streaming makes the digital twin more than a model. It becomes an operational tool for both optimization and defense. Business Value of Data Streaming for Cybersecurity The combination of cybersecurity, digital twins, and real-time data streaming is not just about technology. It is a business enabler. Key benefits include: Reduced downtime: Fast detection and response minimize production stops.Lower financial risk: Early prevention avoids costly damages, regulatory penalties, and brand risk that can arise from public breaches or loss of trust.Improved resilience: The organization can continue operating safely under attack.Trust in digital transformation: Executives can adopt new technologies without fear of losing control. This means cybersecurity must be embedded in core operations. Investing in real-time data streaming is not optional. It is the only way to create the situational awareness needed to secure connected enterprises. Building Trust and Resilience with Streaming Cybersecurity Digital twins provide visibility into complex systems. Data streaming makes them reliable, accurate, and actionable. Together, they form a powerful tool for cybersecurity. A Data Streaming Platform such as Confluent integrates data sources, applies continuous processing, and enforces governance. This transforms cybersecurity from reactive defense to proactive resilience. Explore the entire data streaming landscape to find the right open source framework, software product, or cloud service for your use cases. Organizations that embrace real-time data streaming will be prepared for the next wave of threats. They will protect assets, maintain trust, and enable secure growth in an increasingly digital economy.

By Kai Wähner

CORE

Hidden Cyber Threat AI Is Preparing That Some Companies Aren't Thinking About

Cyber threats are in an era where defense and attack are powered by artificial intelligence. While AI has seen a rapid advancement in recent times, it has raised concern among world leaders, policymakers and experts. Evidently, the rapid and unpredictable progression of AI capabilities suggests that their advancement may soon rival the immense power of the human brain. Thus, with the clock constantly ticking, urgent and proactive measures need to be set in place to mitigate unforeseen, looming future risks. According to this research, Geoffrey Hinton (Winner, Nobel Prize in Physics (2024), aka "godfather of AI") has grown more worried since 2023, noting that AI advances faster than expected, excelling at reasoning and deception. Hinton warns that to stay operational, if it perceives threats to its goals, AI could be deceptive. He predicts that AI can spur massive unemployment ( replacing software engineers, routine jobs), soar profits for companies, and create societal disruption under capitalism. He estimates a 10–20% chance of human extinction by superintelligent AI within decades, emphasizing bad actors using it for harm, like bioweapons, and the need for regulation. AI is Not Slowing Down on Attacks Here are a few incidents that prove that artificial intelligence isn't slowing down on attacks: According to a report by Deep Instincts, 75% of cybersecurity professionals had to modify their strategies last year to address AI-generated incidents. According to this post on Harvard Business Reviews, spammers save about 95% in campaign costs using large language models (LLMs) to generate phishing emails. According to a post on Deloitte, Gen AI will multiply losses from deepfakes and other attacks by 32% to $40 billion annually by 2027. According to the Federal Bureau of Investigation, in 2023, crypto-related losses totalled $5.6 billion nationally, accounting for 50% of total reported losses from financial fraud complaints. Imagine how much more was lost from 2024-2025. Hidden Dooms AI is Preparing That Some Companies Are Yet to See Widespread Disruption: The advancement in AI technology is gradually turning AI to a double-edged sword. AI can be used to launch a sophisticated cyberattack that could cause a widespread disruption to critical infrastructure, financial systems and other key sectors within a company and beyond. No wonder, David Dalrymple, an AI safety expert, warns that AI advancement is moving super fast, with the world potentially running out of time for safety preparation. Social Manipulation: It's no longer news that AI has so many fascinating advantages but companies need to have a deep understanding of it, so as not to be doomed by it. Gary Marcus, an AI critic and cognitive scientist, warns that current LLMs are dishonest, unpredictable and potentially dangerous. He further notes that one of the real harms AI is capable of is psychological manipulation, which can be leveraged by attackers to socially manipulate public opinions, spread misinformation that could lead to social unrest and destabilization of company and society. Advent of Superintelligence and Control Problem: With AI, the possibility of creating a Superintelligent agent that surpasses human intelligence (the Creator) is raising eyebrows. Yoshua Bengio said in a Wall Street Journal post, “If we build machines that are way smarter than us and have their own preservation goal, then we are creating a competitor to humanity smarter than us”. Unfortunately, the created Superintelligent AI lacks human ethics and would eventually view humans as obstacles to its goal. That way, humanity won’t be able to control the problem, potentially leading to human extinction or war. Operational Code Bloat or Flawed Value Lock-in: Literally, the AI system's function is dependent on the locked-in value that was programmed. However, with AI’s ability to generate codes, it could add in unwanted features – increasing its vulnerability or attack surface. Thus, an attacker could reprogram the AI system to sabotage via data poisoning or flawed values to pursue evil actions that are detrimental to humanity. Common Faults Caused By Companies #1: Poor Integration of GenAI Tools: The integration of third-party GenAI tools like ChatGPT and similar LLMs, without strict controls, has led to so many data leaks that could enable sabotage or espionage opportunities, as leaked data can be weaponized externally. #2: Full Reliance on AI Agents Without Human Oversight: Full reliance on agentic AI without human guidance has led to some critical accidents. According to research, transport companies such as Tesla and Uber have experienced serious incidents due to an over-reliance on AI without human oversight. #3: Poor Investment In AI Safety and Ethics: Oftentimes, when companies fail to invest in AI safety and ethics, they unknowingly leave themselves wide open to attacks. That's why DeepMinds and OpenAI highlight the importance of investing in their safety and ethics. #4: Lack of Clear Policies and Training: When a company lacks strong and clear policies for AI use and regular end-user training on AI's specific security risks, they open their doors to data leakage and prompt injections. Because even the most secure company could be compromised by an untrained or uninformed employee. #5: Poor Security and Continuous Testing: Literally, AI risk assessment shouldn't be treated as a one-time thing. But many companies fail to conduct risk assessments continuously, leading to system vulnerabilities in which adversarial prompts and data manipulation can occur. How Companies Should Prepare For 2026 Attacks Considering the rate at which the threat landscape is rapidly evolving, companies need to adopt a multilayered defense approach to closely match the kind of tumultuous attacks predicted to occur in 2026. And they are as follows: #1 Prepare for Emerging Threats No system can't be attacked. And yes, AI can attack an AI system. It's safer to prepare ahead by setting these three factors straight: Develop an incident response plan for your company’s defense.Conduct regular security training for employees. And trainers should focus on teaching employees how to treat AI agents as actors with their own identities and how to implement Identity and Access Management (IAM) control to prevent unauthorized access.Educate the company C-Suites on AI-risk as a board-level issue. #2 Develop a Comprehensive AI Policy and Procedure Companies should develop a policy and procedure for the secure and ethical use of AI within their organization. This policy includes defining a role for AI oversight, ensuring data privacy, and implementing access control for AI systems. #3 Automate Security Hygiene and Adopt Continuous Monitoring This is another way to prepare against AI attacks in 2026. By automating a routine task like vulnerability scanning, patch and configuration management reduces the window of attacks. Moreover, intense monitoring of AI agent behaviour and interactions is an ideal way to track unusual activity that could indicate an attack. #4 Have Red Team Test Weaknesses and Share Threat Intelligence Considering the sophisticated nature of AI attacks on companies, it's advisable to have a Red team run a test simulation of AI attacks to identify weak centres. While it's much better for companies to find their weaknesses themselves than for attackers to discover their weak spots, having firsthand information on the latest AI threat from other external sources like ISACs (Information Sharing and Analysis Centre) is another way to prepare for AI attacks.

By Francis Ejiofor