We've been running AI agents in production across enterprise cloud support for several years now. I've watched the same pattern play out dozens of times across organizations of every size: a team builds a compelling pilot, leaders get excited, and then it stalls. Not because the technology failed, but because the operating model was never designed for what agents actually do when they stop assisting humans and start executing work on their behalf.

This isn't a failure of ambition. It's a failure of classification. Organizations treat all agent initiatives the same way (same governance, same ownership model, same success metrics) and then wonder why agents that draft emails scale easily while agents that process workflows create governance crises by agent fifty.

The problem isn't building agents. The problem is that nobody designed an operating model for what agents do when they stop assisting and start executing.

The Shift That Changes Everything

There's a deceptively simple transition happening in enterprise AI that most architecture conversations skip over. AI agents are moving from assisting humans to executing work. On the surface, this sounds like an incremental capability improvement. In practice, it changes everything about how you govern, own, and operate them.

In assist mode, the agent supports human decision-making. The human decides what to do. The human executes the action. The human is fully accountable. The governance model is familiar because it's essentially the same as any other software tool: set some usage policies, manage access, track adoption. Low risk. Familiar territory.

In execute mode, the agent performs work across systems. The agent acts on decisions. The agent orchestrates multi-step workflows. The human oversees outcomes rather than approving each action. This creates four new demands that most organizations are completely unprepared for:

- Who is accountable for this agent?
- What happens when it goes wrong?
- Who maintains and improves it over time?
- What is it allowed to do and not do?

These questions sound simple. In my experience, most organizations cannot answer even one of them clearly for their production agents. That's the gap. And it's why agents stall.

Six Patterns, Six Operating Models

The most useful insight I can share from production experience is this: not all agent initiatives are the same, and treating them the same is what breaks scale. An agent that drafts emails for individuals is a completely different organizational bet than an agent that processes support requests autonomously. They require different governance, different ownership models, different success metrics, and different levels of organizational maturity.

In practice, I've found it useful to think about agent work in six distinct patterns, each with its own operating requirements. These are design choices, not stages; most organizations run two or three simultaneously.

Pattern 1: Employee AI Enablement

Every employee uses AI assistants for research, drafting, summarization, and personal workflow automation. The human retains full decision-making authority; the agent recommends, the human decides. This is the most accessible pattern and the right starting point for most organizations.

What most teams get wrong here: they treat this as a technology deployment rather than a behavior change program. The technology is the easy part. Getting people to actually change how they work, to build the habit of using agents rather than falling back to familiar processes, requires visible leadership role-modeling, continuous enablement, and a community that celebrates and shares what works. Licenses do not become usage on their own.

Pattern 2: Business Expert Empowerment

An expert's knowledge (in compliance, engineering standards, risk assessment, or regulatory interpretation) is captured and scaled across the organization through an agent.
The expert shifts from answering every question to teaching the agent and auditing its output.

The critical insight here: the agent's credibility IS the product. If the agent gives wrong expert advice, you damage the expert's reputation and potentially the business. I've seen this pattern fail repeatedly because teams focused on building the agent and ignored knowledge quality controls. The agent is only as good as its source documents. If you cannot guarantee those documents are authoritative, current, and complete, you should not deploy this pattern.

Pattern 3: Workplace and IT Services

Agents operate internal services end-to-end: IT helpdesk, HR, Finance, Facilities. These agents don't just answer questions; they execute service workflows such as processing leave requests, provisioning access, validating expenses, and routing procurement.

The scale-breaker I see consistently: teams automate individual tasks without redesigning the service flow. You end up with islands of automation that don't connect, such as a faster intake process that feeds into the same manual triage queue. Design the service first. Then build the agents.

Pattern 4: Core Business Process Transformation

Agents run core enterprise processes end-to-end: claims processing, order-to-cash, financial close, supply chain coordination. These are business-critical workflows where agents make decisions, not just suggestions, with direct impact on revenue, cost, and customer experience.

This is where I see the most governance failures. Organizations apply the same lightweight controls they used for productivity agents to business-critical autonomous workflows. The result is agents making consequential decisions without audit trails, escalation paths, or defined autonomy limits. This pattern demands depth everywhere; there's no capability driver you can shortcut.

Pattern 5: External Engagement

Agents interact directly with customers, partners, or ecosystem stakeholders, crossing the enterprise trust boundary.
Every interaction affects brand, reputation, and customer trust. Errors are visible externally.

The non-negotiable here: external agents need higher governance and security maturity than any internal pattern because one bad customer interaction from an unsupervised agent is a brand crisis. Disclosure, consent, identity isolation, and real-time monitoring are not optional. Neither is a 15-minute incident response plan.

Pattern 6: AI-First Capabilities

Net new capabilities designed with agents as the core building block: things that weren't possible before AI. Agents operate in sense-decide-act loops: continuously monitoring signals, making autonomous decisions within boundaries, executing actions, and learning from outcomes.

This pattern demands the highest maturity across all capability dimensions. There's no existing process to compare against, no baseline to measure improvement from. Everything must be built, including how you measure success.

Your pattern determines WHERE you invest, not just how much. Starting with the wrong pattern for your maturity level is a primary reason agents stall.

The Maturity Trap

Here's the mistake I see most often: organizations pick an ambitious pattern (say, core business process transformation) without honestly assessing whether their organizational capabilities can support it. They have Level 1 maturity in business strategy and governance but Level 3 technology infrastructure, and they convince themselves the technology readiness compensates for the organizational gaps. It doesn't.

Maturity in this context spans five dimensions:

- how deliberately you plan and invest in AI strategy;
- how deeply AI is integrated into business processes and outcome measurement;
- how well you manage risk, compliance, and responsible AI;
- how mature your platforms, architecture, and data quality are; and
- how effectively you enable adoption and build an AI-positive culture.
The critical insight is that your weakest dimension becomes your ceiling, regardless of how strong the others are. I've watched organizations with world-class AI infrastructure fail to scale agents because they had no governance model and no named owners for production agents. The technical foundation was irrelevant; the agents couldn't be trusted in production because nobody knew who was accountable when something went wrong.

The goal is not to reach maximum maturity everywhere. Different patterns require different maturity depths across different dimensions. Your job is to identify which pattern you're pursuing, assess where you are today, find the biggest gap, and fix that first. The biggest gap is your scale-breaker.

Five Scale-Breakers I've Seen in Production

After working across multiple AI agent deployments, these are the patterns I see breaking scale most consistently:

1. Many Pilots, No Portfolio

Agents aren't tied to measurable business outcomes. Each team builds something interesting, but there's no portfolio view, no named business owners, no defined success metrics. The fix: pick one or two outcomes, pick one or two patterns, name an owner for each, and define what success looks like before you build.

2. One-Off Agents, No Reuse

Every team reinvents the wheel because there's no shared reference architecture, no standardized integration approach, and no common telemetry baseline. Each agent is a bespoke build that can't share components with anything else. At agent fifty, your maintenance burden is fifty independent systems.

3. Great Demos, Low Adoption

The AI experience isn't designed end-to-end. Users don't know when to use the agent, what it can do, or how to validate its outputs. The fix: define golden paths for your top scenarios, covering how users engage, what's automated versus human-approved, and how exceptions are handled.

4. Licenses Don't Equal Usage

Enablement and change management aren't systematic.
There's no community, no training program, no champions network, no incentives tied to new ways of working. You can deploy Copilot to 10,000 employees and have 200 active users if you don't build a sustained enablement motion.

5. Shadow Agents Appearing

Governance isn't operational. Teams build agents outside official channels because the official path is too slow or unclear. The fix isn't more process; it's making the safe path the easy path. Implement a minimum baseline: named owner, audit trail, release gate, monitoring, escalation path. Make that baseline so easy to satisfy that going around it takes more effort than using it.

The Operating Model That Actually Works

The operating model question that matters most is not 'what technology should we use' but 'who owns this agent, what happens when it goes wrong, and how does it improve over time.' In my experience, the organizations that scale agents successfully share three operating model characteristics that struggling organizations consistently lack.

First, they treat agents as products, not projects. A project ends when the agent is deployed. A product has an owner, a monitoring plan, a feedback loop, and a defined path to improvement or retirement. Every agent in production without monitoring and an improvement plan is accumulating risk: knowledge goes stale, integrations break, user patterns change. Agents don't fail dramatically; they slowly drift, giving increasingly wrong answers with full confidence. That's worse than a crash, because nobody notices.

Second, they govern proportionately to risk. They don't apply the same controls to a personal productivity agent that they apply to an agent processing financial transactions. Low-risk agents get lightweight controls: named owner, basic monitoring, standard release checklist. High-risk agents get production-grade SLA monitoring, security reviews, responsible AI assessments, decision rights frameworks, and incident response plans.
Over-governing low-risk agents kills adoption. Under-governing high-risk agents creates liability.

Third, they centralize how scale works, not who builds everything. The central team sets standards, manages platforms, runs community programs, and provides governance guardrails. Domain teams build and own agents within those guardrails. The central team's primary job is enablement, not control. Make the safe path the easy path.

Agents don't scale through technology. They scale through people, ownership, and operating discipline. You don't need a bigger model. You need a better operating model.

What I'd Do Differently

If I were starting an enterprise agent program from scratch today, here's what I would prioritize differently based on production experience:

Name an owner before you build. Not a team, a person. The accountability gap is the single most common failure point I see. When something goes wrong with an agent that 'the team' owns, nobody fixes it promptly because everyone assumes someone else is handling it.

Run your maturity diagnostic before picking your pattern. Be honest about where you actually are, not where you aspire to be. A realistic assessment of your weakest dimension will tell you more about what pattern you're ready for than any technology readiness assessment.

Deploy monitoring on day one, not after adoption. I have seen too many teams treat monitoring as a phase-two concern. By the time phase two arrives, there are already production agents with no visibility into accuracy, drift, or escalation patterns. If you can't monitor it, you can't trust it.

Build your first agent for reuse, not just for the use case. The architectural decisions you make in your first production agent (how you handle telemetry, how you structure knowledge sources, how you design escalation paths) become the template every subsequent agent follows. Get those decisions right early, and the fiftieth agent will be easier to build, deploy, and operate than the fifth.
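As a rough illustration of the minimum baseline, risk-proportionate controls, and "weakest dimension is your ceiling" ideas above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `AgentRecord` class, the control names, the two-tier risk split, and the example scores are hypothetical and are not taken from any particular governance framework.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class RiskTier(Enum):
    LOW = "low"    # e.g. a personal productivity agent
    HIGH = "high"  # e.g. an agent processing financial transactions


# Minimum baseline named in the text (identifiers are illustrative).
BASELINE = ("owner", "audit_trail", "release_gate", "monitoring", "escalation_path")

# Extra controls the text lists for high-risk agents (identifiers are illustrative).
HIGH_RISK_EXTRAS = (
    "sla_monitoring",
    "security_review",
    "responsible_ai_assessment",
    "decision_rights",
    "incident_response_plan",
)


@dataclass
class AgentRecord:
    name: str
    risk: RiskTier
    owner: str = ""                    # a named person, not a team
    controls: frozenset = frozenset()  # names of controls already in place

    def missing_controls(self) -> list[str]:
        """Return the required controls this agent does not yet satisfy."""
        required = list(BASELINE)
        if self.risk is RiskTier.HIGH:
            required += HIGH_RISK_EXTRAS
        missing = []
        for control in required:
            if control == "owner":
                if not self.owner.strip():
                    missing.append(control)
            elif control not in self.controls:
                missing.append(control)
        return missing

    def ready_for_production(self) -> bool:
        return not self.missing_controls()


def maturity_ceiling(scores: dict[str, int]) -> tuple[str, int]:
    """The weakest dimension and its level, which caps what you can scale."""
    weakest = min(scores, key=scores.get)
    return weakest, scores[weakest]


# Hypothetical agents and scores, for illustration only.
draft_helper = AgentRecord(
    "email-drafter", RiskTier.LOW, owner="A. Person",
    controls=frozenset({"audit_trail", "release_gate", "monitoring", "escalation_path"}),
)
claims_agent = AgentRecord(
    "claims-processor", RiskTier.HIGH, owner="", controls=frozenset({"monitoring"}),
)
scores = {"strategy": 2, "business_integration": 2, "governance": 1,
          "technology": 3, "adoption": 2}

print(draft_helper.ready_for_production())  # True
print(claims_agent.missing_controls()[0])   # owner
print(maturity_ceiling(scores))             # ('governance', 1)
```

The point of keeping the check this small is the one the text makes: a baseline that is trivial to satisfy is more likely to be used than worked around.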
The Bottom Line

The technical capability to build production-grade AI agents exists today. The constraint is organizational. Most enterprises are running a twenty-first-century technology capability on a twentieth-century operating model, and wondering why it keeps stalling.

The organizations winning with agents are not necessarily the ones with the best models or the most compute. They're the ones that figured out ownership, governance, and lifecycle discipline before they scaled. They built operating models designed for agents that execute, not just agents that assist.

That shift from assist to execute is the one that changes everything. And it's the one most organizations are still not prepared for.
Coding agents are good now. They can write a function, fix a failing test, or walk you through a chunk of legacy code you'd rather not read. That part is settled. The harder question is what happens when you hand one a real piece of delivery work, something that has to change the database and the API and the UI and the tests all together, and keeps running long after you've stepped away from your desk.

That's usually where a single agent starts to struggle, and it isn't because the model isn't smart enough. The limit is human attention. A team might have fifty things sitting in its backlog that an agent could help with, but somebody still has to scope each one, keep an eye on it, review what comes back, and confirm it actually works. So you can generate code far faster than before and still ship at about the same pace. The slow part just moved.

Long delivery work is a different animal from a quick coding task. It needs someone to hold the scope steady, keep the architecture consistent from one file to the next, make sure the tests check what the feature is meant to do rather than what the code happens to do, review the result, and hand off cleanly to whatever comes next. Ask one agent to carry all of that in a single context window across a long run, and it tends to drift.

You've probably watched it happen: it loses the plot halfway through, writes tests that pass only because they were shaped around the code it just produced, uses one pattern here and a different one three files over, rebuilds something that already existed, and then can't quite tell you what it finished and what it didn't. So you read every diff yourself. The agent writes code, and you're still doing the planning, reviewing, QA, and firefighting. There's a limit to how far that stretches.

From One Agent to a Team

A more workable setup is to stop giving one agent the whole job and split it the way a functioning team already does. One agent plans the work, another builds it, another checks it.
Three roles get you most of the way.

| Role | Responsibility |
|---|---|
| Orchestrator | Understands the goal, asks the clarifying questions, writes the plan, sets milestones, and decides how the work is sequenced. |
| Worker | Implements one feature from clean context and commits it in a controlled way. |
| Validator | Checks the implementation independently, runs the checks, verifies behavior, and flags follow-up work. |

Keeping the building and the checking in different hands matters for the same reason people review each other's code. Whoever wrote it is invested in it working, and that bias is hard to spot from the inside. A fresh agent that had no part in those decisions tends to catch what the author missed.

How Agents Coordinate

Underneath the roles, the agents end up talking to each other in a few recurring ways, and it helps to have names for them.

Delegation is the obvious one, and usually the first that teams build. An agent hands a scoped task to another and waits for the result.

Creator-verifier is the one that matters most for software. One agent writes the code and a separate one, working from its own context, checks it. That separation is what stops an agent from grading its own homework.

Direct communication lets agents talk without a coordinator in the middle. It's tempting and it's fragile, since state scatters across separate conversations and sooner or later somebody acts on something out of date.

Negotiation is what happens when agents share a resource, which for us usually means the codebase. Two agents about to edit the same file have to work out who does what before they overwrite each other.

Broadcast is one agent telling the rest about something that changed, like a new constraint or a failure everyone needs to know about. It's the least exciting of the five, and the one that quietly keeps the long run from falling out of sync.
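The delegation and creator-verifier patterns above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not how any particular agent framework works: `run_feature`, `stub_worker`, and `stub_validator` are hypothetical names, and the two stubs are plain functions standing in for what would be separate model calls with separate contexts.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    feature: str
    code: str
    attempts: int
    approved: bool
    notes: list = field(default_factory=list)  # validator feedback, oldest first


def run_feature(feature, worker, validator, max_attempts=3):
    """Orchestrator-side loop: delegate to a worker, then let a separate validator check."""
    notes = []
    code = ""
    for attempt in range(1, max_attempts + 1):
        # Delegation: the worker gets only the scoped task plus prior feedback.
        code = worker(feature, list(notes))
        # Creator-verifier: the validator sees the task and the output,
        # not the worker's reasoning, so it cannot inherit the author's bias.
        ok, feedback = validator(feature, code)
        if ok:
            return Result(feature, code, attempt, True, notes)
        notes.append(feedback)
    return Result(feature, code, max_attempts, False, notes)


def stub_worker(feature, feedback):
    """Stand-in for a coding agent: adds input validation only after being told to."""
    body = "def handler(x):\n"
    if feedback:
        body += "    if x is None:\n        raise ValueError('x required')\n"
    return body + "    return x\n"


def stub_validator(feature, code):
    """Stand-in for an independent checker with its own acceptance rule."""
    if "raise ValueError" not in code:
        return False, "missing check for None input"
    return True, ""


result = run_feature("echo endpoint", stub_worker, stub_validator)
print(result.approved, result.attempts)  # True 2
print(result.notes)                      # ['missing check for None input']
```

The design choice worth noticing is that the worker never decides it is finished; only the validator's verdict ends the loop.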
Define "Done" Before Any Code Gets Written

Settling what "correct" means before anyone writes code does more for reliability than any amount of prompt tuning. It heads off a specific and very common failure. An agent builds a feature, then writes tests that wrap neatly around the feature it just built. Everything passes, coverage looks healthy, and none of it tells you whether the feature does what was actually asked for. Tests written after the code mostly confirm whatever the code already does. They don't find the bugs.

A validation contract flips that order. During planning, before there's any code, you write down what the feature has to do: the behavior that has to exist, the edge cases that matter, the flows that have to work, the regressions you can't allow. A small change might need a handful of those. A big feature can need hundreds, spread across the backend, the API, the front end, and the full end-to-end paths. Each one gets tied to a feature, and a feature isn't finished until it satisfies the ones assigned to it.

The effect is that "done" gets defined separately from however the code happens to come out. Workers build against the contract, validators check against it, and you stop relying on whether the code looks right and start measuring whether it works.

Passing Tests Aren't the Same as Working Software

You still want lint, type checks, unit tests, and code review. The trouble is that once an agent is shipping whole features on its own, those checks stop being enough. Plenty of changes pass every unit test and are still broken where it counts. The form renders fine, but the submit button does nothing. The endpoint returns exactly the right shape, filled with stale data. A flow that worked in isolation falls apart once it sits behind a login. A migration runs clean on a laptop and chokes on production-scale data.

So the better systems add a validator that works more like a QA engineer than a linter.
It launches the app, clicks around, fills in forms, and confirms the whole path works end to end. That's slow, and on a long task it's where most of the wall-clock time goes: not generating tokens, but waiting on a live application to do something and watching what it does. The trade is worth it, since generating code quickly without really checking it only gets you to the wrong answer faster.

In one production run an engineer at Factory described, building a clone of Slack, the project finished with about half its lines of code being tests and roughly 90% coverage, and the validation step never passed on its first try. That last part is the whole reason the loop exists.

Long Runs Can't Rely on Memory

Run something for hours or days and context starts leaking between the agents. A bigger context window doesn't really fix it. What helps is not letting a worker close out a task by simply announcing it's done. Instead, each worker leaves a written handoff: what it built, which files it touched, which commands it ran and how they exited, what it assumed along the way, what it ran into, and what it left unfinished.

That makes the run auditable. When validation fails, the orchestrator reads back through the handoffs, works out where things went sideways, scopes the fix, and pulls the run back on track at the next milestone instead of discovering the mess at the very end. The teams who make this work don't count on their agents remembering anything; they write enough down that the next agent can safely pick up where the last one stopped. Factory has reported runs lasting as long as sixteen days on this kind of setup.

More Agents Isn't More Throughput

The instinct is to run everything in parallel. Ten agents should mean ten times the work, right? For software, it usually doesn't play out that way. Agents running at the same time tend to edit the same files, redo work that's already done, and make architectural choices that don't line up with each other.
The effort of untangling all that eats whatever speed you gained, and you pay for the conflict in tokens on top of it.

What works better is to run the actual changes one at a time and save the parallelism for read-only work, like searching the codebase, reading docs, looking up an API, or reviewing code. On paper that's slower. Over a long task it comes out ahead, because you spend far less time cleaning up conflicts, the handoffs stay cleaner, and the whole thing behaves more predictably. Pile on more agents without coordinating them and you don't get speed so much as a codebase that disagrees with itself.

The Right Model in Each Seat

These systems also change how you pick models, because no single model is the right choice for every seat. Planning tends to go better with a model that reasons slowly and carefully. Writing code rewards speed and fluency instead. Checking the work rewards something closer to stubbornness: following the instructions exactly and giving nothing the benefit of the doubt. The model that writes the best code is often not the one you'd trust to grade it. There's even a case for running the validator on a different provider, so it doesn't carry the same blind spots as the model that wrote the code.

That's the argument for staying model-agnostic. You want to put the right model in each role and swap it out as models get better at particular things, rather than getting stuck with one vendor's weakest area showing up everywhere.

It works in the other direction too. A solid scaffolding of contracts, checkpoints, and independent validators can prop up a weaker or open-weight model and get more out of it than it would manage alone. Most of the orchestration in these systems lives in prompts and skills rather than hardcoded logic, which is the reason a new model release tends to make them better instead of obsolete.
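To make the validation contract and the written handoff described earlier more concrete, here is a minimal Python sketch of both as plain data structures. It is one possible shape under stated assumptions: the class names, field names, file paths, and commands are hypothetical, and a real system would persist these records rather than keep them in memory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    id: str
    feature: str
    description: str  # behavior written down during planning, before any code


@dataclass
class Handoff:
    feature: str
    built: str
    files_touched: list
    commands: list  # (command, exit_code) pairs
    assumptions: list = field(default_factory=list)
    unfinished: list = field(default_factory=list)

    def failed_commands(self):
        """Commands that exited non-zero, for the orchestrator to audit."""
        return [(cmd, code) for cmd, code in self.commands if code != 0]


class ValidationContract:
    def __init__(self, assertions):
        self.assertions = list(assertions)
        self.passed = set()

    def record(self, assertion_id, ok):
        """Store a validator's verdict for one assertion."""
        if assertion_id not in {a.id for a in self.assertions}:
            raise KeyError(f"unknown assertion: {assertion_id}")
        if ok:
            self.passed.add(assertion_id)
        else:
            self.passed.discard(assertion_id)

    def outstanding(self, feature):
        return [a.id for a in self.assertions
                if a.feature == feature and a.id not in self.passed]

    def is_done(self, feature):
        """Done means every assertion assigned to the feature has passed."""
        assigned = [a for a in self.assertions if a.feature == feature]
        return bool(assigned) and not self.outstanding(feature)


# Hypothetical contract and handoff, for illustration only.
contract = ValidationContract([
    Assertion("A1", "login", "valid credentials reach the dashboard"),
    Assertion("A2", "login", "wrong password shows an error and stays on the form"),
])
handoff = Handoff(
    feature="login",
    built="login form and session endpoint",
    files_touched=["auth/views.py", "auth/tests.py"],
    commands=[("pytest auth", 0), ("ruff check .", 1)],
    unfinished=["rate limiting"],
)

contract.record("A1", True)
print(contract.is_done("login"))  # False
contract.record("A2", True)
print(contract.is_done("login"))  # True
print(handoff.failed_commands())  # [('ruff check .', 1)]
```

Note that "done" here depends only on the contract, never on the worker's own claim; the handoff exists so that a failed check can be traced back to what was actually run.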
The Case for Fewer Agents

Everything up to here makes the case for splitting work across agents, so it's only fair to take the strongest counterargument seriously. In 2025, the team behind Devin put out a post titled "Don't Build Multi-Agents," and the heart of it is hard to dismiss.

They argue that most multi-agent failures come down to context getting fragmented. When you fan work out to parallel subagents, each one quietly makes its own assumptions, and those assumptions don't reconcile when the pieces come back together. One subagent picks a naming convention, another picks a different one, and you're left with something that reads as coherent but doesn't actually fit. Their advice is to keep one agent on a single thread and compress the context as it grows instead of spreading it across a crowd of workers.

Anthropic landed somewhere close, though more conditional, when it wrote up its own multi-agent research system around the same time. Splitting work across agents paid off for broad, parallel tasks like searching many sources at once, but it struggled on anything that needed one shared context and tight coordination, which is most of what software work is.

Both write-ups end up pointing at the same shape described here. Don't run agents in parallel on tightly coupled work. Split the work by role, and let the coupled parts happen in order.

What the Failure Data Shows

This isn't only field intuition, either. In 2025, a group at Berkeley published a study called "Why Do Multi-Agent LLM Systems Fail?" that went through failure traces from several well-known frameworks and grouped what went wrong. What stood out was where the failures landed. They mostly weren't about the model being too weak.
They were about design, with agents given vague roles or ignoring the roles they had; about coordination, with one agent sitting on information another needed or a conversation getting reset partway through; and about verification, with work marked finished that nobody really checked, or a run quitting too early. Those are the same three places this whole architecture tries to shore up, with clear roles, written handoffs, and validators that don't simply take an agent at its word.

There's also hard evidence that giving each worker fresh context is more than tidiness. The "lost in the middle" research found that models pay the most attention to the start and end of their context and the least to whatever sits in the middle. Later work on "context rot" found accuracy slipping as the input gets longer, even on simple lookups. A worker drowning in a long accumulated history is a real, measured liability, not a theoretical one, and handing each worker a clean slate keeps the model working in the range where it's actually reliable.

The Bill Comes Due

It's easy to underestimate what these systems cost. More agents running for longer means a lot more tokens. Anthropic reported that a single agent already burns through several times the tokens of an ordinary chat, and a multi-agent system can use roughly an order of magnitude more on top of that. That only pencils out on work that's worth the spend. Running a multi-agent system to fix a typo is just an expensive way to fix a typo.

A couple of things keep it in check. One is prompt caching. A long run reads the same stable context over and over (the system prompt, the codebase, the plan), and caching that material so it isn't reprocessed every time cuts the bill sharply, which is why anyone running these in production leans on it. The other is the serial discipline from earlier: every conflict you don't create is a repair cycle you don't pay for, and repair cycles are where a lot of tokens quietly disappear.
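The effect of prompt caching on a long run's bill can be made concrete with a little arithmetic. The sketch below is a simplified model under loudly stated assumptions: the token counts, per-million-token prices, cached fraction, and cache discount are all hypothetical placeholders rather than any provider's actual pricing, and it ignores details such as cache-write surcharges and cache expiry.

```python
def run_cost(input_tokens, output_tokens, price_in, price_out,
             cached_fraction=0.0, cache_discount=0.0):
    """Estimated cost of a run.

    price_in / price_out are prices per million tokens (hypothetical units).
    cached_fraction is the share of input tokens served from cache;
    cache_discount is the fraction taken off the input price for those tokens.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if not 0.0 <= cache_discount <= 1.0:
        raise ValueError("cache_discount must be between 0 and 1")
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    cost_in = (fresh + cached * (1.0 - cache_discount)) * price_in / 1_000_000
    cost_out = output_tokens * price_out / 1_000_000
    return cost_in + cost_out


# Hypothetical long run: 10M input tokens re-read across steps, 1M output tokens.
no_cache = run_cost(10_000_000, 1_000_000, price_in=2.0, price_out=10.0)
with_cache = run_cost(10_000_000, 1_000_000, price_in=2.0, price_out=10.0,
                      cached_fraction=0.8, cache_discount=0.9)

print(round(no_cache, 2))    # 30.0
print(round(with_cache, 2))  # 15.6
```

With these made-up numbers, the input side drops from 20.0 to 5.6 while the output side stays at 10.0, which is why stable, repeatedly read context is the part worth caching.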
How much these systems cost is mostly a design question, not a billing one.

A Bigger Attack Surface

Security rarely shows up on the architecture diagram, and every agent you add is another door. Even a single agent has a well-known soft spot in prompt injection, where instructions tucked into a web page or a file or a tool's output get read as commands rather than data. Add more agents and the problem grows. A poisoned document that one worker reads can smuggle instructions through a handoff into another worker with more access, or one that touches production directly. The shared state and the messages agents pass around become a channel an attacker can aim at on purpose. This is the kind of thing you build in from the start, because it's painful to bolt on later.

The same controls that keep these systems correct also keep them safer. Validators that won't take an agent's own word for it, handoffs that record exactly which commands ran and what came back, limits on what any single worker is allowed to reach: all of that doubles as containment, so one compromised step can't quietly become a compromised system. The audit trail that helps a run recover from its own mistakes is the same one you'll be glad to have when something goes wrong on purpose.

Where This Leaves the Engineer

None of this puts engineers out of work. It moves the work up a level. Instead of hand-driving every step of an implementation, you spend your time deciding what should get built, what the real constraints are, what counts as correct, which parts of the architecture are worth protecting, and when a human has to sign off. It feels more like running a delivery operation than like chatting with a bot. And the biggest gain usually isn't speed. It's keeping several streams of work moving at once without quality slipping, and often ending up with a codebase in better shape than when you started, since the tests and checks and handoffs all become part of what ships.
The real skill is knowing when to reach for any of this. For a small, contained change, one good agent on a single thread is simpler and cheaper and less likely to wander off. For serious delivery at scale, you need the planning and checking and recovery that a team provides, and the only way agents can do that work is inside the same kind of structure a team uses: real roles, a shared definition of done agreed before anyone starts, honest handoffs, shared state, and execution kept under control rather than just turned up to full speed.
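The idea of a validator that doesn't take an agent at its word, working from a handoff that records which commands ran and what came back, can be sketched in a few lines. This is a minimal illustration, not a description of any particular framework: the `Handoff` and `CommandResult` records, the `validate` function, and the example commands are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class CommandResult:
    """One command a worker ran, with the exit code it actually returned."""
    command: str
    exit_code: int


@dataclass
class Handoff:
    """Hypothetical handoff record passed from a worker to a validator."""
    task: str
    claimed_done: bool
    commands: list[CommandResult] = field(default_factory=list)


def validate(handoff: Handoff, required_checks: set[str]) -> list[str]:
    """Return the reasons to reject the handoff; an empty list means accept."""
    problems = []
    ran = {result.command for result in handoff.commands}
    # The worker's own "done" flag is necessary but never sufficient.
    if not handoff.claimed_done:
        problems.append("worker did not mark the task done")
    # Every check agreed on up front must appear in the recorded commands.
    for check in sorted(required_checks - ran):
        problems.append(f"required check never ran: {check}")
    # Any recorded failure blocks acceptance, whatever the worker claims.
    for result in handoff.commands:
        if result.exit_code != 0:
            problems.append(f"command failed ({result.exit_code}): {result.command}")
    return problems


handoff = Handoff(
    task="add retry to client",
    claimed_done=True,
    commands=[CommandResult("pytest -q", 0)],
)
print(validate(handoff, {"pytest -q", "ruff check ."}))
```

Here the worker claims completion and its tests passed, but the handoff is still rejected because one agreed check was never run. A real validator would likely re-run the checks itself rather than trust recorded exit codes; this sketch only shows the acceptance logic.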
Why Query Optimization Matters

A Spark query written by a human and a Spark query executed by the engine are often very different things. The gap between them, the optimization, is what separates a job that runs in 3 minutes from one that runs in 3 hours on identical hardware.

Databricks augments Spark's native Catalyst optimizer with two additional layers:

- Adaptive Query Execution (AQE): re-optimizes the query at runtime using actual statistics collected mid-job
- Photon: a C++ vectorized execution engine that replaces the JVM-based Spark executor for eligible operators

Understanding all three lets you write queries that cooperate with the engine rather than fight it.

The Catalyst Optimizer Pipeline

Catalyst is Spark's rule-based and cost-based query optimizer. Every query, whether written in SQL, DataFrame API, or Dataset API, passes through the same four-stage pipeline before a single byte of data is read.

Stage 1: Parsing - From SQL to Unresolved Logical Plan

Python

# ── Catalyst Stage 1: Parsing ─────────────────────────────────────────────────
# Spark uses ANTLR4 to parse SQL into an Abstract Syntax Tree (AST).
# At this point column names are NOT validated; the plan is "unresolved".
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# Both of these produce equivalent internal representations
df_api = (
    spark.table("prod.silver.events_clean")
    .filter("event_type = 'purchase'")
    .groupBy("platform")
    .agg({"revenue": "sum"})
)

sql_api = spark.sql("""
    SELECT platform, SUM(revenue) AS total_revenue
    FROM prod.silver.events_clean
    WHERE event_type = 'purchase'
    GROUP BY platform
""")

# Inspect the unresolved logical plan (before analysis).
# Use "extended" mode here: "formatted" prints only the physical plan.
df_api.explain(mode="extended")
# Output includes a section along these lines:
# == Parsed Logical Plan ==
# 'Aggregate ['platform], ['platform, unresolvedAlias('sum('revenue), None)]
# +- 'Filter ('event_type = 'purchase)
#    +- 'UnresolvedRelation [prod, silver, events_clean]

The key insight here: UnresolvedRelation and unresolvedAlias mean Spark hasn't touched the catalog yet. Column names could be typos at this point and Catalyst doesn't know.

Stage 2: Analysis - Binding to the Catalog

The Analyzer walks the unresolved AST and looks up every relation and attribute against the Catalog (in Databricks, this is Unity Catalog). It resolves column names, infers data types, validates references, and binds functions.

Python

# ── Catalyst Stage 2: Analysis ────────────────────────────────────────────────
# After analysis, every column is resolved to a specific attribute with a type.
# AnalysisException is thrown HERE if a column doesn't exist.
from pyspark.sql import functions as F
from pyspark.sql.utils import AnalysisException

# Example of what Analysis catches:
try:
    spark.table("prod.silver.events_clean") \
        .select("nonexistent_column") \
        .show()
except AnalysisException as e:
    print(f"Analysis failed: {e}")
# → AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION]
# A column or function parameter with name `nonexistent_column` cannot be resolved.
# After successful analysis, inspect the resolved plan
df = (
    spark.table("prod.silver.events_clean")
    .filter(F.col("event_type") == "purchase")
    .select("platform", "revenue", "user_id")
)

# The analyzed plan shows fully qualified attribute IDs like:
# == Analyzed Logical Plan ==
# platform: string, revenue: double, user_id: string
# Project [platform#42, revenue#67, user_id#31]
# +- Filter (event_type#39 = purchase)
#    +- Relation prod.silver.events_clean[...] parquet
print(df._jdf.queryExecution().analyzed())

Stage 3: Logical Optimization - Rule-Based Rewrites

This is where Catalyst applies its 100+ built-in rules to produce an equivalent but cheaper logical plan. Rules fire repeatedly in fixed-point iteration until the plan stabilises.

Python

# ── Catalyst Stage 3: Key Optimization Rules ──────────────────────────────────

# RULE 1: Predicate Pushdown
# Catalyst moves filters as close to the data source as possible,
# so Spark reads fewer rows from Parquet.
df_before = (
    spark.table("prod.silver.events_clean")
    .join(
        spark.table("prod.silver.users_clean"),
        on="user_id"
    )
    .filter(F.col("event_type") == "purchase")  # ← filter AFTER join
)

# Catalyst rewrites this internally as if you wrote:
df_after_equivalent = (
    spark.table("prod.silver.events_clean")
    .filter(F.col("event_type") == "purchase")  # ← filter BEFORE join
    .join(
        spark.table("prod.silver.users_clean"),
        on="user_id"
    )
)
# Result: potentially millions fewer rows shuffled during the join

# RULE 2: Column Pruning
# Catalyst removes columns not needed by downstream operators.
# Even if you select(*), Spark will only read the columns it needs.
df_pruned = (
    spark.table("prod.silver.events_clean")
    .select("*")
    .filter(F.col("event_type") == "purchase")
    .groupBy("platform")
    .agg(F.sum("revenue").alias("total_revenue"))
)
# Internally, Catalyst prunes all columns except: event_type, platform, revenue

# RULE 3: Constant Folding
# Expressions with only literals are evaluated at plan time, not per-row.
df_constants = spark.range(1000).select(
    F.lit(2 + 3 * 4).alias("always_14"),             # folded to Literal(14) at plan time
    (F.col("id") * F.lit(1)).alias("same_id"),       # simplified to just col("id")
)

# RULE 4: Boolean Simplification
# AND/OR chains with tautologies or contradictions are collapsed
df_simplified = spark.range(100).filter(
    (F.col("id") > 10) & F.lit(True)  # simplified to just (col("id") > 10)
)

# See all optimizations applied:
print(df_pruned._jdf.queryExecution().optimizedPlan())

Stage 4: Physical Planning - Strategies and Cost Models

The physical planner maps each logical operator to one or more physical implementations and selects the best one using a cost model. The most impactful decision here is join strategy selection.

Python

# ── Catalyst Stage 4: Physical Planning & Join Strategies ─────────────────────

# JOIN STRATEGY 1: Broadcast Hash Join (BHJ)
# Best when one side is small enough to fit in executor memory.
# No shuffle: the small table is broadcast to all workers.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10mb")  # default

large_df = spark.table("prod.silver.events_clean")   # 500GB
small_df = spark.table("prod.gold.product_catalog")  # 8MB ← will be broadcast
result_bhj = large_df.join(small_df, on="product_id")  # BHJ auto-selected

# Force BHJ with a broadcast hint (overrides threshold check):
from pyspark.sql.functions import broadcast
result_forced = large_df.join(broadcast(small_df), on="product_id")

# JOIN STRATEGY 2: Sort Merge Join (SMJ)
# Default for large-large joins. Both sides are sorted and merged.
# Requires a full shuffle: expensive but handles any size.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  # disable BHJ
large_df2 = spark.table("prod.silver.transactions_clean")  # 200GB
result_smj = large_df.join(large_df2, on="user_id")  # SMJ selected

# JOIN STRATEGY 3: Shuffle Hash Join (SHJ)
# Hash-based, no sort. Chosen by AQE when one side is much smaller
# than the other but still above the broadcast threshold.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

# WHOLE-STAGE CODEGEN: Spark fuses multiple operators into a single
# Java function to avoid virtual dispatch overhead and intermediate objects.
# Verify it's active in your plan:
spark.conf.set("spark.sql.codegen.wholeStage", "true")  # default
# Plans are built lazily, and automatic broadcast was disabled above, so
# inspect the hinted join: the broadcast hint still applies.
result_forced.explain(mode="formatted")
# Look for: *(1) BroadcastHashJoin. The *(N) prefix = WholeStageCodegen stage N

Adaptive Query Execution (AQE)

AQE is the most impactful runtime optimization layer on Databricks. It materializes shuffle map output statistics at shuffle boundaries and uses them to make three key decisions after data has been partially processed.

Python

# ── AQE Configuration ─────────────────────────────────────────────────────────
# AQE is ON by default in Databricks Runtime 7.3+
spark.conf.set("spark.sql.adaptive.enabled", "true")

# 1. Dynamic Partition Coalescing
# Merges small post-shuffle partitions to avoid thousands of tiny tasks
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128mb")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")

# 2. Dynamic Join Strategy Switching
# Allows AQE to convert SMJ → BHJ at runtime if a side turns out small
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")
# AQE broadcast threshold (can be higher than static threshold since
# we now KNOW the actual size)
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "30mb")

# 3. Skew Join Optimization
# Splits oversized partitions and replicates the non-skewed side
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")  # 5x median
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256mb")

# Verify AQE decisions in the query plan:
df = (
    spark.table("prod.silver.events_clean")
    .join(spark.table("prod.silver.users_clean"), on="user_id")
    .groupBy("platform")
    .agg(F.sum("revenue").alias("total"))
)
df.explain(mode="formatted")
# Look for: AdaptiveSparkPlan isFinalPlan=true
# and: == Final Physical Plan == (shows post-AQE decisions)

The Photon Engine

Photon is Databricks' native vectorized query engine written in C++. It replaces the JVM-based Spark executor for eligible operations, processing data in column-oriented batches (vectors) rather than row-by-row.

Python

# ── Photon Configuration & Verification ───────────────────────────────────────
# Photon is available on Databricks Runtime 9.1+ with Photon-enabled clusters.
# Enable it at the cluster level (UI: Cluster > Configuration > Enable Photon)
# or via config:
spark.conf.set("spark.databricks.photon.enabled", "true")

# Photon-accelerated operators (as of DBR 13.x):
# ✅ Scan (Parquet, Delta)     ✅ Filter / Project
# ✅ Hash Aggregate            ✅ Sort
# ✅ Broadcast Hash Join       ✅ Sort Merge Join
# ✅ Window functions          ✅ Union / Expand
# ✅ String functions          ✅ Math functions
# ❌ UDFs (Python/Scala)       ❌ Some complex types
# ❌ Streaming (partial)       ❌ RDD-based operations

# Verify Photon is executing your query:
df = spark.sql("""
    SELECT
        platform,
        DATE_TRUNC('month', event_ts) AS month,
        SUM(revenue) AS total_revenue,
        COUNT(DISTINCT user_id) AS unique_buyers,
        AVG(revenue) AS avg_order_value
    FROM prod.silver.events_clean
    WHERE event_type = 'purchase'
      AND event_ts >= '2024-01-01'
    GROUP BY platform, DATE_TRUNC('month', event_ts)
    ORDER BY month DESC, total_revenue DESC
""")
df.explain(mode="formatted")

# Look for operators prefixed with "Photon" in the physical plan:
# == Physical Plan ==
# PhotonResultStage
# +- PhotonSort [month DESC NULLS LAST, total_revenue DESC NULLS LAST]
#    +- PhotonShuffleExchangeSink hashpartitioning(platform, month)
#       +- PhotonGroupingAgg [platform, month], [sum(revenue), count(user_id), avg(revenue)]
#          +- PhotonFilter (event_type = purchase AND event_ts >= 2024-01-01)
#             +- PhotonScan parquet prod.silver.events_clean

# Photon performance metrics appear in Spark UI under "Photon Metrics":
# - Photon scan time
# - Photon total compute time
# - Rows processed by Photon vs fallback JVM

Reading Explain Plans

The explain(mode="formatted") output is your primary debugging tool.
Here's how to read it efficiently:

Python

# ── Explain Plan Modes ────────────────────────────────────────────────────────
df = (
    spark.table("prod.silver.events_clean")
    .filter(F.col("event_type") == "purchase")
    .join(broadcast(spark.table("prod.gold.product_catalog")), on="product_id")
    .groupBy("platform", "category")
    .agg(
        F.sum("revenue").alias("total_revenue"),
        F.count("*").alias("transaction_count")
    )
)

# Mode 1: simple (default): compact physical plan tree
df.explain()

# Mode 2: extended: all four plans (parsed, analyzed, optimized, physical)
df.explain(mode="extended")

# Mode 3: formatted: human-readable with operator details (RECOMMENDED)
df.explain(mode="formatted")

# Mode 4: cost: includes estimated row counts and sizes (requires ANALYZE TABLE)
df.explain(mode="cost")

# Mode 5: codegen: shows generated Java code for WholeStageCodegen
df.explain(mode="codegen")

# ── Key Signals to Look For ───────────────────────────────────────────────────
# ✅ GOOD signs:
#   *(N) prefix          → WholeStageCodegen active (operators fused)
#   BroadcastHashJoin    → small table correctly broadcast, no shuffle
#   PhotonXxx            → Photon accelerating this operator
#   AdaptiveSparkPlan    → AQE is engaged
#   PartitionFilters     → Delta/Parquet file skipping active
#   PushedFilters        → filters pushed to Parquet reader

# ❌ WARNING signs:
#   Exchange (shuffle)   → unexpected shuffle (missing broadcast hint?)
#   SortMergeJoin        → large-large join (may need Z-ORDER or AQE tuning)
#   HashAggregate x2     → partial + final agg = shuffle involved
#   CartesianProduct     → missing join condition! Will OOM on large tables
#   ObjectHashAggregate  → non-codegen path, JVM overhead
#   GenerateXxx          → explode() or similar, can't be fused

# ── ANALYZE TABLE: feed statistics to CBO ─────────────────────────────────────
# Without stats, Catalyst falls back to rough size estimates (largely file
# sizes), with no row counts or column distributions to work from.
# Run ANALYZE to give the Cost-Based Optimizer real numbers.
spark.sql("ANALYZE TABLE prod.silver.events_clean COMPUTE STATISTICS")
spark.sql("""
    ANALYZE TABLE prod.silver.events_clean
    COMPUTE STATISTICS FOR COLUMNS user_id, event_type, platform, revenue
""")
# Now explain(mode="cost") shows real row counts and sizes

Tuning Reference Table

A quick-reference guide for the most impactful Spark/Databricks configs, what they control, and when to change them:

| Config Key | Default | What It Controls | When to Tune |
| --- | --- | --- | --- |
| spark.sql.adaptive.enabled | true | Master AQE switch | Keep on; only disable for debugging |
| spark.sql.adaptive.advisoryPartitionSizeInBytes | 64mb | Target post-coalesce partition size | Increase to 128mb–256mb for large shuffles |
| spark.sql.adaptive.skewJoin.enabled | true | AQE skew split | Keep on; tune skewedPartitionFactor if needed |
| spark.sql.autoBroadcastJoinThreshold | 10mb | Static BHJ threshold | Increase to 50mb–100mb if executor memory allows |
| spark.sql.adaptive.autoBroadcastJoinThreshold | 30mb | AQE runtime BHJ threshold | Increase if AQE isn't catching small tables |
| spark.sql.shuffle.partitions | 200 | Default shuffle partition count | Set to 8 × num_cores for your cluster |
| spark.sql.files.maxPartitionBytes | 128mb | Max bytes per Parquet read partition | Reduce for high-parallelism scans |
| spark.databricks.photon.enabled | true | Photon vectorized engine | Keep on; disable only for UDF-heavy jobs |
| spark.sql.codegen.wholeStage | true | Whole-Stage CodeGen fusion | Keep on; disable only for debugging |
| spark.sql.statistics.histogram.enabled | false | Column histograms for CBO | Enable before running ANALYZE TABLE ... FOR COLUMNS, which is when histograms are built |
| spark.sql.cbo.enabled | true | Cost-Based Optimizer | Keep on; requires ANALYZE TABLE to be useful |
| spark.databricks.delta.optimizeWrite.enabled | true | Auto bin-pack write files | Keep on for all Delta writes |

Key Takeaways

- Catalyst has four stages: Parse → Analyze → Optimize → Plan.
  Each stage has a distinct job, and understanding them tells you exactly where to look when a query misbehaves.
- Predicate pushdown and column pruning are the two most impactful automatic optimizations: they reduce the data volume Spark has to move before any aggregation or join.
- AQE is not a set-and-forget feature: tune advisoryPartitionSizeInBytes to your actual data sizes, and verify its decisions with explain(mode="formatted"), looking for AdaptiveSparkPlan isFinalPlan=true.
- Photon drops in transparently for most SQL and DataFrame operations. The exceptions are Python UDFs, RDD operations, and some complex types; refactor these away from hot paths.
- Run ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS on your most-joined tables. The CBO's join ordering and strategy decisions improve dramatically with real statistics vs. default estimates.
- explain(mode="formatted") is your most important debugging tool. Learn to read it before reaching for cluster config changes.
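The warning signs listed under Reading Explain Plans can also be checked mechanically once the plan text has been captured as a string. The sketch below is plain Python with no Spark dependency; `scan_plan`, `WARNING_SIGNS`, and the sample plan are hypothetical names and data made up for illustration, and how you capture the output of explain() as text is left as an assumption.

```python
import re

# Operator names come from the warning list in this article;
# the advice strings are paraphrased.
WARNING_SIGNS = {
    "CartesianProduct": "missing join condition",
    "SortMergeJoin": "large-large join; check broadcast eligibility or AQE settings",
    "ObjectHashAggregate": "non-codegen aggregation path",
}


def scan_plan(plan_text: str) -> dict[str, int]:
    """Count occurrences of each warning operator in a captured plan string."""
    counts = {}
    for operator in WARNING_SIGNS:
        # \b keeps "HashAggregate" from matching inside "ObjectHashAggregate".
        hits = len(re.findall(rf"\b{operator}\b", plan_text))
        if hits:
            counts[operator] = hits
    return counts


# Hypothetical plan text, shaped like a physical plan but not real output.
SAMPLE_PLAN = """
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[platform#42], functions=[sum(revenue#67)])
   +- Exchange hashpartitioning(platform#42, 200)
      +- HashAggregate(keys=[platform#42], functions=[partial_sum(revenue#67)])
         +- SortMergeJoin [user_id#31], [user_id#88], Inner
            :- Sort [user_id#31 ASC NULLS FIRST]
            +- Sort [user_id#88 ASC NULLS FIRST]
"""

for operator, count in scan_plan(SAMPLE_PLAN).items():
    print(f"{operator} x{count}: {WARNING_SIGNS[operator]}")
```

A check like this is only a first filter. A SortMergeJoin between two genuinely large tables is expected, so each hit still needs a human to read the plan around it.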
On December 2, 2024, a security vendor called BeyondTrust noticed something wrong inside its own AWS account. By the time the investigation closed, the story that emerged was almost absurdly simple for something with this much fallout: an attacker, later attributed to the Chinese state-sponsored group Silk Typhoon, had used a software flaw to reach into a BeyondTrust cloud account and pull out an API key. Not a password. Not a phishing victim's login. A string of characters that a piece of software used to talk to another piece of software. With that one key, the attacker walked straight into the U.S. Department of the Treasury, reset internal passwords, accessed workstations inside the Office of Foreign Assets Control, and read unclassified documents before anyone noticed. The Treasury disclosed it to Congress on December 30. The Department of Justice indicted the alleged operators in March 2025.

If you've never worked in security, here's the plain-English version of what happened: somewhere inside the machinery that runs modern software, there's almost always a "key," a credential one computer program shows another to prove it's allowed to be there. Humans log in with passwords and, increasingly, a second factor on their phone. Software mostly doesn't. It just holds a key, often for months or years at a time, and whoever holds that key gets treated as trustworthy, no questions asked. The Treasury breach happened because one of those keys ended up in the wrong hands and nothing else stood between that key and a federal agency's internal documents.

Two months later, a different flavor of the same problem produced the largest theft of digital assets in history.

$1.5 Billion, One Developer's Laptop

In February 2025, the cryptocurrency exchange Bybit lost approximately $1.5 billion in Ethereum in a single operation.
Palo Alto Networks' Unit 42 threat research team later tied the attack to Slow Pisces, a North Korean state-linked group also known as Lazarus or TraderTraitor, and traced the entry point back to a developer at a third-party vendor that managed Bybit's multi-signature wallet infrastructure. The attackers didn't break Ethereum's cryptography. They stole that developer's AWS session tokens, another form of machine credential, and used them to gain administrative access to cloud infrastructure that could authorize transactions, then quietly altered what a routine-looking transaction actually did before it executed.

Unit 42 then found the same pattern at a second cryptocurrency exchange later in 2025, this time running through Kubernetes, the orchestration system that now runs much of the cloud-native world. The attackers phished a developer, used the access on the developer's machine to drop a malicious workload directly into the exchange's production Kubernetes cluster, and had that workload expose its own service account token, a credential Kubernetes automatically hands to every running pod so it can talk to the cluster's control plane. The stolen token happened to belong to a CI/CD management identity with sweeping permissions. From there, the intruders queried secrets across namespaces, planted a backdoor, and pivoted into the exchange's cloud-hosted backend, reaching the financial systems behind it. Unit 42's broader research found suspicious activity consistent with service-account-token theft in 22 percent of cloud environments analyzed in 2025, and recorded a 282 percent year-over-year jump in Kubernetes-directed attacks overall.

Different industries, different attackers, same root cause: a non-human credential that was both long-lived and broader in scope than the task in front of it ever needed.

Why This Keeps Happening

Identity and access management, as a discipline, was built for people.
People have managers, onboarding dates, performance reviews, and an HR system that flags them the day they leave. A workload has none of that. A microservice can spin up, do its job, and disappear thousands of times a day; a service account, by contrast, often gets created once and never revisited again. CyberArk's research has been blunt about the resulting imbalance: machine identities now outnumber human ones by more than 80 to 1 in the average enterprise, and the security architecture protecting most of them still assumes the old, human-shaped world — an org chart, not a fleet of ephemeral containers. That mismatch is exactly why static secrets sprawl the way they do. A developer hardcodes a key during a deadline crunch, intending to externalize it "later." A Terraform state file ends up holding plaintext cloud credentials because nobody flagged it in review. A default Kubernetes service account token, more permissive than anyone realized, gets mounted into a pod by default because turning that off requires deliberate configuration most teams never get around to. None of these are exotic mistakes. They're the ordinary residue of moving fast, and they accumulate the way unpaid debt does — quietly, until the day someone calls it in. The structural fix has a name by now, even if adoption is uneven: frameworks like SPIFFE and its production runtime SPIRE replace the static key with a short-lived, cryptographically attested identity — something closer to a backstage pass that's reissued before every single show rather than a master key cut once and handed out forever. A workload proves what it actually is — which Kubernetes service account launched it, which container image it's running — and receives an identity document valid for minutes, not months. Steal that, and an attacker is racing a clock that resets automatically rather than one that only resets when a human notices something is wrong. 
Cloud providers offer narrower versions of the same idea for their own platforms (AWS's IAM Roles for Service Accounts, Google's Workload Identity Federation), letting a workload trade a short-lived token for cloud access instead of carrying a standing key in the first place.

But identity alone doesn't close the loop, and this is the part most "zero trust" conversations skip past. None of it matters if nothing in your pipeline actually enforces it.

Security By Design Is a Promise. CI/CD Is Where You Find Out If It's Kept.

Plenty of organizations will tell you, with complete sincerity, that they practice "security by design." Most of them mean it stopped at an architecture review months before the first line of code shipped. That's not a fix, it's a memory of one. Code that deploys daily, sometimes hourly, doesn't wait for an annual audit to catch a misconfigured token or an over-privileged service account, and by the time a quarterly review would have caught the BeyondTrust-style key or the Bybit-style session token, the damage in both real cases was already done. The only version of "security by design" that survives contact with a real production pipeline is the one written as code and enforced automatically, at every stage, by something that can actually say no.

Picture the pipeline this way:

Plain Text

Developer commits code
        |
        v
CI build triggers
        |
        +--> SAST (code flaws) + SCA (dependency CVEs) + secrets scan
        |         |
        |         | fail? -----> build blocked, developer notified
        |         |
        | pass
        v
Generate SBOM + sign artifact (Cosign) + build provenance (SLSA)
        |
        v
Policy-as-code gate (OPA / Kyverno)
        |
        +--> checks: image from approved registry? running as non-root?
        |            signature valid? provenance matches expected builder?
        |            service account scoped to least privilege?
        |         |
        |         | fail? -----> deployment rejected, logged, alert raised
        | pass
        v
Deploy to production
        |
        v
Runtime monitoring + short-lived workload identity (SPIFFE/SPIRE, IRSA)
        |
        v
Continuous re-verification: nothing trusted indefinitely

Every box in that chain is a place where the Treasury breach or the Bybit breach could have stopped instead of escalating. A policy-as-code rule using Open Policy Agent's Rego language, or Kyverno's Kubernetes-native YAML equivalent, can flatly refuse to schedule a pod requesting broader RBAC permissions than its declared task needs, which would have directly undercut the over-privileged CI/CD identity that the crypto-exchange attackers rode into the cluster. A signing and attestation step using Cosign, tied to SLSA provenance, means a deployed artifact has to prove which build system actually produced it before it runs at all, closing exactly the kind of trust gap that let a single compromised AWS asset cascade into a stolen infrastructure API key at BeyondTrust. None of this is theoretical tooling. Red Hat's own Enterprise Contract documentation describes signing as tying an image to a specific builder identity precisely so an attacker can't substitute a malicious binary without the signature itself breaking and announcing the tampering.

The Uncomfortable Bottom Line

I don't think either of these headline breaches happened because anyone involved was careless in some obvious, fireable way. They happened because the credential (not the firewall, not the encryption, not the cleverness of the malware) was the actual asset under attack the entire time, and almost nothing downstream of "the key worked" was built to ask a second question. Gartner named non-human identity management a top strategic security trend for exactly this reason in 2025, and OWASP followed with a dedicated Non-Human Identity Top 10 the same year, an overdue acknowledgment that the tooling built for human logins was never going to be enough.
My honest prediction, watching this pattern repeat across a federal agency and two of the largest crypto exchanges on earth within twelve months of each other: the organizations that treat policy-as-code enforcement and short-lived machine identity as default infrastructure — not optional hardening bolted on after an incident — are the ones that won't end up writing the next version of this story. Everyone else is currently running on borrowed time, secured by a key that, statistically, is already older than it should be.
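To make the policy-as-code gate from the pipeline above concrete, here is a minimal sketch of the checks it describes, written as plain Python rather than Rego or Kyverno YAML. It is a stand-in for the logic only: the request fields, the `evaluate_deployment` function, and the registry prefix are hypothetical and do not correspond to any real admission-controller API.

```python
# Hypothetical registry prefix; a real gate would load this from policy config.
APPROVED_REGISTRIES = ("registry.example.internal/",)


def evaluate_deployment(request: dict) -> list[str]:
    """Return policy violations for a deployment request; empty means admit."""
    violations = []
    if not request["image"].startswith(APPROVED_REGISTRIES):
        violations.append("image is not from an approved registry")
    if not request.get("run_as_non_root", False):
        violations.append("container may run as root")
    if not request.get("signature_verified", False):
        violations.append("artifact signature was not verified")
    # Least privilege: everything granted must appear in what the task declared.
    granted = set(request.get("granted_permissions", []))
    declared = set(request.get("declared_permissions", []))
    excess = granted - declared
    if excess:
        violations.append(
            "permissions beyond declared need: " + ", ".join(sorted(excess))
        )
    return violations


request = {
    "image": "registry.example.internal/payments:1.4.2",
    "run_as_non_root": True,
    "signature_verified": True,
    "declared_permissions": ["configmaps:get"],
    "granted_permissions": ["configmaps:get", "secrets:list"],
}
print(evaluate_deployment(request))
```

The request above passes the registry, non-root, and signature checks but is still rejected, because its service account can list secrets that its declared task never needed. That is the over-privileged identity pattern from the Kubernetes incident, caught before deployment instead of after.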
There is a widespread assumption circulating in engineering teams right now that goes something like this: if AI can write code faster, it probably makes testing less of a bottleneck too. The logic seems reasonable on the surface. Faster code, faster tests, faster everything.

This assumption is wrong, and teams that act on it are going to find out the hard way.

AI-generated code does not reduce the need for regression testing. It amplifies it. And the teams that understand this early will have a significant quality advantage over those that do not.

The Fundamental Misunderstanding

When developers use AI coding assistants to generate functions, services, or entire modules, they are not producing code that has been verified against the real behavior of their system. They are producing code that is syntactically correct and structurally plausible, written by a model that has no knowledge of how their specific application actually runs in production.

This is a critically important distinction. A human developer who has worked on a codebase for months carries implicit knowledge about which edge cases matter, which downstream services are flaky, and which data patterns appear in production that were never anticipated in the original requirements. An AI model has none of this context. It produces code that looks right and often is right for the happy path, but it has no way of knowing what the code needs to handle in the real world.

The result is a class of defects that regression testing is uniquely positioned to catch: behaviors that work in isolation but break in the context of the full system.

The Velocity Trap

Here is where teams get into trouble. AI coding tools are genuinely fast. Developers using them can produce working code at a rate that was not possible before, and the productivity gains are real. But velocity without verification is just a faster path to production failures.

The pattern plays out predictably.
A team adopts AI coding assistance, development speed increases, the engineering leadership is happy, and everyone agrees to keep moving fast. What nobody adjusts is the regression testing strategy. The test suite that was sized for the previous pace of development is now covering a larger surface area of code, generated at higher volume, by a process that has no awareness of production context.

Coverage gaps compound quietly. Nobody sees them until something breaks in production in a way that takes two days to trace back to a function that an AI wrote last sprint and nobody fully read.

What AI-Generated Code Actually Gets Wrong

The failures that emerge from inadequate regression coverage of AI-generated code tend to cluster in specific areas.

Integration points are the most common failure zone. AI generates code based on interfaces and contracts. It looks at API signatures, function definitions, and data schemas. What it cannot see is how those contracts actually behave when real traffic flows through them.

Consider a realistic scenario: an AI-generated service calls a downstream payment processor using the documented API specification. The code is technically correct. But the payment processor returns a slightly different response shape when a transaction is declined due to insufficient funds than when it is declined due to an expired card. The specification does not document the difference, so the AI has no way to know it exists. A regression suite built from real production traffic would catch this within the first test run. A regression suite built from the same specification the AI used to write the code will not catch it until a customer sees a wrong error message in production.

Mock drift compounds the problem. When tests for AI-generated code are written using mocked dependencies, those mocks represent what the developer or AI thought the dependency would do. Over time, the real dependency changes and the mocks do not.
Tests keep passing, the real behavior keeps drifting, and the regression suite provides false confidence rather than real coverage.

AI-generated code optimizes for the stated requirement. It handles the case described in the prompt competently. It does not handle the cases that were not in the prompt: the empty array that should return a specific error, the timestamp that crosses a timezone boundary, the concurrent request that triggers a race condition. These are edge cases that only emerge from real usage patterns, and they are precisely what a regression suite built from real traffic catches where tests written from requirements do not.

The Regression Testing Response

Understanding these failure modes points directly to what needs to change in regression testing strategy when AI-generated code becomes part of the development process.

Test generation needs to be grounded in real behavior, not assumed behavior. The traditional model of writing tests based on requirements becomes increasingly insufficient when the code being tested was generated by a model that had access only to those same requirements. The regression suite ends up testing exactly what the AI thought the code should do. Tests need to be grounded in what the system actually does when real requests flow through it.

Integration test coverage becomes more important than unit test coverage. AI-generated code can usually pass unit tests because it generates syntactically correct implementations of isolated functions. The failures emerge at integration points. Regression testing that focuses on the integration layer, verifying that services interact correctly under realistic conditions, catches the class of failures that AI-generated code is most likely to introduce.

Regression coverage should update continuously rather than incrementally. The pace of development with AI assistance creates a situation where code is being added to the codebase faster than manual test authoring can keep up.
If the regression suite is maintained manually, it will always be behind. Coverage needs to grow with the codebase automatically, derived from real usage rather than added by developers who are already stretched by higher output demands. Production behavior should feed back into test validation. Closing the loop between how the system behaves in production and what the regression suite is testing is one of the most important shifts a team can make. When tests are derived from actual production traffic rather than written specifications, the mock drift problem largely disappears because the tests reflect what services actually do, not what developers assumed they would do. The Counter-Intuitive Conclusion There is a temptation to see AI-generated code and automated testing as solving the same problem from different angles. If AI can generate both the code and the tests, the reasoning goes, maybe the coverage problem solves itself. It does not. An AI that generates code and then generates tests for that code is essentially testing its own assumptions about how the code should behave. It will consistently produce tests that pass against the code it wrote, and those tests will systematically miss the gap between what the AI thought the code should do and what the system actually needs to do under production conditions. The gap between AI intent and production reality is exactly where regression testing has always been most valuable. AI-generated code makes that gap wider, not narrower, because the code is being written by something with no production experience at all. The teams that treat AI coding assistance as a reason to invest less in regression testing will eventually face production incidents that trace directly to this decision. 
The teams that treat it as a reason to invest more, particularly in coverage grounded in real system behavior rather than written specifications, will find that AI assistance genuinely accelerates development without accumulating the hidden quality debt that comes with uncovered integration failures. The Bottom Line Regression testing was never just a safety net. It is the mechanism by which a team validates that their understanding of the system matches how the system actually behaves. When AI is generating the code, that validation matters more than ever, because the code is now written by something that has never seen your system run. Invest accordingly.
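The mock drift described earlier lends itself to a mechanical check. As a minimal sketch, assuming a team already records sanitized production responses per scenario, the TypeScript below reduces each payload to its shape (field names and primitive types) and reports every scenario where the mock and the recorded response disagree. All names and payloads here (`shapeOf`, `findMockDrift`, the scenario keys, the decline responses) are hypothetical illustrations, not taken from any real payment API.

```typescript
// Hypothetical JSON value type used only for this sketch.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type Fixtures = { [scenario: string]: Json };

// Reduce a JSON value to its "shape": field names and primitive types, values erased.
function shapeOf(value: Json): string {
  if (value === null) return "null";
  if (Array.isArray(value)) {
    // Keep each distinct element shape once, sorted so element order does not matter.
    const elems = value.map((v) => shapeOf(v)).filter((s, i, all) => all.indexOf(s) === i).sort();
    return `[${elems.join("|")}]`;
  }
  if (typeof value === "object") {
    const fields = Object.keys(value).sort().map((k) => `${k}:${shapeOf(value[k])}`);
    return `{${fields.join(",")}}`;
  }
  return typeof value;
}

interface Drift {
  scenario: string;
  mockShape: string;
  recordedShape: string;
}

// Compare every recorded scenario against its mock and collect the mismatches.
function findMockDrift(mocks: Fixtures, recorded: Fixtures): Drift[] {
  const drifts: Drift[] = [];
  for (const scenario of Object.keys(recorded)) {
    const recordedShape = shapeOf(recorded[scenario]);
    const mockShape = scenario in mocks ? shapeOf(mocks[scenario]) : "(no mock)";
    if (mockShape !== recordedShape) drifts.push({ scenario, mockShape, recordedShape });
  }
  return drifts;
}

// Illustrative data: the mock assumes both declines share one shape; production does not.
const mocks: Fixtures = {
  declined_insufficient_funds: { status: "declined", reason: "insufficient_funds" },
  declined_card_expired: { status: "declined", reason: "card_expired" },
};
const recorded: Fixtures = {
  declined_insufficient_funds: { status: "declined", reason: "insufficient_funds" },
  declined_card_expired: { status: "declined", error: { code: "card_expired", retryable: false } },
};

const drifts = findMockDrift(mocks, recorded);
console.log(drifts);
```

A shape comparison only catches structural drift. It says nothing about changed values or timing, so it complements traffic-derived regression tests rather than replacing them.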
Every React developer reaches a point where the sheer volume of boilerplate starts to slow them down. Prop drilling, repetitive hook patterns, component scaffolding, unit test setup: the cognitive overhead adds up fast, especially at enterprise scale. When GitHub Copilot entered my workflow, I expected a productivity boost. What I didn't expect was how much I'd have to think about using it correctly.

After integrating AI-assisted development into a React 18 codebase (spanning custom hooks, context-based state management, and accessibility-driven UI), I came away with a clear picture of where AI genuinely accelerates the work, where it quietly introduces risk, and what guardrails every team needs before they ship AI-assisted code to production.

This isn't a tutorial on setting up Copilot. It's an honest account of what changed in my day-to-day React workflow, and how I rebuilt my development process around the strengths of AI without surrendering architectural judgment.

Where AI Actually Accelerates React Development

1. Component Scaffolding

The most immediate win was generating boilerplate-heavy component shells. React functional components follow a predictable structure: imports, props interface, state declarations, effect hooks, render return. Copilot autocompletes this structure accurately and fast, especially when your file already has consistent patterns.

For example, starting a new form component with a comment like:

    // Controlled form component with validation and submit handler

triggers a usable scaffold within seconds. In a codebase with 50+ form components, this adds up to meaningful time savings.

2. TypeScript Prop Typing

One of the most tedious parts of React 18 development is defining interface types for component props, especially for components consuming API response shapes. Copilot handles this well when the API shape is already defined elsewhere in the file or project. It infers prop types from usage context and generates clean interfaces without much guidance.

3. Unit Test Generation

Copilot shines at generating @testing-library/react test cases for presentational components. Given a component file, it can suggest:

- Render tests
- User interaction tests (click, input change)
- Accessibility checks using getByRole

This reduced the time I spent on repetitive test scaffolding by roughly 40% for simple components.

4. Repetitive Hook Patterns

Standard hooks like useEffect with cleanup, useCallback with dependency arrays, and useMemo for expensive computations follow well-known patterns. Copilot autocompletes these reliably, and the suggestions are often correct on the first try when the surrounding context is clear.

Where AI Fails React Developers (and Why It Matters)

This is the part most AI-workflow articles skip. In my experience, Copilot introduced subtle issues in three specific areas:

1. State Management Architecture

Copilot is pattern-matching, not reasoning. When I was designing a context-based global state solution for a multi-step form flow, Copilot consistently suggested patterns that worked for isolated examples but didn't scale: it created redundant useContext calls across components that should have been wrapped in a provider, and it failed to account for re-render performance implications.

The lesson: Never accept AI suggestions for state architecture without reviewing the component tree. AI optimizes locally; architecture requires global thinking.

2. Custom Hook Dependency Arrays

Incorrect dependency arrays in useEffect and useCallback are a well-known React footgun. Copilot's suggestions here were hit-or-miss. It occasionally omitted dependencies that needed to be included, which leaves stale values captured in the closure, and it occasionally included values that did not belong, which re-runs effects and triggers unnecessary re-renders.

I started treating all AI-generated dependency arrays as drafts that required manual review against the ESLint react-hooks/exhaustive-deps rule. This step is non-negotiable.

3. Accessibility in JSX

This one is subtle. Copilot generates functional JSX, but accessible JSX requires deliberate attention to ARIA roles, focus management, and semantic HTML. AI-generated components often defaulted to div-heavy markup without the aria-* attributes or keyboard event handlers that production apps require.

For any component touching user interaction (modals, dropdowns, form controls), I reviewed AI-generated output against WCAG 2.1 AA standards before committing.

My Rebuilt Workflow: A Practical Stack

After months of iteration, here's the workflow that works:

Phase 1: Design First, Prompt Second

Before I open a new file, I sketch the component's responsibilities on paper or in a comment block:

JavaScript

    /**
     * UserProfileCard
     * - Displays user avatar, name, role
     * - Supports edit mode toggle
     * - Emits onSave callback with updated values
     * - Must be keyboard accessible
     */

This comment becomes the Copilot context. The more specific the intent, the better the scaffold.

Phase 2: Accept Scaffolding, Write Logic

I accept Copilot suggestions for:

- Component shell
- Prop interface
- State variable declarations
- JSX structure for simple layouts

I write manually:

- useEffect logic and cleanup
- Event handler implementations
- Context provider design
- Error boundaries
- Any business logic touching API data

Phase 3: Review AI-Generated Tests

Copilot generates test scaffolding well. I review every generated test for:

- Correct use of userEvent vs fireEvent
- Accurate assertions (not just "it rendered")
- Missing edge cases (empty state, error state, loading state)

Phase 4: Accessibility Audit Pass

Every component gets a final pass against:

- Semantic HTML element usage
- aria-label / aria-describedby for interactive elements
- Keyboard navigation (tab order, focus trap for modals)
- Color contrast (handled at design system level, not component level)

A Real Before-and-After Example

Before (pre-AI workflow): A controlled input component with validation took roughly 25–30 minutes to scaffold, type, test, and review.

After (AI-augmented workflow): The same component takes 10–12 minutes, with Copilot handling the initial scaffold and test shell, and me handling the validation logic, hook dependencies, and accessibility pass.

Here's a simplified example of the kind of component where AI delivers the most value:

TypeScript

    import React from "react";

    // Props for a controlled search box; the parent owns the value.
    interface SearchInputProps {
      value: string;
      onChange: (value: string) => void;
      onSubmit: () => void;
      placeholder?: string;
      isLoading?: boolean;
    }

    const SearchInput: React.FC<SearchInputProps> = ({
      value,
      onChange,
      onSubmit,
      placeholder = "Search...",
      isLoading = false,
    }) => {
      // Submit on Enter so the control is usable without a mouse.
      const handleKeyDown = (e: React.KeyboardEvent<HTMLInputElement>) => {
        if (e.key === "Enter") onSubmit();
      };

      return (
        <div role="search">
          <input
            type="search"
            value={value}
            onChange={(e) => onChange(e.target.value)}
            onKeyDown={handleKeyDown}
            placeholder={placeholder}
            aria-label="Search"
            disabled={isLoading}
          />
          <button onClick={onSubmit} disabled={isLoading} aria-label="Submit search">
            {isLoading ? "Searching..." : "Search"}
          </button>
        </div>
      );
    };

The scaffold, prop interface, and JSX structure above were AI-generated in under 30 seconds. The aria-label attributes, role="search", and handleKeyDown implementation were my additions: things Copilot consistently missed in initial suggestions.
Where AI Hits a Wall: Large-Scale Enterprise React Projects

Small, isolated components are where AI shines. But real enterprise codebases are rarely small or isolated. Once you're working inside a large monorepo with hundreds of components, shared design systems, domain-specific business logic, and cross-team API contracts, AI-assisted development runs into a fundamental limitation: it only sees what's in its context window.

Here's where that breaks down in practice:

1. Cross-File Dependency Awareness

In a large React application, a single component may depend on a shared context provider defined four directories away, a utility hook maintained by a different team, and a TypeScript type exported from a core domain package. Copilot's autocomplete works within the file you're editing; it doesn't have a deep understanding of the full dependency graph.

The result: AI-generated code that compiles locally but breaks at integration because it assumes a prop shape, import path, or context value that doesn't match what actually exists in the broader system. I've seen this surface most often with shared form validation schemas and API response types that live outside the component's immediate file tree.

2. Institutional Knowledge and Business Logic

Enterprise React codebases carry years of intentional decisions that aren't documented anywhere in the code; they live in the heads of the team. Why is this particular component wrapped in a custom error boundary? Why does this dropdown use a local state copy instead of reading directly from context? Why is this API called twice?

Copilot has no way of knowing. When it generates code in these areas, it produces something that looks reasonable but violates the implicit contract the team has built over time. Catching these violations requires a senior developer who understands the why behind the existing patterns. AI cannot substitute for that.

3. Design System Consistency at Scale

Large teams typically maintain a shared component library: think an internal fork of Material UI or a custom design system. AI tools don't know which internal components to reach for. Copilot frequently suggests raw HTML elements or third-party components when the project has established internal equivalents: <Button> from your design system instead of <button>, <TextInput> from your library instead of a raw <input>.

At scale, this creates design debt fast. Every AI-generated component that uses a raw HTML element instead of the design system equivalent is a component that diverges from your visual and behavioral standards, and it accumulates technical debt that's expensive to audit later.

4. Performance Optimization in Complex Component Trees

React 18 introduced useDeferredValue, useTransition, and concurrent rendering features specifically to handle performance in large, deeply nested component trees. These are nuanced APIs: their correct usage depends on understanding the rendering priority of specific subtrees, which operations are expensive, and what the user experience should be during transitions.

Copilot-generated code in this area is almost always naive. It doesn't know that a particular list component renders 500+ items and needs virtualization. It doesn't know that a specific state update should be wrapped in startTransition to keep the UI responsive. Optimizing a large React application for performance remains deeply human work.

5. Multi-Team Merge Conflicts and Shared State

In enterprise projects with multiple teams contributing to the same React codebase, shared state management becomes politically and technically complex. Redux slices, Zustand stores, or React Query caches span team boundaries. AI tools can suggest changes to these shared structures without awareness of how other teams depend on them, leading to breakages that only surface in integration environments.
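The design-system point above (item 3) is one of the few in this list that can be partly automated. As a rough sketch, assuming a team keeps a mapping from raw HTML elements to its internal components, the TypeScript below scans component source line by line and flags raw elements that have a design-system equivalent. The mapping, the function name `findRawElements`, and the sample source are all hypothetical, and the regular expression is deliberately naive. A real project would more likely enforce this with an AST-based lint rule such as `react/forbid-elements` from eslint-plugin-react.

```typescript
// Hypothetical mapping from raw HTML elements to a team's design-system components.
const designSystemEquivalents: { [rawTag: string]: string } = {
  button: "Button",
  input: "TextInput",
};

interface RawElementFinding {
  line: number;       // 1-based line number in the scanned source
  rawTag: string;     // the raw HTML element that was used
  suggestion: string; // the design-system component to use instead
}

// Naive scan: in JSX, lowercase tag names are HTML elements and capitalized ones are components.
function findRawElements(source: string): RawElementFinding[] {
  const findings: RawElementFinding[] = [];
  source.split("\n").forEach((text, index) => {
    const tagPattern = /<([a-z][a-z0-9]*)\b/g;
    let match: RegExpExecArray | null;
    while ((match = tagPattern.exec(text)) !== null) {
      const rawTag = match[1];
      if (Object.prototype.hasOwnProperty.call(designSystemEquivalents, rawTag)) {
        findings.push({ line: index + 1, rawTag, suggestion: designSystemEquivalents[rawTag] });
      }
    }
  });
  return findings;
}

// Illustrative component source, held as a string so the sketch has no React dependency.
const sampleSource = [
  "export const Row = () => (",
  "  <div>",
  '    <TextInput name="q" />',
  "    <button onClick={save}>Save</button>",
  "  </div>",
  ");",
].join("\n");

const findings = findRawElements(sampleSource);
console.log(findings);
```

Running a check like this in CI turns "reviewers should notice raw elements" into a failing build, which matters more as the share of AI-generated components grows.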
The practical takeaway: the larger and more interconnected the codebase, the more you need to treat AI as a localized assistant, not a system-aware collaborator. Use it to accelerate work on leaf-node components and isolated utilities. Treat any AI suggestion that touches shared state, cross-team APIs, or core infrastructure with the same scrutiny you'd give an external contributor who just joined the project.

If you're introducing AI-assisted development into a React team, here are the non-negotiables:

1. Never merge AI-generated code without lint and type checks passing. Run eslint, tsc --noEmit, and your test suite before treating any AI-generated file as complete.

2. Establish a "no AI for architecture" rule. Component tree design, context structure, routing decisions, and data fetching strategy should be human-driven. AI is a code accelerator, not an architect.

3. Code review AI-generated PRs with extra scrutiny. Reviewers should specifically look for: missing hook dependencies, over-broad useEffect triggers, missing accessibility attributes, and logic that "looks right" but doesn't account for edge cases.

4. Document what AI touched. Some teams are beginning to tag AI-assisted code in commit messages or comments. This creates accountability and helps reviewers calibrate their scrutiny.

5. Keep your feedback loop active. When Copilot generates something wrong, reject it explicitly rather than accepting and editing. This helps calibrate your own pattern recognition for what AI does and doesn't handle well.

What's Coming Next: Agentic React Workflows

The current state of AI in React development is assistive: it completes what you start. The next wave is agentic: AI agents that can take a design spec or Figma export, scaffold an entire component hierarchy, wire up state, and generate test coverage, with a human reviewing the output rather than writing it line by line. Early tools like Cursor's Composer mode and experimental GitHub Copilot Workspace are beginning to move in this direction.

For React developers, the implication is a shift in the skill that matters most: from writing components quickly to reviewing and evaluating AI-generated component systems critically. The developers who will thrive in this environment are those who deeply understand React's rendering model, state management tradeoffs, and accessibility requirements, not because they're writing every line, but because they're the final judgment layer on what ships.

Conclusion

AI-augmented development isn't about replacing React expertise; it's about redirecting it. The hours saved on scaffolding and boilerplate are hours you can reinvest in architecture, performance, accessibility, and code quality.

The key insight from rebuilding my workflow around GitHub Copilot is this: AI is a force multiplier for what you already know well. If you understand React deeply, it makes you faster. If you're still learning React's mental model, it can quietly introduce patterns that seem right but aren't.

Used with clear guardrails and deliberate review habits, AI turns a good React developer into a significantly more productive one, without sacrificing the code quality that enterprise applications demand.
TL;DR: The AI Delegation Audit

Scrum teams inspect how the last Sprint went during the Retrospective. They are much less likely to inspect the work they have handed to AI, because no meeting on the calendar owns it. That gap is where a working AI automation quietly turns into risk: it keeps producing fluent, on-brand output long after the decision to trust it has expired. The AI Delegation Audit closes the gap by leveraging the facilitation skills teams already use in a Retrospective.

Thesis: The Delegation Audit is the missing inspection cadence for delegated AI work. It checks four things: whether the work still meets the standard, whether the model still fits the task, whether the team can still stop the automation, and whether reviewed assistance has quietly become unreviewed automation. You can try it on one workflow in fifteen minutes.

The Automation That Looked Healthy

A product team automates its Friday stakeholder update in March. The setup is careful: the model drafts from the Jira board, the workflow owner reviews the draft, and it ships. For three months it works.

In June, the same automation tells an enterprise prospect that a security feature is in production. No application code changed, and nobody touched the prompt. But the system around the automation had shifted: a descoped feature, a stale ticket title that survived in the product backlog, and a change in model behavior combined into a false update.

The dangerous part was not a visible failure: the automation kept producing fluent, plausible, on-brand updates, which is exactly what made the degradation hard to notice.

That points to the belief worth naming first: a workflow that still produces output is assumed to be still fully functioning. A working automation is not evidence that the delegation behind it is still valid, and validating it once, at setup, is not the same as keeping it valid.
What the Delegation Audit Is

The Delegation Audit of the A3 Framework borrows the facilitation pattern of a Retrospective, not the Scrum event itself. Instead of how the team worked, it examines how the team’s AI delegations are holding up: 45 to 60 minutes, monthly or every other Sprint, with a named owner and a slot on the calendar.

In the A3 Framework, this is what the Automate category has always required. The moment you trust work to run with little or no human review, you owe it explicit rules and a recurring audit. Most teams adopt the rules and skip the audit because no one owns it. The Delegation Audit is that meeting, and it is the Inspect step of the AI Delegation Lifecycle.

The name is deliberate: nobody in finance, security, or operations needs an agile glossary to understand what a delegation audit is or why a team runs one. The practice underneath is familiar: gather data, surface what changed, turn findings into decisions, and leave with owners.

The Four Checks

Each check inspects one way a delegation degrades after it goes live:

- Output and source drift: Does the work still meet its AI Definition of Done, and are the inputs still fit for use? Pull three recent outputs per workflow and trace each one back to its sources. Model updates change output quality in both directions without notice, and the inputs move along with them: stale records, changed permissions, and archived data that the model cannot tell from current facts. A polished summary built on stale data is still a failed delegation.
- Model fit: Is the assigned model still the right one? Look in both directions: a cheaper tier that no longer meets the standard, and a frontier model burning budget on work that a mid-tier now handles. The test is whether the model is sufficient for this task at this risk level, not whether it is the most capable one available. If your team runs a routing policy, this check feeds into it, and the cost side has its own treatment in token economics.
- Reversibility: Could you stop each automation today? Test the stop rules from your handoff: who pulls the plug, how fast, and whether that person still works here. An automation without a reachable owner is not delegated; it is abandoned, and it is now a risk.
- Category creep: Which Assist work has become unreviewed Automate? Watch for the tell: review time per output trending toward zero. When a human approves a draft in 4 seconds, that is not review, and the work changed its A3 category without anyone deciding. Name it, then choose: promote it to Automate properly, with rules and a stop rule, or restore genuine review.

Run It Like a Retrospective

The agenda fits 60 minutes and will feel familiar:

- Data walk (10 min): Put the delegation inventory on the wall: every automated and assisted workflow, its A3 category, its model tier, its last audit date. Add usage or spend data if you have it. Look first, discuss later.
- Run the four checks in pairs (20 min): Assign workflows to pairs. Each pair runs all four checks on its workflows and marks each finding pass, drift, or fail.
- Re-classify (15 min): Walk through the findings. Every drift or fail gets a decision: change the A3 category, change the tier, update the AI Definition of Done, fix the stop rule, or retire the delegation. Retiring an automation that no longer earns its audit cost is a successful outcome of the meeting.
- Decisions and owners (10 min): Each decision gets a name and a date. A finding without an owner is one you will rediscover next time; don’t create waste.
- Close the record (5 min): Update the log: what moved, why, and who decided.

Why Inspection Stopped Being Optional

Two forces make a standing audit necessary now:

The first is the models: they update on the vendor’s schedule, not yours. A change to how a model summarizes, refuses, or formats can move output quality with no signal on your side. An automation you validated once is running on assumptions that have quietly expired.

The second is accountability: NIST organizes AI risk management around four functions: govern, map, measure, and manage. Inspection is the measure-and-manage half, and a team that only governs and maps has stopped before the work becomes operational.

Set-and-forget is the default, and it compounds unseen until a drifted output becomes an incident in front of the wrong audience.

The Record You Get for Free

Each audit updates a dated log: workflow, owner, model tier, last checked output, drift finding, decision, and follow-up date. Stack those logs, and you have an inspection trail: evidence that your team’s AI adoption is controlled rather than assumed.

When a stakeholder, for example, a prospect’s procurement team, asks how you govern your internal AI use, that trail is half the answer, and you wrote none of it as a separate report. It came out of one recurring meeting.

What to Do in Your Next Retrospective

Do not schedule a new event yet. Take one delegated workflow, the one that would embarrass you most if it drifted, and spend fifteen minutes of your next Retrospective running the four checks on it out loud: output and source, model fit, reversibility, category creep. You will probably find at least one answer that amounts to “nobody has looked since we set this up.” That single finding is enough to put the audit on the calendar.

Conclusion

A Retrospective keeps a team honest about how it works together. The Delegation Audit extends that same facilitation habit to the work the team handed to a model, where an automation can look healthy long after the decision to trust it has expired.

When did your team last inspect an automation it trusts, and what would the four checks find if you ran them this week?

Key Questions This Article Answers

What Is a Delegation Audit?

A Delegation Audit is a recurring 45- to 60-minute inspection of a team’s delegated AI work, run monthly or every other Sprint. It checks whether automated and AI-assisted workflows still meet the team’s standard, using the facilitation skills of a Retrospective. It is the Inspect step of the AI Delegation Lifecycle.

What Does a Delegation Audit Check?

Four things:

- Output and source drift (Does the work still meet its AI Definition of Done, and are the inputs still trustworthy?),
- model fit (Is the assigned model still the right one for the task and its risk level?),
- reversibility (Can you stop the automation today?), and
- category creep (Has Assist work become unreviewed Automate?).

How Is a Delegation Audit Different From a Retrospective?

Same skill, different subject. A Retrospective inspects how the team worked together. A Delegation Audit inspects how the team’s AI delegations are holding up, then turns each drift finding into a decision with an owner and a date.
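The dated log and the category-creep tell described in this article map naturally onto a small data structure. The TypeScript below is one possible sketch, not part of the A3 Framework itself: the field names follow the log columns listed under "The Record You Get for Free", while the `hasCategoryCreep` helper, its 30-second threshold, and the sample entry are assumptions made purely for illustration.

```typescript
// Only the two A3 categories named in the article are modeled here.
type A3Category = "Assist" | "Automate";
type Finding = "pass" | "drift" | "fail";

// One row of the audit log; fields mirror the columns listed in the article.
interface AuditLogEntry {
  workflow: string;
  owner: string;
  category: A3Category;
  modelTier: string;
  lastCheckedOutput: string;
  finding: Finding;
  decision: string;
  followUpDate: string; // ISO date, e.g. "2025-07-04"
}

// Median rather than mean, so one unusually slow review does not hide the pattern.
function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Category creep: Assist work whose typical review time has collapsed toward zero.
// minReviewSeconds is an assumed, team-chosen threshold, not a value from the article.
function hasCategoryCreep(category: A3Category, reviewSeconds: number[], minReviewSeconds: number): boolean {
  if (category !== "Assist" || reviewSeconds.length === 0) return false;
  return median(reviewSeconds) < minReviewSeconds;
}

// Every drift or fail needs a decision with an owner and a date.
function needsDecision(entries: AuditLogEntry[]): AuditLogEntry[] {
  return entries.filter((entry) => entry.finding !== "pass");
}

// Illustrative entry based on the Friday stakeholder update scenario.
const log: AuditLogEntry[] = [
  {
    workflow: "Friday stakeholder update",
    owner: "workflow owner",
    category: "Assist",
    modelTier: "mid-tier",
    lastCheckedOutput: "June update",
    finding: "drift",
    decision: "restore genuine review",
    followUpDate: "2025-07-04",
  },
];

const creeping = hasCategoryCreep("Assist", [4, 3, 6, 5, 90], 30);
const open = needsDecision(log);
console.log(creeping, open.length);
```

Keeping the log as structured data rather than meeting notes means the inspection trail can be filtered by owner or follow-up date when someone outside the team asks for it.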
A few weeks ago, I read a line from Boris Cherny, the person behind Claude Code, that stuck with me. He said he does not prompt Claude anymore. He has loops running, and those loops are the ones prompting Claude and deciding what to do next.

I sat with that for a while. For two years, every guide on working with AI agents told us to get better at writing instructions. Then it told us to get better at feeding the model the right information. Then it told us to build proper scaffolding around the agent so that it behaves like trustworthy software. Now there is a fourth layer, and it is less about talking to the agent and more about building a small system that talks to the agent for you. People are calling it loop engineering.

This piece walks through how we got here, what loop engineering actually means in plain terms, and where it fits next to the three ideas that came before it: prompt engineering, context engineering, and harness engineering. I have written about the first two before, so I will link back to those pieces where it helps rather than repeat myself.

How we got here: Journey so far

Each layer did not replace the one before it. It sat on top of it. You still write prompts. You still manage context. You just stopped being the one doing it by hand for every single turn.
Here is the short version, side by side, before we go into each one in detail:

| Layer | What you are actually doing | Where the skill lives | Good fit for | Weak fit for | What breaks if you skip it |
| --- | --- | --- | --- | --- | --- |
| Prompt engineering | Writing clear instructions for one turn | The words in the message | One-off questions, demos, quick scripts, learning a new model | Repeated workflows, anything that touches production data | Inconsistent answers, works once and fails the next ten times |
| Context engineering | Choosing what the model gets to see before it answers | The data, documents, and tool outputs around the prompt | RAG systems, chatbots over live data, anything where the model needs facts it was not trained on | Tasks where the prompt alone is already enough | Confident, well-written, wrong answers |
| Harness engineering | Building the scaffolding that checks the agent's work | The system around one agent run: tools, evals, guardrails, logs | Coding agents, multi-step automations, anything running without a human reading every step | Simple single-call requests with no follow-up actions | Agents that quietly do the wrong thing and nobody notices until later |
| Loop engineering | Designing the system that decides what to run next, on its own | The control loop sitting above the agent and its harness | Long-running or unattended work, backlog grooming, overnight batches, anything with a clear goal and a way to prove it is done | Tasks with a vague goal or no real way to check success | Loops that run for hours, spend budget, and produce work nobody asked for |

This table is the map for the rest of the piece. Each section below goes one layer deeper.

Prompt Engineering: Where It All Started

Prompt engineering was the first skill anyone associated with getting good output from a language model. Phrase the question well, give an example or two, ask the model to think step by step, and the answer gets better. It worked, and it still works for one-off tasks. The problem showed up once people tried to run the same model on a real workflow, again and again, across different inputs. A clever sentence that worked once does not hold up across a thousand runs with messy, real-world data.

The short version is this: prompt engineering is great for experiments and demos, but production systems need something steadier than a well-worded sentence.

Context Engineering: The Model Needed More Than Nice Words

Context engineering showed up once teams realized that the actual bottleneck was rarely the phrasing. It was what the model could see. A model with a perfect prompt and no access to the right document, the right database row, or the right tool output will still guess wrong.

Tobi Lütke from Shopify put it simply: context engineering is the art of giving the model everything it needs so the task is actually solvable. Andrej Karpathy described it as the careful science of filling the context window with exactly the right information for the next step, not more, not less. By late 2025, prompt engineering had become something people did inside context engineering rather than a separate skill on its own.

Harness Engineering: Making One Agent Run Trustworthy

Once teams started letting agents take multiple steps on their own, a new problem appeared. The agent might write code, run a test, look at the failure, and try again. That loop within a single task needed rules. What tools can it call? What happens if it gets stuck? How do you stop it from quietly making the wrong change to a file it was never supposed to touch?

This is harness engineering, and I spent a full piece on it earlier this year in "From Prompts to Harnesses: How AI Engineering Has Grown Up." The short version: prompt engineering got the conversation started, context engineering made the answers consistent, and harness engineering is what actually makes an agent safe to run in production, because it stops depending on the model behaving well and starts depending on a system around the model that checks its work.

Think of the harness as the seatbelt, the dashboard, and the guardrails for one agent doing one job. It covers things like giving the agent a clean view of the repository, exposing the right API contracts, watching logs and live CI status as part of its context, and building eval gates that catch bad output before it ships.

I went deeper into one well-known pattern for this layer in "The Twelve-Factor Agents: Building Production-Ready LLM Applications," borrowing from the original Twelve-Factor App rules and adapting them for agents: one clear purpose per agent, explicit dependencies, and a strong separation between business logic and execution state. I also looked at the architectural side of this in "AI Agent Architectures: Patterns, Applications, and Implementation Guide," where orchestrator-worker setups and blackboard-style coordination turn out to be different ways of answering the same question a harness has to answer: who is allowed to do what, and who checks the result.
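The three harness questions raised above (which tools the agent may call, what happens when it gets stuck, and how to keep it away from files it should not touch) can each be expressed as a rule that ordinary code enforces before a tool call runs. The TypeScript below is a minimal sketch under that assumption. Every name in it (`HarnessPolicy`, `checkCall`, the tool names, the paths) is hypothetical and does not correspond to the API of any real agent framework, and the prefix check ignores path normalization for brevity.

```typescript
// A tool call the agent wants to make. Hypothetical shape for this sketch only.
interface ToolCall {
  tool: string;
  path?: string; // target file, when the tool writes to disk
}

// The rules the harness enforces around one agent run.
interface HarnessPolicy {
  allowedTools: string[];     // which tools the agent may call
  writablePrefixes: string[]; // path prefixes the agent may modify
  maxSteps: number;           // hard stop so a stuck agent cannot run forever
}

type Verdict = { allowed: true } | { allowed: false; reason: string };

// Decide whether a proposed call may run. The model's own judgment is never consulted.
function checkCall(policy: HarnessPolicy, call: ToolCall, stepsSoFar: number): Verdict {
  if (stepsSoFar >= policy.maxSteps) {
    return { allowed: false, reason: "step budget exhausted" };
  }
  if (policy.allowedTools.indexOf(call.tool) === -1) {
    return { allowed: false, reason: `tool not allowed: ${call.tool}` };
  }
  if (call.tool === "write_file") {
    const path = call.path ?? "";
    const writable = policy.writablePrefixes.some((prefix) => path.indexOf(prefix) === 0);
    if (!writable) {
      return { allowed: false, reason: `path not writable: ${path}` };
    }
  }
  return { allowed: true };
}

// Illustrative policy: the agent may edit checkout code and run tests, nothing else.
const examplePolicy: HarnessPolicy = {
  allowedTools: ["read_file", "write_file", "run_tests"],
  writablePrefixes: ["src/checkout/"],
  maxSteps: 20,
};

console.log(checkCall(examplePolicy, { tool: "write_file", path: "src/checkout/cart.ts" }, 3));
console.log(checkCall(examplePolicy, { tool: "write_file", path: "infra/deploy.yaml" }, 3));
```

The design choice worth noting is that the policy is data, separate from the agent prompt, so it can be reviewed and versioned like any other configuration.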
Picking an architecture for the agents inside your harness is its own decision, and it is worth slowing down on, because the wrong pattern for the job tends to show up as flaky behavior that looks like a model problem but is actually a structure problem: ArchitectureHow it worksBest forWatch out forSingle agentOne agent, one prompt loop, one set of toolsSimple, well-bounded tasks like answering support tickets or summarizing a documentFalls apart fast once the task needs more than a handful of stepsOrchestrator-workerA central agent breaks the job into pieces and hands each piece to a specialist agentTasks that can be cleanly split, like "research this, then write this, then format this"The orchestrator becomes a single point of failure if it makes a bad plan early onBlackboardAgents post partial answers to a shared space and pick up work opportunistically, with no central controllerOpen-ended problems where the right order of steps is not known in advance, like diagnosis or researchHarder to debug, since there is no one place that decided what happens nextEvent-drivenAgents react to events as they happen rather than being called in a fixed orderSystems that need to respond to changes in real time, like monitoring or alertingNeeds solid event delivery guarantees, or agents miss things silentlyGraph or loop-basedA loop or graph decides which agent or sub-agent runs next, based on state and results so farLong-running, multi-stage work where the next step depends on what the last step foundThe whole thing is only as reliable as the state tracking and the checks between steps None of this works without the layer underneath it, either. The compute, storage, and serving choices that decide whether a harness can actually run reliably at scale are covered in AI Infrastructure for Agents and LLMs: Options, Tools, and Optimization and its follow-up, AI Infrastructure Guide: Tools, Frameworks, and Architecture Flows. 
And because a harness is still software that needs to be deployed and rolled back like anything else, Infrastructure as Code: How Automation Evolved to Power AI Workloads walks through why pinning model versions in config, the same way you pin a database connection string, has become a basic rule rather than a nice-to-have. I rounded up some of the tools doing this well in Developer Tools That Actually Matter in 2026.

There is also a second, smaller decision inside the harness itself: what kind of check are you actually running at each step? Some checks are deterministic and cheap; others are judgment calls made by another model. Both have a place, and most solid harnesses use a mix:

| Check type | What it means | Example | Speed and cost | Reliability |
| --- | --- | --- | --- | --- |
| Computational | A fixed rule, run by regular code | Unit tests, linters, type checkers, schema validation | Fast and cheap | Very reliable, but only catches what you thought to check for |
| Inferential | Another model judges the output | LLM-as-judge review, an AI code reviewer, a second model checking tone or factual accuracy | Slower and more expensive | Catches fuzzier problems, but can itself be wrong or inconsistent |

For the fuller operational picture, the practices around monitoring, cost control, and incident response for agents running in production are laid out in the Shipping Production-Grade AI Agents refcard, including a simple three-question framework for any agent incident: what was the agent trying to do, what did it actually do, and what state did it change in the outside world.

Harness engineering solved a real problem: it made a single agent run safe enough to trust. But it still assumed a person was sitting there, kicking off the run, reading the result, and deciding what happens next.

Loop Engineering: Stop Being the One Who Presses the Button

This is the part that changed in 2026.
Geoffrey Huntley, earlier in the year, described running a coding agent inside a plain loop in his terminal: give it the same prompt against a written spec, let it pick one task, implement it, then start a fresh copy of the agent and feed it the same prompt again. People started calling this the Ralph technique, the name he gave to the simple while loop running underneath it. It looked almost too basic to matter, but it worked, and it pointed at something bigger.

Loop engineering takes that idea and makes it a discipline. Instead of typing a prompt, reading the answer, and typing the next prompt yourself, you build a small system that does that cycle for you. It checks what work is pending, decides what an agent should try next, hands the task off, checks whether the result actually meets the goal, saves what it learned, and either stops or starts the cycle again. Addy Osmani, who wrote one of the essays that got this term moving, framed it well: loop engineering means replacing yourself as the one who prompts the agent. You design the system that prompts it instead.

[Figure: Loop engineering]

A few things matter a lot once you build one of these:

The goal has to be provable, not just stated. "Make the checkout flow better" gives a loop nothing to check itself against, so it will stop whenever it feels like it has done enough, which is rarely what you wanted. People running these loops for real work have settled on writing out the end state they expect, the proof needed to show it was reached, the rules that cannot be broken along the way, and a hard limit on how long or how much the loop is allowed to run.

The verifier is the actual bottleneck, not the model. A loop is only as good as its ability to tell good work from bad work. If the check at the end of each cycle is weak, the loop will happily mark broken work as done and move on. Most of the engineering effort in a good loop goes into that verification step, not into the prompt that kicks the agent off.
Prompts and context did not disappear; they moved inside the loop. The loop is still writing prompts and assembling context on every cycle. It is just doing it itself, using the same context engineering ideas from earlier, instead of a person typing it fresh each time. This is why I think of loop engineering as sitting on top of the other three rather than replacing any of them.

Not every loop looks the same, either. Once you decide a loop is the right tool, the next choice is how tightly you want to hold its leash:

| Loop style | How it runs | Good for | Risk |
| --- | --- | --- | --- |
| Closed loop, human approves each step | Loop proposes the next action, a person clicks approve | High-stakes changes, early days of trusting a new loop | Slow, defeats some of the point if you approve everything anyway |
| Open loop, runs to a budget or time limit | Loop runs unattended until it hits a turn limit, a cost cap, or finishes the goal | Overnight batches, backlog grooming, well-scoped refactors | Can burn budget on the wrong thing if the goal or verifier is weak |
| Single agent in a plain loop (the Ralph style) | One agent, one spec, fresh instance each cycle, no memory carried forward except what is written to disk | Small, well-defined coding tasks where starting fresh each time avoids the agent confusing itself | Repeats work it has no memory of, and needs a very clear spec to avoid drifting |
| Orchestrated loop with sub-agents | A controlling loop spawns specialized sub-agents for different parts of the task and merges results | Larger goals that naturally split into independent pieces | Coordination overhead, and a weak verifier at the merge step undoes everything underneath it |

Where This Connects to the Bigger Picture

None of these layers lives in isolation. A loop that spawns multiple agents to work on different pieces of a task at the same time needs the kind of coordination covered by multi-agent orchestration work, including how agents talk to tools through the Model Context Protocol and to each other through agent-to-agent protocols.
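The cycle described earlier (check pending work, hand a task to an agent, verify the result, record what happened, then stop or repeat) can be sketched in a few lines. This is only an illustration under stated assumptions: `LoopSpec`, `LoopState`, and `run_loop` are hypothetical names, the agent and verifier are stubs standing in for a real model call and a real check, and cost is counted in arbitrary units per cycle.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopSpec:
    goal: str          # the end state, written so it can be checked
    max_cycles: int    # hard limit on how many cycles may run
    max_cost: float    # hard limit on spend, in arbitrary units


@dataclass
class LoopState:
    cycles: int = 0
    cost: float = 0.0
    done: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)  # what the loop recorded


def run_loop(
    spec: LoopSpec,
    pending: List[str],
    agent: Callable[[str], str],
    verifier: Callable[[str, str], bool],
    cost_per_cycle: float = 1.0,
) -> LoopState:
    state = LoopState()
    queue = list(pending)
    while queue:
        # Stop on budget before starting another cycle.
        over_cycles = state.cycles >= spec.max_cycles
        over_cost = state.cost + cost_per_cycle > spec.max_cost
        if over_cycles or over_cost:
            state.notes.append("stopped: budget exhausted")
            break
        task = queue.pop(0)
        result = agent(task)  # fresh call each cycle, nothing carried in memory
        state.cycles += 1
        state.cost += cost_per_cycle
        if verifier(task, result):
            state.done.append(task)
        else:
            state.notes.append(f"rejected: {task}")
            queue.append(task)  # put it back for a later cycle
    return state


# Stub agent: pretends "add tests" needs two attempts.
attempts = {}


def stub_agent(task: str) -> str:
    attempts[task] = attempts.get(task, 0) + 1
    if task == "add tests" and attempts[task] < 2:
        return "incomplete"
    return f"done:{task}"


def stub_verifier(task: str, result: str) -> bool:
    return result == f"done:{task}"


spec = LoopSpec(goal="all tasks verified", max_cycles=10, max_cost=5.0)
state = run_loop(spec, ["fix lint", "add tests"], stub_agent, stub_verifier)
print(state.done, state.cycles, state.notes)
```

Note where the effort would go in a real version: `stub_verifier` is one line here, but it is the part that decides whether the loop marks broken work as done.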
And every loop running unattended for hours needs the same monitoring, cost tracking, and incident response discipline that harness engineering already worked out, just applied continuously instead of once per run.

There is also a real risk worth naming honestly: people are already calling the overuse of unattended loops "loopmaxxing," where a loop runs for hours, burns budget, and produces a pile of code nobody asked for because the goal was vague and the verifier was weak. A loop is not magic. It is a control system, and like any control system, it is only as good as what you tell it to check for.

Conclusion

I have watched this field rename itself every year or so since I started writing about agents, and each rename has felt a little like marketing at first glance. Loop engineering is different in one respect: it describes a real change in where the human sits in the workflow. We went from writing every prompt by hand, to curating what the model sees, to building safety nets around a single run, and now to designing a small system that runs the whole cycle on our behalf while we go do something else. The job did not get smaller. It got one level removed from the keyboard.

If you found context engineering useful, the next worthwhile habit to build is writing goals that can actually be proven true or false, because that is the one piece that a loop cannot do for itself.

If any of the earlier pieces in this chain are new to you, start with the prompt-to-context shift, then read how that grew into harness engineering in From Prompts to Harnesses: How AI Engineering Has Grown Up, and the production patterns for agents in the Twelve-Factor Agents piece. You can find the rest of what I have written on agents, infrastructure, and developer tools on my DZone author page.
Why Fine-Tune on Databricks?

General-purpose LLMs like Llama 3, Mistral, or Falcon are impressive out of the box, but they underperform on domain-specific tasks: medical coding, legal clause extraction, internal support ticket classification, and financial report summarization. Fine-tuning adapts a pre-trained model's weights to your domain using your proprietary labeled data.

Doing this at scale introduces real engineering challenges:

- Training data lives in Delta Lake across dozens of tables
- GPU clusters need to be orchestrated, not hand-managed
- Experiment tracking must be reproducible and auditable
- Models need a promotion workflow before they touch production traffic

Databricks solves all of this in one platform:

- Apache Spark for large-scale data preparation
- MLflow (built-in) for experiment tracking, model registry, and lineage
- Databricks Model Serving for one-click deployment with auto-scaling
- Unity Catalog for governed model and data access

The ML Lifecycle Architecture

Training Pipeline: End-to-End Flow

The flow below shows how a single training run moves through the system, from a triggered job to a promoted model alias.
Environment Setup

```python
# Databricks Runtime ML 14.x+ recommended (ships CUDA, PyTorch, Transformers)
# Install additional packages in your cluster init script or notebook.
# The 4-bit loading used later also needs bitsandbytes; install it if your runtime does not ship it.
%pip install \
    transformers==4.40.0 \
    peft==0.10.0 \
    trl==0.8.6 \
    accelerate==0.29.3 \
    horovod[spark]==0.28.1 \
    datasets==2.19.0 \
    evaluate==0.4.1 \
    --quiet

dbutils.library.restartPython()

# Run everything below in a new cell: restartPython() resets the Python state.
import os

import mlflow
import mlflow.transformers
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, TaskType
from pyspark.sql import functions as F
from datasets import Dataset

# ── MLflow setup ──────────────────────────────────────────────────────────────
# On Databricks, MLflow tracking URI is pre-configured to the workspace
# mlflow.set_tracking_uri("databricks")  # uncomment for external clusters
# Three-level model names (catalog.schema.model) need the Unity Catalog registry
mlflow.set_registry_uri("databricks-uc")

# Replace the placeholder with your own workspace user folder
EXPERIMENT_NAME = "/Users/<your-user-email>/llm-finetuning/support-classifier"
mlflow.set_experiment(EXPERIMENT_NAME)

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
CATALOG = "prod"
GOLD_DB = f"{CATALOG}.gold"
MODEL_NAME = f"{CATALOG}.ml.support_intent_classifier"  # Unity Catalog model path

print(f"GPU available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
```

Preparing Training Data With Spark

Spark handles the heavy lifting before training: filtering noisy records, formatting prompt-response pairs, and splitting the dataset. This stage runs on the CPU cluster; GPU nodes only spin up for the actual training job.

```python
# ── Spark Data Preparation ────────────────────────────────────────────────────

def build_prompt(message, intent_label):
    """
    Format a support conversation into an instruction-following prompt.
    Uses the Mistral instruct template: [INST] ... [/INST]
    """
    return f"[INST] Classify the intent of this support message:\n\n{message} [/INST] {intent_label}"


# Load from Delta Gold table
raw_df = (
    spark.table(f"{GOLD_DB}.support_conversations")
    .filter(F.col("quality_score") >= 0.85)   # keep high-quality labels only
    .filter(F.col("intent_label").isNotNull())
    .filter(F.length("message") > 20)         # filter empty/stub messages
    .filter(F.length("message") < 2048)       # filter messages too long to tokenize
    .dropDuplicates(["message_hash"])         # remove exact duplicates
    .select("message", "intent_label", "created_date")
    .limit(500_000)                           # cap for this training run
)

print(f"Training candidates: {raw_df.count():,}")

# Build prompt strings using Spark, parallelized across all workers
prompt_udf = F.udf(build_prompt, returnType="string")

prepared_df = (
    raw_df
    .withColumn("prompt", prompt_udf(F.col("message"), F.col("intent_label")))
    .withColumn("token_count", F.size(F.split(F.col("prompt"), r"\s+")))  # rough word count proxy
    .filter(F.col("token_count") < 512)       # stay within model context
    .select("prompt", "token_count", "created_date")
)

# Random split using Spark (seeded; note that randomSplit is not stratified by label)
train_df, val_df, test_df = prepared_df.randomSplit([0.80, 0.10, 0.10], seed=42)

# Persist splits to Delta for lineage + reproducibility
train_df.write.format("delta").mode("overwrite").saveAsTable(f"{GOLD_DB}.llm_train_split")
val_df.write.format("delta").mode("overwrite").saveAsTable(f"{GOLD_DB}.llm_val_split")
test_df.write.format("delta").mode("overwrite").saveAsTable(f"{GOLD_DB}.llm_test_split")

print(f"Train: {train_df.count():,} | Val: {val_df.count():,} | Test: {test_df.count():,}")
```

Fine-Tuning With Hugging Face + MLflow Tracking

We use LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique that freezes the base model and only trains a small set of adapter matrices.
This cuts GPU memory requirements by ~70% compared to full fine-tuning, making 7B parameter models trainable on a single A100. The code below goes one step further and loads the base model in 4-bit precision (QLoRA).

```python
# ── LoRA Fine-Tuning with MLflow Tracking ────────────────────────────────────
from transformers import BitsAndBytesConfig

# Convert Spark DataFrame to Hugging Face Dataset
train_pd = spark.table(f"{GOLD_DB}.llm_train_split").select("prompt").toPandas()
val_pd = spark.table(f"{GOLD_DB}.llm_val_split").select("prompt").toPandas()

hf_train = Dataset.from_pandas(train_pd)
hf_val = Dataset.from_pandas(val_pd)

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, padding_side="right")
tokenizer.pad_token = tokenizer.eos_token


def tokenize(batch):
    return tokenizer(
        batch["prompt"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )


hf_train_tok = hf_train.map(tokenize, batched=True, remove_columns=["prompt"])
hf_val_tok = hf_val.map(tokenize, batched=True, remove_columns=["prompt"])

# Load base model in 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Apply LoRA adapter config
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # rank: higher = more capacity, more memory
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention layers to adapt
    bias="none",
)

model = get_peft_model(base_model, lora_config)
# Reports trainable vs. total parameters. With r=16 on q_proj and v_proj the
# adapters add roughly 6.8M trainable parameters, well under 1% of the model.
model.print_trainable_parameters()

# Training arguments
training_args = TrainingArguments(
    output_dir="/dbfs/tmp/llm-finetune/checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size = 32
    warmup_ratio=0.03,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,                       # use bfloat16 on A100/H100
    logging_steps=50,
    evaluation_strategy="steps",     # renamed to eval_strategy from transformers 4.41 onward
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    report_to="mlflow",              # pipe all metrics to MLflow automatically
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=hf_train_tok,
    eval_dataset=hf_val_tok,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# ── MLflow Run ────────────────────────────────────────────────────────────────
with mlflow.start_run(run_name="mistral-7b-lora-v1") as run:

    # Log hyperparameters manually for full auditability
    mlflow.log_params({
        "base_model": BASE_MODEL,
        "lora_rank": lora_config.r,
        "lora_alpha": lora_config.lora_alpha,
        "lora_dropout": lora_config.lora_dropout,
        "target_modules": str(lora_config.target_modules),
        "quantization": "4-bit QLoRA (nf4)",
        "train_samples": len(hf_train_tok),
        "val_samples": len(hf_val_tok),
        "epochs": training_args.num_train_epochs,
        "effective_batch": training_args.per_device_train_batch_size
        * training_args.gradient_accumulation_steps,
        "learning_rate": training_args.learning_rate,
    })

    # Train: metrics are logged to MLflow via report_to="mlflow"
    trainer.train()

    # Log final eval metrics explicitly
    eval_results = trainer.evaluate()
    mlflow.log_metrics({
        "final_eval_loss": eval_results["eval_loss"],
        "final_eval_perplexity": torch.exp(torch.tensor(eval_results["eval_loss"])).item(),
    })

    # Log the model + tokenizer as a single MLflow artifact
    mlflow.transformers.log_model(
        transformers_model={"model": trainer.model, "tokenizer": tokenizer},
        artifact_path="model",
        task="text-generation",
        registered_model_name=MODEL_NAME,   # auto-registers to Unity Catalog
        metadata={"base_model": BASE_MODEL, "finetuning": "QLoRA"},
    )

    run_id = run.info.run_id
    print(f"Run ID: {run_id}")
    print(f"Eval Loss: {eval_results['eval_loss']:.4f}")
```
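To see why the adapter is so small next to the base model, the trainable-parameter count for this LoRA configuration can be worked out by hand. The sketch below assumes the Mistral-7B dimensions from the published model config (hidden size 4096, 32 layers, 8 key-value heads of dimension 128); the helper name `lora_params` is made up for this example.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter on a (d_in -> d_out) linear layer adds two matrices:
    A with shape (rank, d_in) and B with shape (d_out, rank)."""
    return rank * (d_in + d_out)


# Assumed Mistral-7B dimensions (taken from the published model config)
HIDDEN = 4096
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
RANK = 16  # matches r=16 in the LoraConfig

kv_dim = NUM_KV_HEADS * HEAD_DIM             # 1024: v_proj output is narrower than q_proj
q_proj = lora_params(HIDDEN, HIDDEN, RANK)   # 16 * (4096 + 4096) = 131,072
v_proj = lora_params(HIDDEN, kv_dim, RANK)   # 16 * (4096 + 1024) = 81,920

total = NUM_LAYERS * (q_proj + v_proj)
print(f"Trainable LoRA parameters: {total:,}")  # 6,815,744

# Adapting all four attention projections (q, k, v, o) doubles the count.
k_proj = lora_params(HIDDEN, kv_dim, RANK)
o_proj = lora_params(HIDDEN, HIDDEN, RANK)
total_qkvo = NUM_LAYERS * (q_proj + k_proj + v_proj + o_proj)
print(f"With q, k, v, o adapted: {total_qkvo:,}")  # 13,631,488
```

Either way the adapter is a fraction of a percent of a 7B-parameter model, which is where the memory savings on gradients and optimizer state come from.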
Distributed Training With Horovod on Spark

For datasets beyond a few million tokens, single-node training can take longer than you can afford. Horovod distributes training across multiple GPU workers using ring-allreduce: each worker holds a full model replica, and gradients are averaged across workers after every backward pass. Because every worker still holds the whole model, this shortens wall-clock time but does not lower per-GPU memory; a model that does not fit on one GPU needs a sharding approach such as FSDP or DeepSpeed ZeRO instead.

```python
# ── Distributed Fine-Tuning with Horovod on Spark ────────────────────────────
# Best for: datasets > 5M tokens, or when you need to reduce wall-clock
# training time below a business SLA. The model must still fit on one GPU.
from sparkdl import HorovodRunner


def train_fn(hparams):
    """
    Training function executed on each Horovod worker.
    Each worker trains on a data shard; gradients are averaged across workers.
    Combining Horovod's optimizer with the Hugging Face Trainer is shown here
    as a pattern; validate it on a small run before relying on it.
    """
    # Import inside the function so every worker process has what it needs
    import mlflow
    import torch
    import horovod.torch as hvd
    from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
    from datasets import load_from_disk

    hvd.init()

    local_rank = hvd.local_rank()   # GPU index on this node
    global_rank = hvd.rank()        # unique index across all nodes
    torch.cuda.set_device(local_rank)

    # Load the dataset shard for this worker. Use the global rank here:
    # local ranks repeat on every node, so they would reuse the same shards.
    dataset = load_from_disk(f"/dbfs/tmp/llm-finetune/train_shards/shard_{global_rank}")

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        torch_dtype=torch.bfloat16,
    ).to(f"cuda:{local_rank}")

    # Wrap optimizer with Horovod DistributedOptimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=hparams["lr"])
    optimizer = hvd.DistributedOptimizer(
        optimizer,
        named_parameters=model.named_parameters(),
        compression=hvd.Compression.fp16,   # compress gradient communication
    )

    # Broadcast initial model weights from rank 0 to all workers
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    training_args = TrainingArguments(
        output_dir="/dbfs/tmp/llm-finetune/hvd_output",
        num_train_epochs=hparams["epochs"],
        per_device_train_batch_size=hparams["batch_size"],
        bf16=True,
        no_cuda=False,
        dataloader_num_workers=2,
        # Only rank 0 logs and saves, which avoids duplicated artifacts
        report_to="mlflow" if global_rank == 0 else "none",
        save_strategy="epoch" if global_rank == 0 else "no",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        optimizers=(optimizer, None),
    )
    trainer.train()

    # Only rank 0 registers the model
    if global_rank == 0:
        mlflow.transformers.log_model(
            transformers_model={"model": model, "tokenizer": tokenizer},
            artifact_path="model",
            registered_model_name=MODEL_NAME,
        )


# Launch distributed training across N GPU workers
# np = number of processes = number of GPUs across all nodes
hr = HorovodRunner(np=8, driver_log_verbosity="all")   # 8 GPUs (e.g., 2 × 4-GPU nodes)
hr.run(train_fn, hparams={
    "lr": 2e-5,
    "epochs": 3,
    "batch_size": 2,   # per GPU; effective = 2 × 8 = 16
})
```

MLflow Model Registry and Promotion

Once a run completes, models land in the MLflow Model Registry. Databricks uses Unity Catalog-backed model aliases (candidate, staging, champion) instead of the legacy stage model.
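Before any alias is moved, it helps to decide what "passing the gate" means. The sketch below is one hypothetical way to express that decision in plain Python, using eval loss as the only signal; the function names, the `tolerance` parameter, and the rule itself are assumptions for illustration, and a real gate would likely look at task-level metrics as well.

```python
from typing import Optional

# Alias order taken from the text: candidate, then staging, then champion.
ALIAS_ORDER = ["candidate", "staging", "champion"]


def passes_eval_gate(
    candidate_loss: float,
    champion_loss: Optional[float],
    tolerance: float = 0.0,
) -> bool:
    """Pass if there is no champion yet, or the candidate's eval loss is
    no worse than the champion's by more than `tolerance`."""
    if champion_loss is None:
        return True
    return candidate_loss <= champion_loss + tolerance


def next_alias(current: str) -> Optional[str]:
    """Return the alias that follows `current`, or None at the end of the chain."""
    idx = ALIAS_ORDER.index(current)
    return ALIAS_ORDER[idx + 1] if idx + 1 < len(ALIAS_ORDER) else None


def promote(current: str, candidate_loss: float, champion_loss: Optional[float]) -> str:
    """Return the alias the version should hold after the gate runs."""
    if not passes_eval_gate(candidate_loss, champion_loss):
        return current  # stay put; a person or a retrain has to intervene
    return next_alias(current) or current


print(promote("candidate", candidate_loss=0.41, champion_loss=0.45))
print(promote("candidate", candidate_loss=0.50, champion_loss=0.45))
```

Keeping the decision in a small pure function like this makes it easy to unit test separately from the registry calls that actually move the alias.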
```python
# ── Model Registry Promotion Workflow ─────────────────────────────────────────
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Get the newest registered version. Unity Catalog models do not populate
# latest_versions, so search the versions and take the highest number.
versions = client.search_model_versions(f"name='{MODEL_NAME}'")
latest_version = str(max(int(mv.version) for mv in versions))

# Mark the new version as a candidate for review
client.set_registered_model_alias(
    name=MODEL_NAME,
    alias="candidate",
    version=latest_version,
)

client.set_model_version_tag(
    name=MODEL_NAME,
    version=latest_version,
    key="fine_tuned_on",
    value="gold.support_conversations",
)

client.set_model_version_tag(
    name=MODEL_NAME,
    version=latest_version,
    key="eval_loss",
    value=str(round(eval_results["eval_loss"], 4)),
)

# After human review / automated eval gates pass → promote to staging
client.set_registered_model_alias(
    name=MODEL_NAME,
    alias="staging",
    version=latest_version,
)

# After integration tests pass → promote to champion (production)
client.set_registered_model_alias(
    name=MODEL_NAME,
    alias="champion",
    version=latest_version,
)

# Load model by alias, which decouples code from version numbers
champion_model = mlflow.transformers.load_model(f"models:/{MODEL_NAME}@champion")
```

Serving With Databricks Model Serving

```python
# ── Deploy to Databricks Model Serving ────────────────────────────────────────
# Can also be done via the UI: Models > Serving > Create Endpoint
import json

import requests

WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
TOKEN = dbutils.secrets.get("prod-scope", "databricks-token")

endpoint_config = {
    "name": "support-intent-classifier",
    "config": {
        "served_models": [
            {
                "name": "mistral-7b-lora-champion",
                "model_name": MODEL_NAME,
                "model_version": latest_version,
                "workload_size": "Small",       # concurrency tier, not a GPU count
                "scale_to_zero_enabled": True,
                "workload_type": "GPU_LARGE",   # GPU class; the hardware behind each type varies by cloud
            }
        ],
        "traffic_config": {
            "routes": [
                {"served_model_name": "mistral-7b-lora-champion", "traffic_percentage": 100}
            ]
        },
        "auto_capture_config": {
            "catalog_name": CATALOG,
            "schema_name": "ml",
            "table_name": "support_classifier_inference_log",
            "enabled": True,   # log all requests/responses to Delta
        },
    },
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
    data=json.dumps(endpoint_config),
)
response.raise_for_status()   # fail loudly if the endpoint was not created
print(response.json())


# ── Query the endpoint ────────────────────────────────────────────────────────
def classify_intent(message: str) -> str:
    payload = {
        "inputs": {"prompt": f"[INST] Classify the intent of this support message:\n\n{message} [/INST]"},
        "params": {"max_new_tokens": 50, "temperature": 0.1},
    }
    resp = requests.post(
        f"{WORKSPACE_URL}/serving-endpoints/support-intent-classifier/invocations",
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
        data=json.dumps(payload),
    )
    resp.raise_for_status()
    return resp.json()["predictions"][0]


print(classify_intent("My order hasn't arrived and it's been 10 days"))
# e.g. "shipping_delay"
```

Comparing Fine-Tuning Strategies

| Strategy | GPU Memory | Training Time | Quality vs Full FT | When to Use |
| --- | --- | --- | --- | --- |
| Full Fine-Tuning | Very High (80GB+) | Slowest | Baseline (100%) | Max quality, large budget |
| LoRA | Medium (24–40GB) | Fast | ~95% | Best general-purpose choice |
| QLoRA (4-bit + LoRA) | Low (10–16GB) | Medium | ~90–93% | Single GPU, cost-sensitive |
| Prefix Tuning | Low | Very Fast | ~80–85% | Minimal compute, quick iteration |
| Prompt Tuning | Very Low | Fastest | ~70–80% | Inference-only, no weight change |
| RLHF / DPO | High | Slowest | Best alignment | Instruction-following quality |
| Distillation | Medium (teacher) | Medium | Varies | Smaller, faster inference model |

Rule of thumb: Start with QLoRA on a single GPU. If eval loss stagnates or quality gates fail, move to LoRA on multi-GPU. Full fine-tuning is only warranted when you have >1M high-quality labeled examples and a measurable business case for the incremental quality gain.
Key Takeaways

- Spark handles data at scale before training even begins: filtering, prompt formatting, and splitting across millions of records in minutes.
- LoRA and QLoRA make fine-tuning 7B–13B models accessible on a single A100, reducing memory footprint by ~70% with minimal quality loss.
- Setting report_to="mlflow" gives you automatic experiment tracking with almost no extra code: loss curves, gradient norms, and the learning rate schedule are captured as the run progresses.
- Unity Catalog model aliases (candidate → staging → champion) replace brittle version-number references in deployment code, making promotions and rollbacks a one-liner.
- Auto Capture on Databricks Model Serving logs every inference request and response to a Delta table, giving you a feedback loop to build your next training dataset.
- Horovod on Spark is the right tool when single-node training exceeds your SLA: it leverages your existing Spark cluster without a separate orchestration layer.

References

- Databricks: LLM Fine-Tuning on Databricks
- MLflow: Transformers Flavor Documentation
- Hugging Face PEFT: LoRA & QLoRA
- QLoRA Paper: "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023)
- LoRA Paper: "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)
- Databricks: Model Serving (Foundation Model APIs)
- Horovod on Spark: Official Documentation
- Databricks: HorovodRunner API
- Databricks: Inference Tables (Auto Capture)
- InstructGPT / RLHF: "Training language models to follow instructions with human feedback" (OpenAI, 2022)
She had everything on the list. Eight years of experience. Strong systems design. Distributed architecture under her belt. The panel interview went well: one of the hiring managers later described it as the best technical conversation they'd had with a candidate all quarter.

The team passed on her.

Two weeks later, during a casual conversation with that hiring manager, the reason came out. It wasn't her architectural skills or her communication. It was a question someone had slipped in near the end: "Walk us through how you'd set up an AI-assisted code review pipeline for a team that ships twelve microservices." She described doing it manually. The other finalist described standing up an orchestration layer with context-aware models, configuring fallback thresholds, and building observable feedback loops that trained the team's prompt library over time.

Same job title. Completely different mental model of what the job now involves.

That story isn't unique. It captures something that's been happening gradually over the past eighteen months and then very suddenly in the last six: the senior developer role has quietly split into two jobs. One of them is the job we all trained for. The other is the job that a meaningful portion of your working week now actually requires. And the gap between developers who've accepted that and developers who haven't is becoming very hard to explain away in performance conversations.

The Split That Happened Without a Memo

Let's be specific about what the "AI Systems Architect" half of the role actually means, because people either over-mystify it or undersell it. It doesn't mean you become a data scientist. It doesn't mean you're fine-tuning models or writing PyTorch. Those are real jobs; they're just different jobs.
What it means is something more operational and less glamorous: you are now responsible for designing, maintaining, and improving the systems of AI assistance that your team works inside of, not just the code that the team produces.

That sounds abstract until you break it into daily decisions. Which tasks should be fully AI-generated versus AI-assisted versus AI-reviewed only? Where are your model's blind spots for your specific codebase, and how do you account for them in code review? When a junior developer on your team gets a plausible-but-wrong architectural suggestion from an AI assistant, what's the escalation path? How do you measure the quality of your team's prompting over time? These aren't rhetorical questions. They're operational ones that live teams are answering right now, often badly, because no one assigned anyone to own them.

Senior developers are getting assigned to own them. Not officially. Not with updated job descriptions. Just through the ordinary mechanism of "this problem needs solving, and you're the most experienced technical person in the room."

What "AI Systems Architect" Actually Means Day to Day

The phrase sounds bigger than the practice. What it actually breaks down to is four interconnected responsibilities that are now landing on senior developers, whether they want them or not.

First: workflow design. Someone has to decide which parts of the development cycle use AI assistance, at what level of autonomy, and with what human checkpoints. At most companies, this currently happens by accident: everyone develops their own habits, and nobody compares notes. The developers who are stepping into the architect half of the role are the ones making that deliberate, rather than emergent.

Second: model selection and configuration. Not fine-tuning, but product-level decisions: which models for which tasks, what context window strategy, how to handle codebases that exceed context limits, what fallback behavior looks like.
These are practical engineering decisions that live in the space between "developer tool choice" and "infrastructure decision." They belong to senior engineers.

Third: quality governance. AI-generated code introduces a new failure mode: plausible-looking outputs that are subtly wrong. The patterns of wrongness are specific and learnable. Senior developers who have mapped the failure modes of their AI tooling (the kinds of edge cases it consistently misses, the naming convention assumptions it gets backward, the security patterns it handles confidently and incorrectly) are providing a form of institutional knowledge that is genuinely hard to replace.

Fourth: team prompting culture. This is the one nobody talks about at conferences yet, but engineering managers across the industry have been mentioning it consistently over the past six months: the quality variance in how different team members prompt their AI tools is enormous, and it compounds. Senior developers who build and maintain shared prompt libraries, who do prompt review the way they do code review, who can diagnose why a colleague got a bad output, are operating as a force multiplier for the entire team, not just themselves.

The Job Description Before and After: A Concrete Comparison

This is worth making explicit. Analysis of actual senior engineer job postings (anonymized, from companies between 80 and 1,200 employees) shows a clear shift when comparing what the role requirements looked like in early 2023 versus what's being written now. The change is real and measurable.

The pattern across all of it: the what of the role hasn't changed so much as the how and the governance around it. Senior developers are still responsible for the same categories of work. They're now also responsible for the design of the AI-assisted systems that help a team do that work, and for the failure modes those systems introduce.
The New Core Competency Stack

Here's what the competency model looks like in practice when you lay it out.

The traditional side should feel familiar. The AI architecture side probably contains a few items you haven't formally owned yet, but if you've been doing this job for more than two years and paying attention, you've been building these skills without realizing it.

The Salary Premium Is Already Real

Compensation data lags reality by about eighteen months, so take specific numbers here with appropriate skepticism. What industry reporting suggests is that a clear pattern is emerging: developers who can demonstrably operate in both halves of the new role (not just use AI tools personally, but architect AI-assisted workflows for a team) are commanding a premium that's running somewhere between 18% and 31% above their single-track counterparts at the same years-of-experience mark.

That range is wide. The premium is highest in companies that have recently invested in AI transformation initiatives and learned, the hard way, that "everyone uses Copilot" is not the same as "we have a coherent AI engineering strategy." Those companies are specifically recruiting for systems architect skills because they've already paid for the gap.

How to Build the Second Half of the Job

Nobody teaches this in a course yet. There are some good books and a growing number of blog posts, but the skills are mostly developed through deliberate practice and iteration. Based on teams that have successfully made this transition, here's what works.

The starting point is mapping your team's current AI-assisted work honestly. Not aspirationally, but honestly. Which tasks are you and your team currently doing with AI assistance? Where does the output go without sufficient review? What are the categories of error you've caught, and what categories might you be missing? This audit, done once and updated quarterly, is the foundation of a governance practice.
From there, the most leveraged thing most senior developers can do is build a shared prompt library for their most common task types. Not a personal one — a shared one, with a versioning and review practice attached. The discipline of reviewing a colleague's prompt and explaining why it produced a wrong output is one of the fastest ways to build the mental model you need for the governance half of the role.
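As a rough illustration of what "a shared one, with a versioning and review practice attached" could look like in code, here is a minimal in-memory sketch. The class names, fields, and the rule that a reviewer must differ from the author are all assumptions made for this example; a real team would more likely keep prompts in a repository and reuse its existing review tooling.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class PromptVersion:
    task_type: str
    version: int
    template: str
    author: str
    reviewed_by: str
    known_failure_modes: Tuple[str, ...] = ()


class PromptLibrary:
    """Keeps every published version of a prompt, newest last."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[PromptVersion]] = {}

    def publish(
        self,
        task_type: str,
        template: str,
        author: str,
        reviewed_by: str,
        known_failure_modes: Sequence[str] = (),
    ) -> PromptVersion:
        # Review rule (an assumption of this sketch): no self-approved prompts.
        if not reviewed_by or reviewed_by == author:
            raise ValueError("a prompt needs a reviewer other than its author")
        history = self._entries.setdefault(task_type, [])
        entry = PromptVersion(
            task_type=task_type,
            version=len(history) + 1,
            template=template,
            author=author,
            reviewed_by=reviewed_by,
            known_failure_modes=tuple(known_failure_modes),
        )
        history.append(entry)
        return entry

    def latest(self, task_type: str) -> PromptVersion:
        return self._entries[task_type][-1]

    def history(self, task_type: str) -> List[PromptVersion]:
        return list(self._entries[task_type])


library = PromptLibrary()
library.publish(
    "code-review",
    "Review this diff for correctness and security issues:\n{diff}",
    author="dana",
    reviewed_by="lee",
)
library.publish(
    "code-review",
    "Review this diff. List each issue with file and line:\n{diff}",
    author="lee",
    reviewed_by="dana",
    known_failure_modes=["misses auth checks in middleware"],
)
print(library.latest("code-review").version)
```

Recording known failure modes next to each version is the part that ties the library back to the governance audit: it turns "this prompt sometimes gets it wrong" into something a teammate can read before they rely on it.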