Multi-Agent Software Engineering: One Coding Agent Isn't Enough

Multiple AI agents with clear roles and checks deliver real software better than one agent, but cost more and only suit large tasks.

Jithu Paulose

Ashly Joseph

Jul. 02, 26 · Analysis

Likes (0)

Comment

Save

1.7K Views

Coding agents are good now. They can write a function, fix a failing test, or walk you through a chunk of legacy code you'd rather not read. That part is settled. The harder question is what happens when you hand one a real piece of delivery work, something that has to change the database and the API and the UI and the tests all together, and keeps running long after you've stepped away from your desk.

That's usually where a single agent starts to struggle, and it isn't because the model isn't smart enough. The limit is human attention. A team might have fifty things sitting in its backlog that an agent could help with, but somebody still has to scope each one, keep an eye on it, review what comes back, and confirm it actually works. So you can generate code far faster than before and still ship at about the same pace. The slow part just moved.

Long delivery work is a different animal from a quick coding task. It needs someone to hold the scope steady, keep the architecture consistent from one file to the next, make sure the tests check what the feature is meant to do rather than what the code happens to do, review the result, and hand off cleanly to whatever comes next. Ask one agent to carry all of that in a single context window across a long run, and it tends to drift. You've probably watched it happen: it loses the plot halfway through, writes tests that pass only because they were shaped around the code it just produced, uses one pattern here and a different one three files over, rebuilds something that already existed, and then can't quite tell you what it finished and what it didn't. So you read every diff yourself. The agent writes code, and you're still doing the planning, reviewing, QA, and firefighting. There's a limit to how far that stretches.

From One Agent to a Team

A more workable setup is to stop giving one agent the whole job and split it the way a functioning team already does. One agent plans the work, another builds it, another checks it. Three roles get you most of the way.

Role	Responsibility
Orchestrator	Understands the goal, asks the clarifying questions, writes the plan, sets milestones, and decides how the work is sequenced.
Worker	Implements one feature from clean context and commits it in a controlled way.
Validator	Checks the implementation independently, runs the checks, verifies behavior, and flags follow-up work.

Keeping the building and the checking in different hands matters for the same reason people review each other's code. Whoever wrote it is invested in it working, and that bias is hard to spot from the inside. A fresh agent that had no part in those decisions tends to catch what the author missed.

How Agents Coordinate

Underneath the roles, the agents end up talking to each other in a few recurring ways, and it helps to have names for them.

Delegation is the obvious one, and usually the first that teams build. An agent hands a scoped task to another and waits for the result.

Creator-verifier is the one that matters most for software. One agent writes the code and a separate one, working from its own context, checks it. That separation is what stops an agent from grading its own homework.

Direct communication lets agents talk without a coordinator in the middle. It's tempting and it's fragile, since state scatters across separate conversations and sooner or later somebody acts on something out of date.

Negotiation is what happens when agents share a resource, which for us usually means the codebase. Two agents about to edit the same file have to work out who does what before they overwrite each other.

Broadcast is one agent telling the rest about something that changed, like a new constraint or a failure everyone needs to know about. It's the least exciting of the five, and the one that quietly keeps the long run from falling out of sync.

Define "Done" Before Any Code Gets Written

Settling what "correct" means before anyone writes code does more for reliability than any amount of prompt tuning. It heads off a specific and very common failure. An agent builds a feature, then writes tests that wrap neatly around the feature it just built. Everything passes, coverage looks healthy, and none of it tells you whether the feature does what was actually asked for. Tests written after the code mostly confirm whatever the code already does. They don't find the bugs.

A validation contract flips that order. During planning, before there's any code, you write down what the feature has to do: the behavior that has to exist, the edge cases that matter, the flows that have to work, the regressions you can't allow. A small change might need a handful of those. A big feature can need hundreds, spread across the backend, the API, the front end, and the full end-to-end paths. Each one gets tied to a feature, and a feature isn't finished until it satisfies the ones assigned to it. The effect is that "done" gets defined separately from however the code happens to come out. Workers build against the contract, validators check against it, and you stop relying on whether the code looks right and start measuring whether it works.

Passing Tests Aren't the Same as Working Software

You still want lint, type checks, unit tests, and code review. The trouble is that once an agent is shipping whole features on its own, those checks stop being enough. Plenty of changes pass every unit test and are still broken where it counts. The form renders fine, but the submit button does nothing. The endpoint returns exactly the right shape, filled with stale data. A flow that worked in isolation falls apart once it sits behind a login. A migration runs clean on a laptop and chokes on production-scale data.

So the better systems add a validator that works more like a QA engineer than a linter. It launches the app, clicks around, fills in forms, and confirms the whole path works end to end. That's slow, and on a long task it's where most of the wall-clock time goes: not generating tokens, but waiting on a live application to do something and watching what it does. The trade is worth it, since generating code quickly without really checking it only gets you to the wrong answer faster. In one production run an engineer at Factory described, building a clone of Slack, the project finished with about half its lines of code being tests, and roughly 90% coverage, and the validation step never passed on its first try. That last part is the whole reason the loop exists.

Long Runs Can't Rely on Memory

Run something for hours or days and context starts leaking between the agents. A bigger context window doesn't really fix it. What helps is not letting a worker close out a task by simply announcing it's done. Instead, each worker leaves a written handoff: what it built, which files it touched, which commands it ran and how they exited, what it assumed along the way, what it ran into, and what it left unfinished.

That makes the run auditable. When validation fails, the orchestrator reads back through the handoffs, works out where things went sideways, scopes the fix, and pulls the run back on track at the next milestone instead of discovering the mess at the very end. The teams who make this work don't count on their agents remembering anything; they write enough down that the next agent can safely pick up where the last one stopped. Factory has reported runs lasting as long as sixteen days on this kind of setup.

More Agents Isn't More Throughput

The instinct is to run everything in parallel. Ten agents should mean ten times the work, right? For software, it usually doesn't play out that way. Agents running at the same time tend to edit the same files, redo work that's already done, and make architectural choices that don't line up with each other. The effort of untangling all that eats whatever speed you gained, and you pay for the conflict in tokens on top of it.

What works better is to run the actual changes one at a time and save the parallelism for read-only work, like searching the codebase, reading docs, looking up an API, or reviewing code. On paper that's slower. Over a long task it comes out ahead, because you spend far less time cleaning up conflicts, the handoffs stay cleaner, and the whole thing behaves more predictably. Pile on more agents without coordinating them and you don't get speed so much as a codebase that disagrees with itself.

The Right Model in Each Seat

These systems also change how you pick models, because no single model is the right choice for every seat. Planning tends to go better with a model that reasons slowly and carefully. Writing code rewards speed and fluency instead. Checking the work rewards something closer to stubbornness: following the instructions exactly and giving nothing the benefit of the doubt. The model that writes the best code is often not the one you'd trust to grade it. There's even a case for running the validator on a different provider, so it doesn't carry the same blind spots as the model that wrote the code.

That's the argument for staying model-agnostic. You want to put the right model in each role and swap it out as models get better at particular things, rather than getting stuck with one vendor's weakest area showing up everywhere. It works in the other direction too. A solid scaffolding of contracts, checkpoints, and independent validators can prop up a weaker or open-weight model and get more out of it than it would manage alone. Most of the orchestration in these systems lives in prompts and skills rather than hardcoded logic, which is the reason a new model release tends to make them better instead of obsolete.

The Case for Fewer Agents

Everything up to here makes the case for splitting work across agents, so it's only fair to take the strongest counterargument seriously. In 2025, the team behind Devin put out a post titled "Don't Build Multi-Agents," and the heart of it is hard to dismiss. They argue that most multi-agent failures come down to context getting fragmented. When you fan work out to parallel subagents, each one quietly makes its own assumptions, and those assumptions don't reconcile when the pieces come back together. One subagent picks a naming convention, another picks a different one, and you're left with something that reads as coherent but doesn't actually fit. Their advice is to keep one agent on a single thread and compress the context as it grows instead of spreading it across a crowd of workers.

Anthropic landed somewhere close, though more conditional, when it wrote up its own multi-agent research system around the same time. Splitting work across agents paid off for broad, parallel tasks like searching many sources at once, but it struggled on anything that needed one shared context and tight coordination, which is most of what software work is. Both write-ups end up pointing at the same shape described here. Don't run agents in parallel on tightly coupled work. Split the work by role, and let the coupled parts happen in order.

What the Failure Data Shows

This isn't only field intuition, either. In 2025, a group at Berkeley published a study called "Why Do Multi-Agent LLM Systems Fail?" that went through failure traces from several well-known frameworks and grouped what went wrong. What stood out was where the failures landed. They mostly weren't about the model being too weak. They were about design, with agents given vague roles or ignoring the roles they had; about coordination, with one agent sitting on information another needed or a conversation getting reset partway through; and about verification, with work marked finished that nobody really checked, or a run quitting too early. Those are the same three places this whole architecture tries to shore up, with clear roles, written handoffs, and validators that don't simply take an agent at its word.

There's also hard evidence that giving each worker fresh context is more than tidiness. The "lost in the middle" research found that models pay the most attention to the start and end of their context and the least to whatever sits in the middle. Later work on "context rot" found accuracy slipping as the input gets longer, even on simple lookups. A worker drowning in a long accumulated history is a real, measured liability, not a theoretical one, and handing each worker a clean slate keeps the model working in the range where it's actually reliable.

The Bill Comes Due

It's easy to underestimate what these systems cost. More agents running for longer means a lot more tokens. Anthropic reported that a single agent already burns through several times the tokens of an ordinary chat, and a multi-agent system can use roughly an order of magnitude more on top of that. That only pencils out on work that's worth the spend. Running a multi-agent system to fix a typo is just an expensive way to fix a typo.

A couple of things keep it in check. One is prompt caching. A long run reads the same stable context over and over, the system prompt, the codebase, the plan, and caching that material so it isn't reprocessed every time cuts the bill sharply, which is why anyone running these in production leans on it. The other is the serial discipline from earlier: every conflict you don't create is a repair cycle you don't pay for, and repair cycles are where a lot of tokens quietly disappear. How much these systems cost is mostly a design question, not a billing one.

A Bigger Attack Surface

Security rarely shows up on the architecture diagram, and every agent you add is another door. Even a single agent has a well-known soft spot in prompt injection, where instructions tucked into a web page or a file or a tool's output get read as commands rather than data. Add more agents and the problem grows. A poisoned document that one worker reads can smuggle instructions through a handoff into another worker with more access, or one that touches production directly. The shared state and the messages agents pass around become a channel an attacker can aim at on purpose.

This is the kind of thing you build in from the start, because it's painful to bolt on later. The same controls that keep these systems correct also keep them safer. Validators that won't take an agent's own word for it, handoffs that record exactly which commands ran and what came back, limits on what any single worker is allowed to reach, all of that doubles as containment, so one compromised step can't quietly become a compromised system. The audit trail that helps a run recover from its own mistakes is the same one you'll be glad to have when something goes wrong on purpose.

Where This Leaves the Engineer

None of this puts engineers out of work. It moves the work up a level. Instead of hand-driving every step of an implementation, you spend your time deciding what should get built, what the real constraints are, what counts as correct, which parts of the architecture are worth protecting, and when a human has to sign off. It feels more like running a delivery operation than like chatting with a bot.

And the biggest gain usually isn't speed. It's keeping several streams of work moving at once without quality slipping, and often ending up with a codebase in better shape than when you started, since the tests and checks and handoffs all become part of what ships. The real skill is knowing when to reach for any of this. For a small, contained change, one good agent on a single thread is simpler and cheaper and less likely to wander off. For serious delivery at scale, you need the planning and checking and recovery that a team provides, and the only way agents can do that work is inside the same kind of structure a team uses: real roles, a shared definition of done agreed before anyone starts, honest handoffs, shared state, and execution kept under control rather than just turned up to full speed.

Coding (social sciences) AI

Opinions expressed by DZone contributors are their own.

Related

Trending