The Rise of AI Orchestrators

A reusable multi-agent QA framework where specialist agents debate, a judge decides, and software testing becomes an AI-orchestrated lifecycle.

May. 06, 26 · Analysis


“They have not written a single line of code since December,” and now “only generate code and supervise it.” That is how Spotify co-CEO Gustav Söderström described some of the company’s most senior engineers on Spotify’s Q4 2025 earnings call. He added that the change is “real” and “happening fast.”

That remark matters not because it is provocative, but because it points to a broader shift in engineering work. As code generation becomes faster and cheaper, the differentiator moves upward: toward framing the problem correctly, setting constraints, reviewing outcomes critically, and deciding what is actually ready to ship. Value does not disappear; it relocates from raw production to direction, judgment, and accountability.

Software testing is one of the clearest places to see why this matters. Traditional automation still delivers value, but its weaknesses are familiar: brittle locators, maintenance-heavy UI checks, and imperative logic that struggles when an application changes in ways the script did not anticipate. A small design change can break a test while the underlying business behavior remains correct. At the same time, subtle visual and behavioral issues can still slip through unless the framework is loaded with assertions, snapshots, and comparison rules.

This is where AI orchestration becomes useful in QA. The point is not that automation disappears, and it is not that engineers stop needing technical depth. The point is that the center of effort begins to shift. Instead of investing most of the energy in hardcoding every path by hand, teams can invest more of it in defining intent, expected behavior, quality boundaries, and risk clearly enough for an AI-driven workflow to execute, challenge, and refine.

That is where the idea of the AI orchestrator becomes practical. Not as someone who passively consumes AI output, but as the engineer who structures the work: defining the mission, coordinating specialized agents, reviewing disagreements, and keeping execution aligned with product reality. In that model, the engineer stays fully in the loop, but the role moves upward — from script author to system director.

For quality engineering, this is a meaningful change. Testing has always required more than mechanical execution. It depends on interpretation, prioritization, and judgment. AI does not remove that need; it makes it more central. The better machines become at executing, the more important it becomes for someone to define what should be tested, what matters most, what is risky, and what “correct” actually means.

That is the direction this framework explores: not AI as a novelty layer on top of traditional automation, but AI as an orchestration model for the testing lifecycle itself.

From Research to QA Practice

This framework was influenced by Eric Garcia’s Zenodo publication, N+1 Alignment Dialogue Architecture: Technical Specification for Defensive Publication. The Zenodo record describes a multi-agent system in which N parallel specialist agents are orchestrated by a single Judge agent, with file-based state coordination and structured deliberation rounds designed to drive convergence.

My approach is simpler and more practical. Instead of implementing a general-purpose N+1 consensus engine, I want a QA-oriented orchestration model mapped directly to a real testing workflow. The goal is to preserve the strengths of structured multi-agent review while making the system reusable as a repeatable framework for software quality engineering.

A second research idea fits this design particularly well: Verbalized Sampling. The paper describes it as a simple, training-free prompting strategy in which the model verbalizes a probability distribution over a set of responses rather than collapsing to a single answer.

That is the missing piece for a Judge-based QA workflow. In many agent systems, each subagent returns one answer, and the Judge chooses between fixed opinions. In this framework, each subagent returns a small probability distribution over plausible answers, and the Judge evaluates not only the strongest answer, but also the uncertainty profile, the degree of convergence across agents, and the evidence supporting each option. That turns the review chain from a sequence of opinions into a structured decision process.
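To make the idea concrete, here is a minimal sketch of what one subagent's verbalized distribution could look like as a data structure. The names `Candidate`, `AgentDistribution`, and `parse_distribution` are hypothetical and do not come from the Verbalized Sampling paper or the N+1 specification. The sketch also assumes that probabilities verbalized by a model may not sum exactly to 1, so it normalizes them on parse.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    answer: str         # short label for the position the agent proposes
    probability: float  # verbalized probability, normalized on parse


@dataclass(frozen=True)
class AgentDistribution:
    agent: str
    candidates: tuple[Candidate, ...]

    def top(self) -> Candidate:
        """Return the candidate this agent considers most likely."""
        return max(self.candidates, key=lambda c: c.probability)


def parse_distribution(agent: str, raw: dict[str, float]) -> AgentDistribution:
    """Validate and normalize the probabilities an agent verbalized.

    Assumption: `raw` maps candidate labels to non-negative numbers that
    may not sum exactly to 1, so they are rescaled here.
    """
    if not raw:
        raise ValueError(f"{agent} returned no candidates")
    if any(p < 0 for p in raw.values()):
        raise ValueError(f"{agent} returned a negative probability")
    total = sum(raw.values())
    if total <= 0:
        raise ValueError(f"{agent} returned no probability mass")
    candidates = tuple(Candidate(a, p / total) for a, p in raw.items())
    return AgentDistribution(agent, candidates)
```

With a shape like this, the Judge can read both the top answer and the full spread from every subagent instead of a single fixed opinion.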

The Agent Fleet

This framework uses a fixed fleet of five agents, each modeled after a real delivery role.

Senior automation engineer: Focuses on tooling, browser execution, automation feasibility, technical validation, and the practical use of Playwright-based workflows.
Senior QA analyst: Focuses on requirements interpretation, test coverage depth, defect quality, edge cases, negative scenarios, and risk-based thinking.
Project manager: Brings release perspective into the process: priorities, dependencies, scope control, sequencing, and business alignment.
Principal engineer: Challenges weak assumptions, reviews architectural soundness, and checks whether the proposed testing approach is technically credible and scalable.
CTO as judge: Does not perform routine execution. Instead, this role evaluates the debate across each phase, determines whether the result is strong enough to move forward, and can request another review loop when the team has not reached a reliable conclusion.

SDLC-Oriented QA Orchestration

1. Requirement Analysis

This phase gathers the source of truth that the system will test against. The framework reads the relevant work item from a test case management platform, supporting documentation, and the design file if one exists. The goal is to turn those artifacts into a requirement map, identify ambiguity, capture visual expectations, and surface risk early.

2. Test Planning

This phase turns requirements into a strategy. The team creates a visual test plan, defines scope, prioritizes flows, identifies coverage goals, and decides how execution should happen.

3. Test Case Development

This phase produces the test cases themselves: expected outcomes, negative scenarios, edge conditions, and traceability back to the original requirements. The emphasis is not only on what to test, but on why each check matters.

4. Test Environment Setup

If needed, the system verifies that the environment is ready: access, credentials, fixtures, configuration, dependencies, and any limitations that could invalidate the run.

5. Test Execution

This is where the framework performs the validation. The system opens the browser, navigates the application, runs the planned checks, captures evidence, and compares actual behavior against expected behavior and design intent.

6. Test Cycle Closure

This final phase documents what happened, summarizes defects and risk, records evidence, and suggests follow-up actions or changes.

The structure is deliberately close to traditional QA practice. The difference is that each phase is coordinated through specialized AI roles and controlled review loops instead of being handled as an isolated manual effort.
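The six phases and their review loops can be sketched as a small driver. This is an illustration under stated assumptions, not the framework's actual implementation: `Phase`, `Verdict`, `run_lifecycle`, and the `max_loops` cap are hypothetical names, and the `review` callback stands in for the whole agent debate plus the judge's decision.

```python
from __future__ import annotations

from enum import Enum
from typing import Callable


class Phase(Enum):
    REQUIREMENT_ANALYSIS = "Requirement Analysis"
    TEST_PLANNING = "Test Planning"
    TEST_CASE_DEVELOPMENT = "Test Case Development"
    TEST_ENVIRONMENT_SETUP = "Test Environment Setup"
    TEST_EXECUTION = "Test Execution"
    TEST_CYCLE_CLOSURE = "Test Cycle Closure"


class Verdict(Enum):
    APPROVE = "approve"
    LOOP = "request another loop"
    REJECT = "reject"


def run_lifecycle(
    review: Callable[[Phase, int], Verdict],
    max_loops: int = 3,  # assumed cap so a phase cannot loop forever
) -> list[tuple[Phase, int, Verdict]]:
    """Run the phases in order, repeating a phase while the judge asks to loop."""
    if max_loops < 1:
        raise ValueError("max_loops must be at least 1")
    history: list[tuple[Phase, int, Verdict]] = []
    for phase in Phase:  # Enum members iterate in definition order
        verdict = Verdict.LOOP
        for attempt in range(1, max_loops + 1):
            verdict = review(phase, attempt)
            history.append((phase, attempt, verdict))
            if verdict is not Verdict.LOOP:
                break
        if verdict is not Verdict.APPROVE:
            break  # a rejection or an exhausted loop budget stops the run
    return history
```

The returned history doubles as an audit trail: it records every phase, every attempt, and the verdict that allowed or blocked progress.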

The Probabilistic Debate Layer

The main implementation change is simple: important outputs do not move forward as single answers.

For critical artifacts such as requirement interpretation, visual test plans, test case packages, anomaly diagnosis, and closure recommendations, each debating agent returns three to five candidate positions with probabilities instead of one final answer. This is the part of the framework most directly influenced by Verbalized Sampling.

So the flow becomes:

Senior automation engineer → candidate distribution
Senior QA analyst → candidate distribution
Project manager → candidate distribution
Principal engineer → reviews the distributions and challenges weak reasoning
CTO judge → decides using content quality, uncertainty, convergence, and evidence

That gives the Judge something much more useful than a set of single opinions: the leading candidate, the alternatives, the uncertainty spread, the disagreement pattern, and a signal for whether another loop is needed.
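One way to assemble that summary for the Judge is sketched below. The function name `summarize_debate`, the input shape (agent name mapped to a candidate-to-probability dictionary), and the output keys are all assumptions made for illustration.

```python
from __future__ import annotations

from collections import Counter


def summarize_debate(distributions: dict[str, dict[str, float]]) -> dict:
    """Condense per-agent distributions into the signals a judge would read."""
    if not distributions:
        raise ValueError("no agent distributions to summarize")

    # Each agent's own most likely candidate.
    top_picks = {agent: max(dist, key=dist.get) for agent, dist in distributions.items()}

    # The candidate chosen as top pick by the most agents.
    votes = Counter(top_picks.values())
    leader, leader_votes = votes.most_common(1)[0]

    # Average probability per candidate; a candidate an agent did not list counts as 0.
    candidates = sorted({c for dist in distributions.values() for c in dist})
    mean_probability = {
        c: sum(dist.get(c, 0.0) for dist in distributions.values()) / len(distributions)
        for c in candidates
    }

    return {
        "leader": leader,
        "agreement": leader_votes / len(distributions),
        "dissenters": sorted(a for a, pick in top_picks.items() if pick != leader),
        "mean_probability": mean_probability,
    }
```

A low `agreement` value or a non-empty `dissenters` list is the kind of disagreement pattern that could justify another loop.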

What the Judge Actually Evaluates

Consensus strength: If multiple agents independently place most of their probability mass on the same candidate, that is a strong signal.
Uncertainty spread: A distribution like 0.82 / 0.12 / 0.06 is much more decisive than 0.37 / 0.34 / 0.29. Diffuse distributions are often a sign that the phase is not stable enough to approve.
Role relevance: Not every role matters equally in every phase. The senior QA analyst may matter more during requirements and test-case design, while the senior automation engineer may matter more during environment setup and execution.
Evidence quality: A confident answer with weak artifact support should not outrank a slightly less confident answer backed by requirements, design files, logs, screenshots, or execution traces.
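The uncertainty-spread criterion can be quantified in several ways. One option, which is an assumption here and not something the framework prescribes, is normalized Shannon entropy: 0 means all probability mass sits on one candidate, and 1 means the mass is spread evenly.

```python
from __future__ import annotations

import math


def normalized_entropy(probabilities: list[float]) -> float:
    """Shannon entropy of a distribution, scaled to the range [0, 1]."""
    if len(probabilities) < 2:
        return 0.0  # a single candidate carries no uncertainty
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("probabilities must contain positive mass")
    shares = [p / total for p in probabilities if p > 0]
    entropy = -sum(s * math.log(s) for s in shares)
    return entropy / math.log(len(probabilities))


decisive = normalized_entropy([0.82, 0.12, 0.06])
diffuse = normalized_entropy([0.37, 0.34, 0.29])
print(f"decisive={decisive:.2f} diffuse={diffuse:.2f}")
```

For the two distributions quoted above, the decisive one lands at roughly 0.53 and the diffuse one above 0.99, so a cutoff somewhere between those values would separate them. Any specific cutoff would need tuning per project.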

Judge Scoring Model

A practical scoring formula is:

Final Candidate Score = Σ over debating agents (Agent Probability × Role Weight × Evidence Weight)

Where:

Agent probability is the probability that agent assigns to the candidate
Role weight reflects how much that agent’s role should matter in the current phase
Evidence weight reflects how strongly the candidate is supported by artifacts or execution evidence

Suggested Role Weights by Phase

Phase                     QA Analyst   Automation Eng.   Project Manager   Principal Eng.
Requirement Analysis      0.35         0.20              0.30              0.15
Test Planning             0.30         0.30              0.25              0.15
Test Case Development     0.35         0.30              0.15              0.20
Test Environment Setup    0.20         0.45              0.10              0.25
Test Execution            0.25         0.40              0.10              0.25
Test Cycle Closure        0.30         0.15              0.25              0.20

The CTO judge is not part of the weighted vote. The CTO is the final decision layer.
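The scoring formula and the weight table can be combined into a short worked sketch. Several things here are assumptions: the agent keys, the function name `score_candidates`, treating the evidence weight as one value per candidate on a 0 to 1 scale, and letting the principal engineer contribute a distribution because the table assigns that role a weight. Since the Test Cycle Closure row as listed sums to 0.90 and not 1.00, the sketch rescales each phase's weights so rows stay comparable.

```python
from __future__ import annotations

from collections import defaultdict

# Role weights copied from the table above, keyed by hypothetical agent ids.
ROLES = ("qa_analyst", "automation_engineer", "project_manager", "principal_engineer")
ROLE_WEIGHTS = {
    "Requirement Analysis": dict(zip(ROLES, (0.35, 0.20, 0.30, 0.15))),
    "Test Planning": dict(zip(ROLES, (0.30, 0.30, 0.25, 0.15))),
    "Test Case Development": dict(zip(ROLES, (0.35, 0.30, 0.15, 0.20))),
    "Test Environment Setup": dict(zip(ROLES, (0.20, 0.45, 0.10, 0.25))),
    "Test Execution": dict(zip(ROLES, (0.25, 0.40, 0.10, 0.25))),
    "Test Cycle Closure": dict(zip(ROLES, (0.30, 0.15, 0.25, 0.20))),
}


def score_candidates(
    phase: str,
    distributions: dict[str, dict[str, float]],
    evidence: dict[str, float],
) -> dict[str, float]:
    """Sum probability x role weight x evidence weight for every candidate."""
    weights = ROLE_WEIGHTS[phase]
    weight_total = sum(weights.values())  # rescale rows that do not sum to 1
    scores: dict[str, float] = defaultdict(float)
    for agent, dist in distributions.items():
        role_weight = weights[agent] / weight_total
        for candidate, probability in dist.items():
            # Assumption: a candidate without an evidence rating gets weight 1.0.
            scores[candidate] += probability * role_weight * evidence.get(candidate, 1.0)
    return dict(scores)


scores = score_candidates(
    "Test Execution",
    {
        "qa_analyst": {"real_defect": 0.7, "flaky_locator": 0.3},
        "automation_engineer": {"real_defect": 0.4, "flaky_locator": 0.6},
        "project_manager": {"real_defect": 0.5, "flaky_locator": 0.5},
        "principal_engineer": {"real_defect": 0.6, "flaky_locator": 0.4},
    },
    evidence={"real_defect": 0.9, "flaky_locator": 0.6},
)
```

In this made-up example, `real_defect` scores about 0.48 and `flaky_locator` about 0.28: the automation engineer's heavier weight favors the locator explanation, but stronger evidence and broader support push the defect candidate ahead.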

Judge Decision Policy

Approve: Use this when one candidate clearly leads, the evidence is strong, and the relevant agents converge.
Request another loop: Use this when top candidates are too close, probability distributions are diffuse, or the principal engineer identifies unresolved technical risk.
Reject: Use this when the reasoning is weak, key evidence is missing, or the proposed outcome conflicts with requirements, restrictions, or business constraints.

A practical threshold is this: if the top candidate's score is less than 15% higher than the second candidate's, the default should be another loop unless the evidence behind the top candidate is unusually strong.
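The decision policy and the 15% threshold can be written as a small function. This is a sketch under assumptions: the boolean flags stand in for judgments the CTO judge and principal engineer would make, and the names `Verdict` and `decide` are hypothetical.

```python
from __future__ import annotations

from enum import Enum


class Verdict(Enum):
    APPROVE = "approve"
    LOOP = "request another loop"
    REJECT = "reject"


def decide(
    scores: dict[str, float],
    *,
    evidence_missing: bool = False,
    conflicts_with_constraints: bool = False,
    unresolved_technical_risk: bool = False,
    unusually_strong_evidence: bool = False,
    min_margin: float = 0.15,  # top score must beat the runner-up by 15%
) -> Verdict:
    """Apply the approve / loop / reject policy to final candidate scores."""
    if not scores or evidence_missing or conflicts_with_constraints:
        return Verdict.REJECT
    if unresolved_technical_risk:
        return Verdict.LOOP

    ranked = sorted(scores.values(), reverse=True)
    top = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0.0

    # "Too close" means the leader is less than min_margin above the runner-up.
    too_close = second > 0 and top < second * (1 + min_margin)
    if too_close and not unusually_strong_evidence:
        return Verdict.LOOP
    return Verdict.APPROVE
```

For example, scores of 0.50 and 0.45 trigger another loop because 0.50 is below 0.45 × 1.15, while the same scores pass if the evidence behind the leader is flagged as unusually strong.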

Why Playwright CLI With Claude

For visual testing, I want to use Playwright CLI together with Claude.

That choice is intentional. The Playwright CLI repository says that, for coding agents, CLI is the best fit, and explains that CLI plus Skills is more token-efficient than MCP-style workflows because it avoids loading large tool schemas and verbose accessibility trees into model context. The same README lists “Token-efficient. Does not force page data into LLM” among the key features.

In this framework, Playwright CLI is not just a driver. It is the execution surface for visual validation. Claude provides reasoning and interpretation; Playwright CLI provides action and evidence. Together, they make the execution phase practical for a reusable QA workflow.

A Plugin-Style Template

The framework is meant to behave like a plugin-style template, not like a one-off system rebuilt for every project.

A team should be able to feed project data into the workflow — artifacts from a test case management platform, documentation, design files, rules, restrictions, environment details, and custom hooks — and then run the same structured lifecycle. The orchestration logic should remain stable while the project-specific inputs change.

That is the central design principle: input-driven, not rewrite-driven.

A QA engineer should be able to adapt the framework by updating requirements, prompts, rules, restrictions, project assets, environment configuration, and hook bindings, without redesigning the orchestration model itself.

Proposed Repository Structure

qa-ai-orchestration-template/
│
├── agents/
├── docs/
├── skills/
├── requirements/
├── rules/
├── restrictions/
├── prompts/
├── hooks/
├── schemas/
├── execution/
├── config/
└── README.md
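Because the template is meant to be input-driven, a project could verify the layout before a run starts. The helper names `scaffold` and `missing_entries` below are hypothetical; only the directory names come from the structure above, and the sketch makes no assumption about what each directory contains.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Top-level directories taken from the proposed repository structure.
EXPECTED_DIRS = (
    "agents", "docs", "skills", "requirements", "rules", "restrictions",
    "prompts", "hooks", "schemas", "execution", "config",
)


def scaffold(root: Path) -> None:
    """Create the empty template layout under `root`."""
    for name in EXPECTED_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    (root / "README.md").touch()


def missing_entries(root: Path) -> list[str]:
    """List the expected directories and files that are absent under `root`."""
    missing = [f"{name}/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    if not (root / "README.md").is_file():
        missing.append("README.md")
    return missing


if __name__ == "__main__":
    # Demonstrate on a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp)
        print("before:", missing_entries(project))
        scaffold(project)
        print("after:", missing_entries(project))
```

A check like this keeps the orchestration logic stable: a new project fills the same directories with its own inputs instead of changing the workflow.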
  

Conclusion

The rise of AI orchestrators does not mark the end of the QA engineer. It marks the end of treating testing as a purely mechanical exercise. As execution becomes cheaper, judgment becomes more valuable. The teams that benefit most from AI will not be the ones that automate the fastest, but the ones that learn how to structure decisions, challenge outputs, and keep quality aligned with real product risk. In that sense, the future of software testing may belong not to better scripts, but to better orchestration.

Opinions expressed by DZone contributors are their own.
