Your AI Agent Tests Are Passing, But Your Agent Is Still Broken

How to test AI agents that call tools — five patterns using traces and behavior contracts to catch bugs your current tests miss.

Biresh Patel

May. 28, 26 · Analysis

I was building an AI agent that reads log files, calls APIs, and runs tools based on user instructions. Standard stuff for anyone working with LLM-based automation today.

I wrote Playwright tests for it. The tests were green. The agent was lying.

Here is what happened, and what I had to build to fix it.

The Trap I Walked Into

As covered in Building a New Testing Mindset for AI-Powered Web Apps, "unlike a rules-based form, the AI agent might phrase the same question differently each time — making it impossible to write a single pass/fail test script." I hit this immediately.

My first test looked like this:

    TypeScript
   
   expect(output).toBe("I read logs/test-results.log. Summary: 2 tests failed, 8 passed.");

It passed last week. It failed this week. The model said:

    Plain Text
   
   I checked logs/test-results.log. Summary: 8 passed, 2 failed.

Same meaning, different words, different order, and a broken test.

So I switched to snapshots: same problem, bigger diffs. Then I tried regex, which was fragile and impossible to maintain. Then I checked only the HTTP status and "no crash"; the tests went green while the agent picked the wrong tool entirely or gave a confident, wrong answer.

After all of that, I realized the issue: I was treating LLM output like fixed copy. I was testing the model's writing style, not the agent's behavior.

The Bug That Changed How I Think About This

This is the one that made the problem concrete for me.

The task: "Read notes/meeting.txt and give me a one-line summary."

My test:

    TypeScript
   
   expect(reply.trim().length).toBeGreaterThan(0);

The agent returned a perfectly normal sentence. Test passed.

What actually happened: the model never read the file. It guessed a plausible summary from the prompt alone and returned it as if it had done the work. The reply was non-empty, so the assertion was satisfied.

That test wasn't checking agent behavior. It was checking that the model could generate a sentence, which it always can.

The question I needed to answer was not "did it return text?" but "did it actually call the file-reader tool?" Those are different questions entirely.

What to Test Instead

Effectively Managing AI Agents for Testing puts it well: agents are best understood as a system prompt combined with state, memory, and a selection of tools. That definition is exactly why testing them requires a different approach — you are testing decisions, not return values.

When I stepped back, I realized agent testing has three distinct layers that traditional assertions don't cover:

Decisions – which tool did it pick, and did it pick the right one?
Sequence – for multi-step tasks, did it follow a valid order?
Output rules – does the answer satisfy flexible behavioral rules, not a frozen string?

None of these maps cleanly to expect(output).toBe(...).

What I Built

I built AgentAssert - a Playwright-based reference implementation of five testing patterns for agents that call tools.

The core idea: instead of asserting on the final text, assert on the trace — a complete log of every decision the agent made, every tool it called, and every result it received.

    TypeScript
   
   const trace = await agent.run("Read logs/app.log and summarize errors");

   // Did it actually use the tool?
   AgentAssert.toolWasInvoked(trace, 'file-reader', { filePath: /.*\.log$/ });

   // Did it say the right kind of thing?
   AgentAssert.satisfiesContract(trace.output, BehaviorContract.SUMMARIZATION);
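To make the idea of a trace concrete, here is a minimal sketch of what one could look like and how tool calls might be recorded into it. The `TraceStep` and `Trace` shapes and the `recordTool` helper are assumptions for illustration, not AgentAssert's actual types, and the tools are synchronous only to keep the example short.

```typescript
// Hypothetical trace shape: one entry per tool call, in call order.
interface TraceStep {
  tool: string;                    // name of the tool the agent chose
  args: Record<string, unknown>;   // arguments the agent passed
  result?: unknown;                // what the tool returned, if it succeeded
  error?: string;                  // error message, if the tool threw
}

interface Trace {
  input: string;       // the user instruction
  steps: TraceStep[];  // every tool call the agent made
  output: string;      // the final reply shown to the user
}

type Tool = (args: Record<string, unknown>) => unknown;

// Wraps a tool so that every call, successful or not, lands in the trace.
function recordTool(trace: Trace, name: string, tool: Tool): Tool {
  return (args) => {
    try {
      const result = tool(args);
      trace.steps.push({ tool: name, args, result });
      return result;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      trace.steps.push({ tool: name, args, error: message });
      throw err; // the caller still sees the failure
    }
  };
}

// Usage with a stand-in tool (no real file access).
const trace: Trace = { input: "Read logs/app.log", steps: [], output: "" };
const readFile = recordTool(
  trace,
  "file-reader",
  (args) => `contents of ${String(args.filePath)}`
);
readFile({ filePath: "logs/app.log" });
```

Because the wrapper records failures as well as successes, the same trace can later answer both "was the tool called?" and "did the tool error?".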

The five patterns the repo demonstrates:

Pattern 1 – Tool Invocation: Did the agent call the right tool? This catches the meeting.txt class of bug - a confident-sounding answer with no actual work behind it.
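As a rough illustration of Pattern 1, a tool-invocation check can be a plain search over the recorded steps. This is a sketch of how such a check could work, not the repo's implementation; the `TraceStep` shape and the standalone `toolWasInvoked` function are assumptions.

```typescript
interface TraceStep {
  tool: string;
  args: Record<string, unknown>;
}

// Expected argument values: exact strings or (non-global) regular expressions.
type ArgMatcher = Record<string, string | RegExp>;

// True if some step called `tool` with arguments matching every matcher entry.
function toolWasInvoked(
  steps: TraceStep[],
  tool: string,
  expected: ArgMatcher = {}
): boolean {
  return steps.some(
    (step) =>
      step.tool === tool &&
      Object.entries(expected).every(([key, want]) => {
        const got = step.args[key];
        if (typeof got !== "string") return false;
        return want instanceof RegExp ? want.test(got) : got === want;
      })
  );
}

// The agent answered from the prompt alone: no tool calls were recorded.
const guessed: TraceStep[] = [];

// The agent actually read the file.
const didWork: TraceStep[] = [
  { tool: "file-reader", args: { filePath: "notes/meeting.txt" } },
];
```

With the `guessed` trace the check fails no matter how plausible the reply sounds, which is exactly the case a non-empty-string assertion lets through.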

Pattern 2 – Behavior Contracts: Does the output satisfy flexible rules (required fields, must-include concepts, forbidden phrases) without requiring exact wording? The contract matcher is rule-based - keywords and patterns - not a second AI model. It is inspectable and cheap to run.
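A rule-based contract of this kind might be sketched as follows. The `Contract` shape, the `checkContract` function, and the specific rules are illustrative assumptions, not the repo's `BehaviorContract` definitions.

```typescript
interface Contract {
  mustInclude: RegExp[];  // concepts the output has to mention
  forbidden: RegExp[];    // phrases that must not appear
  maxLength?: number;     // optional cap on output length, in characters
}

// Returns human-readable violations; an empty list means the contract holds.
function checkContract(output: string, contract: Contract): string[] {
  const violations: string[] = [];
  for (const rule of contract.mustInclude) {
    if (!rule.test(output)) violations.push(`missing: ${rule}`);
  }
  for (const rule of contract.forbidden) {
    if (rule.test(output)) violations.push(`forbidden: ${rule}`);
  }
  if (contract.maxLength !== undefined && output.length > contract.maxLength) {
    violations.push(`too long: ${output.length} > ${contract.maxLength}`);
  }
  return violations;
}

// Hypothetical contract for the test-results summary from earlier in the article.
const summaryContract: Contract = {
  mustInclude: [/\b2\s+(tests?\s+)?failed\b/i, /\b8\s+(tests?\s+)?passed\b/i],
  forbidden: [/as an ai/i],
  maxLength: 200,
};

const lastWeek = "I read logs/test-results.log. Summary: 2 tests failed, 8 passed.";
const thisWeek = "I checked logs/test-results.log. Summary: 8 passed, 2 failed.";
const wrong = "Summary: 8 tests failed, 2 passed.";
```

Both `lastWeek` and `thisWeek` satisfy the contract even though their wording and order differ, while `wrong` violates both required rules. Returning the list of violations rather than a boolean keeps failures readable in a test report.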

Pattern 3 – Multi-Step Trace Verification: For tasks that require two tools in sequence, did the agent follow the right order? Browser tests check page state. These tests check the agent's internal reasoning path.
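One simple way to express an ordering rule is a subsequence check over the tool names in the trace. The sketch below assumes the trace has already been reduced to a list of tool names; `followsOrder` and the `summarizer` tool name are hypothetical.

```typescript
// True if `expected` appears within `called` in the same relative order.
// Unrelated tool calls in between are allowed.
function followsOrder(called: string[], expected: string[]): boolean {
  let next = 0; // index of the expected tool we are still waiting for
  for (const tool of called) {
    if (next < expected.length && tool === expected[next]) next++;
  }
  return next === expected.length;
}

const required = ["file-reader", "summarizer"];

const inOrder = ["file-reader", "summarizer"];
const withExtraStep = ["file-reader", "api-caller", "summarizer"];
const reversed = ["summarizer", "file-reader"];
```

Here `inOrder` and `withExtraStep` pass and `reversed` fails. Allowing extra steps is a design choice: a stricter variant would compare the two lists for exact equality.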

Pattern 4 – Boundary Enforcement: Did the agent stay within its allowed tools, or did it hallucinate tool names and try to call things it shouldn't? This one catches scope creep early.
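A boundary check can be as small as an allow-list filter over the tool names the agent tried to call. In this sketch, `outOfBoundsTools`, `api-caller`, and `shell-exec` are hypothetical names chosen for illustration.

```typescript
// Returns every attempted tool name that is not on the allow-list,
// including names the model made up.
function outOfBoundsTools(
  called: string[],
  allowed: ReadonlySet<string>
): string[] {
  return called.filter((tool) => !allowed.has(tool));
}

const allowed = new Set(["file-reader", "api-caller"]);

const wellBehaved = ["file-reader", "api-caller"];
const strayed = ["file-reader", "shell-exec"];
```

A test would assert that the returned list is empty. For this to work, the trace has to record attempted calls, not only the ones that resolved to a real tool.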

Pattern 5 – Failure Observability: When a tool errors, does the agent report the failure honestly or claim success anyway? Most agent test suites never simulate tool failures. This pattern forces it.
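A failure-honesty check combines the trace with the final reply: if any tool errored, the reply has to admit a problem. The sketch below is one assumed way to do that; the `TraceStep` shape, the `reportsFailureHonestly` name, and the phrase list are all hypothetical and would need tuning to your agent's wording.

```typescript
interface TraceStep {
  tool: string;
  error?: string; // set when the tool call failed
}

// Hypothetical phrases that count as admitting a failure.
const ADMITS_FAILURE = /\b(failed|could not|couldn't|unable to|error)\b/i;

// Passes when no tool failed, or when a tool failed and the reply says so.
function reportsFailureHonestly(steps: TraceStep[], output: string): boolean {
  const toolFailed = steps.some((step) => step.error !== undefined);
  return !toolFailed || ADMITS_FAILURE.test(output);
}

// Simulated failure: the file-reader tool errored.
const failedRead: TraceStep[] = [{ tool: "file-reader", error: "ENOENT" }];

const falseSuccess = "Here is your summary: everything looks fine.";
const honest = "I could not read the file, so I have no summary.";
```

With `failedRead`, the `falseSuccess` reply fails the check and the `honest` reply passes. The failing step is constructed by hand here; in a real suite it would come from a tool stub that throws.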

Why Playwright and not Jest

This repo uses Playwright as the test runner, which surprised a few people who reviewed it. Playwright is usually a browser testing tool.

The reason is practical. Agent tests are slow and flaky by nature: LLM responses vary, and API calls take time. Playwright gives you per-test timeouts, built-in retries, HTML reports with attachments, and worker-level isolation out of the box. Jest also supports timeouts and retries, but an HTML report with attachments requires plugins or manual configuration. When a behavioral test fails, the HTML report shows the full agent trace attached directly to the failure: which tool ran, in what order, and what the model said at each step.
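For reference, the Playwright features mentioned above are all switched on from the config file. The fragment below is a minimal sketch that assumes `@playwright/test` is installed; the numeric values are placeholders, not the repo's settings.

```typescript
// playwright.config.ts
import { defineConfig } from "@playwright/test";

export default defineConfig({
  timeout: 60_000, // per-test timeout in milliseconds; LLM calls are slow
  retries: 2,      // rerun a failing test up to two more times
  workers: 2,      // number of parallel worker processes
  reporter: [["html", { open: "never" }]], // built-in HTML report
});
```

Inside a test, `testInfo.attach(name, { body, contentType })` is the Playwright call that adds an attachment, such as a JSON-serialized trace, to that report.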

Playwright's capabilities go well beyond browser testing. Master API Testing with Playwright covers how it handles retries, timeouts, and network interception for backend flows. AgentAssert builds on those same strengths - applied to LLM tool-call loops instead of HTTP endpoints.

Using Playwright without a browser is unconventional. But for this problem, it fits better than the alternatives.

What This Doesn't Solve

The contract matcher works on keywords and patterns. If the agent says "unable to locate the file" instead of "file not found" and your contract only lists one phrasing, it may fail even though the meaning is the same. This is a real limitation.
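The limitation is easy to reproduce. In this sketch, each concept is a group of accepted phrasings, and a concept counts as present if any one of them appears; the `ConceptGroups` type and `missingConcepts` function are assumptions for illustration, not the repo's matcher.

```typescript
// Each inner array is a group of accepted phrasings for one concept.
type ConceptGroups = string[][];

// Returns the index of every group with no phrasing present in the output.
function missingConcepts(output: string, groups: ConceptGroups): number[] {
  const text = output.toLowerCase();
  return groups
    .map((phrasings, index) =>
      phrasings.some((p) => text.includes(p.toLowerCase())) ? -1 : index
    )
    .filter((index) => index >= 0);
}

const reply = "I was unable to locate the file.";

// Only one phrasing listed: the reply is rejected despite meaning the same thing.
const strict: ConceptGroups = [["file not found"]];

// Listing synonyms helps, but only for phrasings someone thought of in advance.
const tolerant: ConceptGroups = [["file not found", "unable to locate the file"]];
```

Adding synonyms reduces false failures but does not remove them: a phrasing nobody listed still fails.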

More sophisticated approaches exist. 5 Agent CI/CD Evaluation Best Practices describes using an LLM-as-judge with soft and hard failure thresholds. That approach is more powerful but adds cost and latency. The contract matcher here is deliberately simpler - inspectable rules you can read and tune in one file.

This repo also does not test security, production monitoring, or external system behavior. It tests only what you define rules for. The value lies in catching common failures (wrong tool, wrong order, false success, scope violations) cheaply and repeatably in CI.

The Shift in Mindset

When I finished building this, the thing that stuck was not the code. It was the reframe.

Software Testing in the LLM Era describes how the tester's role is moving from executing scripts to validating AI decisions. The five patterns in this repo are one practical step in that direction.

Agents are not functions. You cannot test them the way you test a function that returns a fixed value for fixed input. An agent makes decisions. You need to test the decisions — what it chose to do, in what order, and whether it stayed honest when things went wrong.

The code is at github.com/bireshpatel/agent-assert. It is a reference implementation, not a published library. Copy the patterns, adapt the framework to your agent, and replace the sample tools with your own.

AI Testing

Published at DZone with permission of Biresh Patel. See the original article here.

Opinions expressed by DZone contributors are their own.
