Model-Based Testing for AI Agents [Deep Dive 2026]
Bottom Line
Treat the agent workflow as a state machine, not a pile of example prompts. Once you model allowed actions and invariants, MBT will generate edge-case sequences humans rarely think to test.
Key Takeaways
- Hypothesis defaults to 100 examples and 50 stateful steps per example.
- Model the workflow first: states, actions, guards, and invariants.
- Use a thin harness so MBT drives the real orchestration layer, not mocks only.
- The highest-value invariant is usually sequence safety around tool calls.
- Run MBT with pytest -q in CI and shrink failures into minimal repros.
Complex AI agents fail less often because a single prompt is wrong and more often because a sequence is wrong: a tool result arrives late, the planner answers before validation, or memory mutates across retries. Model-Based Testing fixes that by testing the workflow as a state machine. In this guide, you’ll use Python, Hypothesis stateful testing, and pytest to generate realistic action sequences and catch invalid transitions before they land in production.
Prerequisites
What you need
- Python 3 with `pytest` and `hypothesis` installed.
- An agent orchestration layer with explicit operations such as submit goal, request tool, receive tool output, and answer.
- A place to store transcripts or traces from test runs for debugging and shrinking failures.
- Redacted fixtures if you replay real conversations; use the Data Masking Tool before committing prompts or tool payloads.
Bottom Line
The winning pattern is simple: keep a small abstract model, drive the real agent through a thin harness, and assert invariants after every generated step. That gives you broad workflow coverage without hand-authoring dozens of brittle end-to-end cases.
Before writing any code, decide what counts as a workflow bug. For most tool-using agents, that list is short and concrete:
- A final answer is emitted while a tool call is still unresolved.
- A tool result is accepted even though no tool was requested.
- Memory or planner state leaks across a reset or retry.
- The same tool is called twice when the workflow contract allows only one call.
Step 1: Model the workflow
Start with the smallest state model that still exposes the agent’s contract. Do not mirror every internal class. Model only what determines legal next steps.
Pick the states and transitions
- idle: no user goal yet.
- goal_received: the agent may plan, request a tool, or answer directly.
- tool_pending: the agent must wait for tool output.
- completed: the final answer is sent and no further action is legal.
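Before any Hypothesis code exists, the same contract can be written down as a small transition table. This is an illustrative sketch; the names `LEGAL_ACTIONS` and `is_legal` are ours, not part of any library:

```python
# Hypothetical contract table: which actions are legal in each state.
LEGAL_ACTIONS: dict[str, set[str]] = {
    'idle': {'submit_goal'},
    'goal_received': {'request_tool', 'answer'},
    'tool_pending': {'return_tool_result'},
    'completed': set(),  # terminal: no further action is legal
}


def is_legal(state: str, action: str) -> bool:
    """Return True if the workflow contract allows `action` in `state`."""
    return action in LEGAL_ACTIONS.get(state, set())
```

Writing the table first is a useful sanity check: every rule and precondition in the state machine below should correspond to exactly one row here.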
Then encode a minimal harness. This adapter should call your real orchestration entry points while exposing just enough state for the test model.
```python
from dataclasses import dataclass, field


@dataclass
class AgentHarness:
    state: str = 'idle'
    last_tool: str | None = None
    transcript: list[dict] = field(default_factory=list)

    def submit_goal(self, text: str) -> None:
        assert self.state == 'idle'
        self.state = 'goal_received'
        self.transcript.append({'role': 'user', 'text': text})

    def request_tool(self, tool_name: str) -> None:
        assert self.state == 'goal_received'
        self.state = 'tool_pending'
        self.last_tool = tool_name
        self.transcript.append({'role': 'assistant', 'tool_call': tool_name})

    def return_tool_result(self, payload: dict) -> None:
        assert self.state == 'tool_pending'
        self.state = 'goal_received'
        self.transcript.append({'role': 'tool', 'name': self.last_tool, 'payload': payload})

    def answer(self, text: str) -> None:
        assert self.state == 'goal_received'
        self.state = 'completed'
        self.transcript.append({'role': 'assistant', 'text': text})

    def reset(self) -> None:
        self.state = 'idle'
        self.last_tool = None
        self.transcript.clear()
```

Step 2: Build the state machine
Now let Hypothesis generate whole workflows instead of single inputs. Its RuleBasedStateMachine abstraction is the right fit because it chooses both values and action order.
Encode actions as rules
```python
import hypothesis.strategies as st
from hypothesis import settings
from hypothesis.stateful import RuleBasedStateMachine, initialize, invariant, precondition, rule


@settings(max_examples=100, stateful_step_count=50)
class AgentWorkflowModel(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.agent = AgentHarness()

    @initialize()
    def start_clean(self) -> None:
        self.agent.reset()

    @precondition(lambda self: self.agent.state == 'idle')
    @rule(goal=st.text(min_size=1, max_size=80))
    def submit_goal(self, goal: str) -> None:
        self.agent.submit_goal(goal)

    @precondition(lambda self: self.agent.state == 'goal_received')
    @rule(tool_name=st.sampled_from(['search_docs', 'fetch_ticket', 'quote_price']))
    def request_tool(self, tool_name: str) -> None:
        self.agent.request_tool(tool_name)

    @precondition(lambda self: self.agent.state == 'tool_pending')
    @rule(payload=st.dictionaries(
        keys=st.text(min_size=1, max_size=12),
        values=st.one_of(st.integers(), st.text(max_size=20)),
        max_size=3,
    ))
    def return_tool_result(self, payload: dict) -> None:
        self.agent.return_tool_result(payload)

    @precondition(lambda self: self.agent.state == 'goal_received')
    @rule(answer=st.text(min_size=1, max_size=120))
    def answer(self, answer: str) -> None:
        self.agent.answer(answer)

    @invariant()
    def transcript_matches_state(self) -> None:
        if self.agent.state == 'tool_pending':
            assert self.agent.transcript[-1]['role'] == 'assistant'
            assert 'tool_call' in self.agent.transcript[-1]

    @invariant()
    def no_messages_after_completion(self) -> None:
        if self.agent.state == 'completed':
            assert self.agent.transcript[-1]['role'] == 'assistant'


TestAgentWorkflow = AgentWorkflowModel.TestCase
```

Why this works
- Rules define legal operations the generator can apply.
- Preconditions prevent impossible actions from wasting search effort.
- Invariants run after every step, so failures surface exactly where the contract breaks.
- Shrinking reduces a long random sequence to the smallest reproducible failure.
The two settings above matter in practice. Hypothesis defaults to 100 examples and a `stateful_step_count` of 50, which is a strong baseline for workflow discovery without making local test runs painful.
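When you want different budgets locally and in CI, Hypothesis settings profiles keep the trade-off explicit. A sketch, assuming the profile names `dev` and `ci` are your own convention:

```python
import os

from hypothesis import settings

# Register a fast local profile and a deeper CI profile (hypothetical
# names and budgets; tune to your suite).
settings.register_profile('dev', max_examples=25, stateful_step_count=20)
settings.register_profile('ci', max_examples=500, stateful_step_count=100)

# Pick the deeper profile on CI, the fast one locally.
settings.load_profile('ci' if os.environ.get('CI') else 'dev')
```

If you adopt profiles, drop the per-class `@settings` decorator shown earlier, since a decorator on the state machine would otherwise override the loaded profile. Hypothesis's pytest plugin also accepts `--hypothesis-profile=ci` to select a registered profile from the command line.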
Step 3: Run generated sequences
Put the harness and state machine into your test suite and execute it with pytest -q. The -q flag is useful here because stateful tests can emit dense output once a failure shrinks.
```
pytest -q
```

In mature systems, add one more layer: compare the generated workflow against a second source of truth. That source can be:
- A simpler in-memory reference model.
- A policy validator that approves or rejects each transition.
- A transcript checker that enforces message ordering, tool pairing, or idempotency.
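A minimal sketch of the first option, an in-memory reference model. The class name `ReferenceModel` and the transition names are illustrative; in each rule you would apply the same action to both the harness and this model, then assert their states agree:

```python
class ReferenceModel:
    """Pure-Python mirror of the workflow contract, with no I/O."""

    # (current state, action) -> next state
    NEXT = {
        ('idle', 'submit_goal'): 'goal_received',
        ('goal_received', 'request_tool'): 'tool_pending',
        ('tool_pending', 'return_tool_result'): 'goal_received',
        ('goal_received', 'answer'): 'completed',
    }

    def __init__(self) -> None:
        self.state = 'idle'

    def apply(self, action: str) -> None:
        key = (self.state, action)
        assert key in self.NEXT, f'illegal action {action!r} in state {self.state!r}'
        self.state = self.NEXT[key]
```

Because the reference model is trivial to read, a disagreement between it and the real orchestration layer is almost always a bug in the latter.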
Add one high-value invariant for AI agents
If your agent can call external tools, start with sequencing safety. This single invariant catches many real-world defects:
```python
@invariant()
def final_answer_requires_no_pending_tool(self) -> None:
    if self.agent.state == 'completed':
        # every tool call must have received a matching tool result
        # by the time the final answer is emitted
        calls = sum(1 for msg in self.agent.transcript if 'tool_call' in msg)
        results = sum(1 for msg in self.agent.transcript if msg.get('role') == 'tool')
        assert calls == results
```

For richer agents, extend the model with retry limits, human approval checkpoints, or memory snapshots. Keep the model abstract. If you reproduce every internal branch, your test model becomes as hard to maintain as the production code it is supposed to verify.
Verification and expected output
A passing run is intentionally boring. You want a clean suite and no contract violations.
Expected output
```
$ pytest -q
.
1 passed
```

When the workflow is broken, Hypothesis will search, then shrink. That usually gives you a tiny counterexample such as:
- Submit goal
- Request tool
- Answer immediately
That is the real payoff of MBT for agents. Instead of reading a giant trace, you get the shortest sequence that proves the orchestration contract is invalid. Keep those shrunk sequences as regression tests once fixed.
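Pinning a shrunk sequence is a plain, example-based test. A sketch of the three-step counterexample above; `AgentHarness` here is a cut-down stand-in for the Step 1 harness so the snippet is self-contained:

```python
from dataclasses import dataclass


@dataclass
class AgentHarness:  # minimal stand-in for the Step 1 harness
    state: str = 'idle'

    def submit_goal(self, text: str) -> None:
        assert self.state == 'idle'
        self.state = 'goal_received'

    def request_tool(self, name: str) -> None:
        assert self.state == 'goal_received'
        self.state = 'tool_pending'

    def answer(self, text: str) -> None:
        # the contract: no final answer while a tool call is unresolved
        assert self.state == 'goal_received'
        self.state = 'completed'


def test_no_answer_while_tool_pending() -> None:
    """Regression pin for the shrunk sequence: goal -> tool -> answer."""
    agent = AgentHarness()
    agent.submit_goal('refund order')
    agent.request_tool('fetch_ticket')
    try:
        agent.answer('done')
        raise RuntimeError('answer() should have been rejected')
    except AssertionError:
        pass  # the contract held
```

Once the orchestration bug is fixed, this test stays green forever and documents the exact sequence that once broke.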
What to verify beyond green tests
- The failure message points to a violated invariant, not a generic timeout.
- The harness transcript is readable enough to debug without replaying the entire stack.
- The same failure reproduces deterministically in CI after it is discovered.
Troubleshooting and what’s next
Troubleshooting: top 3 issues
- The generator rarely reaches the interesting states. Your model is too restrictive or your preconditions are blocking progress. Add a simpler path to tool_pending or loosen strategy ranges so more rules stay applicable.
- Failures are noisy and hard to debug. Your invariants are too broad. Split one big invariant into smaller contract checks so the shrunk sequence tells you exactly what broke.
- The tests pass, but production still fails. Your harness is too fake. Move the model one layer closer to the real orchestration logic and assert on the real transcript shape, tool envelope, or retry behavior.
What’s next
- Add a reference model for memory updates so retries and resets cannot leak state.
- Track tool call IDs and assert that every tool result matches a prior request exactly once.
- Run the same MBT suite against multiple model backends to catch orchestration assumptions that only fail with one provider.
- Store shrunk failures as fixed regression cases beside your example-driven integration tests.
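The tool-call ID pairing idea can start as a standalone transcript validator. A sketch, assuming a transcript shape where assistant messages carry a `tool_call_id` key and tool results echo it back; adapt the keys to your real envelope:

```python
def tool_calls_paired(transcript: list[dict]) -> bool:
    """Every tool result must match a prior tool call ID exactly once."""
    pending: set[str] = set()
    for msg in transcript:
        if msg.get('role') == 'assistant' and 'tool_call_id' in msg:
            if msg['tool_call_id'] in pending:
                return False  # duplicate call ID issued
            pending.add(msg['tool_call_id'])
        elif msg.get('role') == 'tool':
            if msg.get('tool_call_id') not in pending:
                return False  # result without a matching request
            pending.discard(msg['tool_call_id'])
    return True
```

Dropped into the state machine as an `@invariant()`, this turns "every result matches a request exactly once" from a code-review note into a generated, shrinkable check.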
Once you have the first state machine working, MBT stops being a niche testing technique and becomes a workflow safety net. For AI agents, that is the level that matters: not whether one prompt looked good, but whether the entire sequence stayed valid under pressure.
Frequently Asked Questions
What is model-based testing for AI agents?
Model-based testing treats the agent workflow as a state machine: you define the legal states, actions, and invariants, and a generator produces action sequences that drive the real orchestration layer and check the contract after every step.

How is MBT different from prompt testing or evals?
Prompt tests and evals judge individual inputs and outputs. MBT tests the sequence, so it catches bugs such as an answer emitted while a tool call is pending, or state leaking across a reset, which no single-prompt check will surface.

Why use Hypothesis for stateful agent testing?
Hypothesis provides `RuleBasedStateMachine`, preconditions, invariants, and shrinking. That combination is useful for agent workflows because it can discover a failing sequence and then reduce it to the shortest reproducible case.

What should I model first in a tool-using agent?
Start with the core tool loop: `idle`, `goal_received`, `tool_pending`, and `completed`, then assert that a final answer cannot occur while a tool result is still pending.