Model-Based Testing for AI Agents [Deep Dive 2026]

Dillip Chowdary · Tech Entrepreneur & Innovator · May 06, 2026 · 9 min read

Bottom Line

Treat the agent workflow as a state machine, not a pile of example prompts. Once you model allowed actions and invariants, MBT will generate edge-case sequences humans rarely think to test.

Key Takeaways

  • Hypothesis defaults to 100 examples and 50 stateful steps per example.
  • Model the workflow first: states, actions, guards, and invariants.
  • Use a thin harness so MBT drives the real orchestration layer, not mocks only.
  • The highest-value invariant is usually sequence safety around tool calls.
  • Run MBT with pytest -q in CI and shrink failures into minimal repros.

Complex AI agents fail less often because a single prompt is wrong and more often because a sequence is wrong: a tool result arrives late, the planner answers before validation, or memory mutates across retries. Model-Based Testing fixes that by testing the workflow as a state machine. In this guide, you’ll use Python, Hypothesis stateful testing, and pytest to generate realistic action sequences and catch invalid transitions before they land in production.

Prerequisites

What you need

  • Python 3 with pytest and hypothesis installed.
  • An agent orchestration layer with explicit operations such as submit goal, request tool, receive tool output, and answer.
  • A place to store transcripts or traces from test runs for debugging and shrinking failures.
  • Redacted fixtures if you replay real conversations; use the Data Masking Tool before committing prompts or tool payloads.

The core pattern

The winning pattern is simple: keep a small abstract model, drive the real agent through a thin harness, and assert invariants after every generated step. That gives you broad workflow coverage without hand-authoring dozens of brittle end-to-end cases.

Before writing any code, decide what counts as a workflow bug. For most tool-using agents, that list is short and concrete:

  • A final answer is emitted while a tool call is still unresolved.
  • A tool result is accepted even though no tool was requested.
  • Memory or planner state leaks across a reset or retry.
  • The same tool is called twice when the workflow contract allows only one call.

Step 1: Model the workflow

Start with the smallest state model that still exposes the agent’s contract. Do not mirror every internal class. Model only what determines legal next steps.

Pick the states and transitions

  • idle: no user goal yet.
  • goal_received: the agent may plan, request a tool, or answer directly.
  • tool_pending: the agent must wait for tool output.
  • completed: the final answer is sent and no further action is legal.
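
One way to pin down this contract before touching real code is to write the abstract model as data. The snippet below is a minimal sketch: the transition table and helper are illustrative, not part of any library, and the action names simply mirror the harness methods introduced next.

# Hypothetical transition table for the abstract model above.
# Keys are (current_state, action); values are the resulting state.
TRANSITIONS = {
    ('idle', 'submit_goal'): 'goal_received',
    ('goal_received', 'request_tool'): 'tool_pending',
    ('tool_pending', 'return_tool_result'): 'goal_received',
    ('goal_received', 'answer'): 'completed',
}

def next_state(state: str, action: str) -> str:
    """Return the successor state, or fail loudly on an illegal action."""
    key = (state, action)
    assert key in TRANSITIONS, f'illegal action {action!r} in state {state!r}'
    return TRANSITIONS[key]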

Then encode a minimal harness. This adapter should call your real orchestration entry points, while exposing just enough state for the test model.

from dataclasses import dataclass, field

@dataclass
class AgentHarness:
    state: str = 'idle'
    last_tool: str | None = None
    transcript: list[dict] = field(default_factory=list)

    def submit_goal(self, text: str) -> None:
        assert self.state == 'idle'
        self.state = 'goal_received'
        self.transcript.append({'role': 'user', 'text': text})

    def request_tool(self, tool_name: str) -> None:
        assert self.state == 'goal_received'
        self.state = 'tool_pending'
        self.last_tool = tool_name
        self.transcript.append({'role': 'assistant', 'tool_call': tool_name})

    def return_tool_result(self, payload: dict) -> None:
        assert self.state == 'tool_pending'
        self.state = 'goal_received'
        self.transcript.append({'role': 'tool', 'name': self.last_tool, 'payload': payload})

    def answer(self, text: str) -> None:
        assert self.state == 'goal_received'
        self.state = 'completed'
        self.transcript.append({'role': 'assistant', 'text': text})

    def reset(self) -> None:
        self.state = 'idle'
        self.last_tool = None
        self.transcript.clear()

Pro tip: If your agent uses the OpenAI Responses API with function calling, model tool request and tool result as separate transitions. That is where most sequencing bugs hide.

Step 2: Build the state machine

Now let Hypothesis generate whole workflows instead of single inputs. Its RuleBasedStateMachine abstraction is the right fit because it chooses both values and action order.

Encode actions as rules

import hypothesis.strategies as st
from hypothesis import settings
from hypothesis.stateful import RuleBasedStateMachine, initialize, invariant, precondition, rule

@settings(max_examples=100, stateful_step_count=50)
class AgentWorkflowModel(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.agent = AgentHarness()

    @initialize()
    def start_clean(self) -> None:
        self.agent.reset()

    @precondition(lambda self: self.agent.state == 'idle')
    @rule(goal=st.text(min_size=1, max_size=80))
    def submit_goal(self, goal: str) -> None:
        self.agent.submit_goal(goal)

    @precondition(lambda self: self.agent.state == 'goal_received')
    @rule(tool_name=st.sampled_from(['search_docs', 'fetch_ticket', 'quote_price']))
    def request_tool(self, tool_name: str) -> None:
        self.agent.request_tool(tool_name)

    @precondition(lambda self: self.agent.state == 'tool_pending')
    @rule(payload=st.dictionaries(
        keys=st.text(min_size=1, max_size=12),
        values=st.one_of(st.integers(), st.text(max_size=20)),
        max_size=3,
    ))
    def return_tool_result(self, payload: dict) -> None:
        self.agent.return_tool_result(payload)

    @precondition(lambda self: self.agent.state == 'goal_received')
    @rule(answer=st.text(min_size=1, max_size=120))
    def answer(self, answer: str) -> None:
        self.agent.answer(answer)

    @invariant()
    def transcript_matches_state(self) -> None:
        if self.agent.state == 'tool_pending':
            assert self.agent.transcript[-1]['role'] == 'assistant'
            assert 'tool_call' in self.agent.transcript[-1]

    @invariant()
    def no_messages_after_completion(self) -> None:
        if self.agent.state == 'completed':
            assert self.agent.transcript[-1]['role'] == 'assistant'

TestAgentWorkflow = AgentWorkflowModel.TestCase

Why this works

  • Rules define legal operations the generator can apply.
  • Preconditions prevent impossible actions from wasting search effort.
  • Invariants run after every step, so failures surface exactly where the contract breaks.
  • Shrinking reduces a long random sequence to the smallest reproducible failure.

The two settings above matter in practice. Hypothesis already defaults to max_examples=100 and stateful_step_count=50, so the decorator simply makes that baseline explicit; it is a strong starting point for workflow discovery without making local test runs painful.
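
If you want a deeper search in CI without slowing local runs, Hypothesis settings profiles are one option. Below is a minimal sketch that assumes a conftest.py and an environment variable you control; the profile names and budgets are illustrative.

import os

from hypothesis import settings

# Two profiles: the baseline for local runs, a heavier one for CI.
settings.register_profile('dev', max_examples=100, stateful_step_count=50)
settings.register_profile('ci', max_examples=500, stateful_step_count=100)

# Select the profile from an environment variable, e.g. HYPOTHESIS_PROFILE=ci.
settings.load_profile(os.getenv('HYPOTHESIS_PROFILE', 'dev'))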

Step 3: Run generated sequences

Put the harness and state machine into your test suite and execute it with pytest -q. The -q flag is useful here because stateful tests can emit dense output once a failure shrinks.

pytest -q

In mature systems, add one more layer: compare the generated workflow against a second source of truth. That source can be:

  • A simpler in-memory reference model.
  • A policy validator that approves or rejects each transition.
  • A transcript checker that enforces message ordering, tool pairing, or idempotency (a minimal sketch follows this list).
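
As a sketch of the last option, a transcript checker can walk the messages the harness recorded and enforce tool pairing. The helper below assumes the transcript shape used by AgentHarness and can be called from an @invariant() so it runs after every generated step.

def check_tool_pairing(transcript: list[dict]) -> None:
    """Enforce ordering: results answer open calls, answers need zero open calls."""
    open_calls = 0
    for msg in transcript:
        if msg.get('role') == 'assistant' and 'tool_call' in msg:
            open_calls += 1
        elif msg.get('role') == 'tool':
            assert open_calls > 0, 'tool result arrived without a pending tool call'
            open_calls -= 1
        elif msg.get('role') == 'assistant' and 'text' in msg:
            assert open_calls == 0, 'answer emitted while a tool call was still open'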

Add one high-value invariant for AI agents

If your agent can call external tools, start with sequencing safety. This single invariant catches many real-world defects:

@invariant()
def final_answer_requires_no_pending_tool(self) -> None:
    if self.agent.state == 'completed':
        calls = sum(1 for msg in self.agent.transcript if 'tool_call' in msg)
        results = sum(1 for msg in self.agent.transcript if msg.get('role') == 'tool')
        assert calls == results, 'final answer emitted while a tool call is unresolved'

For richer agents, extend the model with retry limits, human approval checkpoints, or memory snapshots. Keep the model abstract. If you reproduce every internal branch, your test model becomes as hard to maintain as the production code it is supposed to verify.
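
As an example of such an extension, a bounded retry can be modeled with one extra rule and one extra invariant. This is a sketch only: it assumes start_clean() also sets self.retries = 0 and that the harness grows a retry_tool() method, neither of which appears in the code above.

@precondition(lambda self: self.agent.state == 'tool_pending' and self.retries < 2)
@rule()
def retry_tool(self) -> None:
    # Drive the assumed harness method and count the attempt.
    self.agent.retry_tool()
    self.retries += 1

@invariant()
def retries_stay_bounded(self) -> None:
    assert getattr(self, 'retries', 0) <= 2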

Watch out: Do not hide workflow bugs behind mocks that always succeed. Your harness should hit the real planner, state reducer, or tool-routing layer, even if external APIs are stubbed.

Verification and expected output

A passing run is intentionally boring. You want a clean suite and no contract violations.

Expected output

$ pytest -q
.
1 passed

When the workflow is broken, Hypothesis will search, then shrink. That usually gives you a tiny counterexample such as:

  • Submit goal
  • Request tool
  • Answer immediately

That is the real payoff of MBT for agents. Instead of reading a giant trace, you get the shortest sequence that proves the orchestration contract is invalid. Keep those shrunk sequences as regression tests once fixed.
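
One way to pin a shrunk sequence is a plain pytest case that replays it step by step. Here is a minimal sketch against the AgentHarness above; the goal and answer strings are placeholders, and the test assumes the fix makes the premature answer fail loudly.

import pytest

def test_regression_answer_while_tool_pending() -> None:
    # Replay the minimal counterexample Hypothesis shrank to.
    agent = AgentHarness()
    agent.submit_goal('placeholder goal')
    agent.request_tool('search_docs')
    # Answering now must be rejected, because a tool call is still unresolved.
    with pytest.raises(AssertionError):
        agent.answer('placeholder answer')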

What to verify beyond green tests

  • The failure message points to a violated invariant, not a generic timeout.
  • The harness transcript is readable enough to debug without replaying the entire stack.
  • The same failure reproduces deterministically in CI after it is discovered.

Troubleshooting and what’s next

Troubleshooting: top 3 issues

  1. The generator rarely reaches the interesting states. Your model is too restrictive or your preconditions are blocking progress. Add a simpler path to tool_pending or loosen strategy ranges so more rules stay applicable.
  2. Failures are noisy and hard to debug. Your invariants are too broad. Split one big invariant into smaller contract checks so the shrunk sequence tells you exactly what broke.
  3. The tests pass, but production still fails. Your harness is too fake. Move the model one layer closer to the real orchestration logic and assert on the real transcript shape, tool envelope, or retry behavior.

What’s next

  • Add a reference model for memory updates so retries and resets cannot leak state.
  • Track tool call IDs and assert that every tool result matches a prior request exactly once (a sketch follows this list).
  • Run the same MBT suite against multiple model backends to catch orchestration assumptions that only fail with one provider.
  • Store shrunk failures as fixed regression cases beside your example-driven integration tests.
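
For the tool-call-ID item above, the eventual check might look like the sketch below. It assumes the harness is extended so that both tool-call and tool-result messages carry an 'id' field, which the AgentHarness shown earlier does not yet do.

def assert_results_match_requests(transcript: list[dict]) -> None:
    """Every tool result must match a prior tool call exactly once, by ID."""
    call_ids = [msg['id'] for msg in transcript if 'tool_call' in msg]
    result_ids = [msg['id'] for msg in transcript if msg.get('role') == 'tool']
    assert len(call_ids) == len(set(call_ids)), 'duplicate tool call IDs'
    assert sorted(call_ids) == sorted(result_ids), 'unmatched tool call or tool result'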

Once you have the first state machine working, MBT stops being a niche testing technique and becomes a workflow safety net. For AI agents, that is the level that matters: not whether one prompt looked good, but whether the entire sequence stayed valid under pressure.

Frequently Asked Questions

What is model-based testing for AI agents?
Model-based testing treats the agent workflow as a state machine with legal actions and invariants. Instead of writing one prompt-response test at a time, you define the workflow contract and let the test engine generate many action sequences automatically.
How is MBT different from prompt testing or evals?
Prompt tests and evals usually score a single input-output interaction. MBT focuses on sequence correctness: tool ordering, retries, memory resets, and final-answer safety across multiple steps.
Why use Hypothesis for stateful agent testing?
Hypothesis provides RuleBasedStateMachine, preconditions, invariants, and shrinking. That combination is useful for agent workflows because it can discover a failing sequence and then reduce it to the shortest reproducible case.
What should I model first in a tool-using agent?
Start with the workflow boundary, not the whole stack. Model states such as idle, goal_received, tool_pending, and completed, then assert that a final answer cannot occur while a tool result is still pending.
