AI Engineering

Agentic Workflows: Build Multi-Agent AI Systems [2026]

Dillip Chowdary
Tech Entrepreneur & Innovator · May 15, 2026 · 10 min read

Bottom Line

Start with one orchestrator that owns the final answer, then add narrow specialists, explicit verification, and traces. Most multi-agent failures come from fuzzy ownership, overlapping tools, and weak evaluation loops.

Key Takeaways

  • Use agents as tools when one manager should own the final answer and policy.
  • Use handoffs when the selected specialist should take over the conversation.
  • Parallelize independent branches with asyncio.gather, then merge through one reviewer.
  • Redact prompts, traces, and test fixtures before production logging or human review.

As of May 15, 2026, the cleanest code-first path for agentic systems is to treat orchestration as an engineering problem, not a prompt-writing contest. Multi-agent apps work when each agent has a narrow contract, one component owns the final answer, and every run is observable. In this tutorial, you will build a practical multi-agent workflow with the OpenAI Agents SDK, then add verification, parallel branches, and production-minded guardrails.

  • agents as tools is the best default when one agent should control policy, output shape, and user experience.
  • handoffs are better when routing itself is part of the workflow and the specialist should take over.
  • Parallel branches speed up research and validation, but they still need one merge point.
  • Redaction is part of orchestration: sanitize logs and eval sets with the Data Masking Tool before production review.

Prerequisites

What you need

  • Python 3.10+ and a fresh virtual environment.
  • An OpenAI API key in OPENAI_API_KEY.
  • A task that naturally decomposes into roles such as research, drafting, checking, or routing.
  • A short eval set of 10 to 20 real tasks so you can compare workflow changes before shipping.

Bottom Line

Start with one orchestrator, add specialists only where prompts or tools are colliding, and make verification a first-class stage. Multi-agent systems fail more from unclear ownership than from model quality.

OpenAI's current guidance is consistent on one point: maximize a single agent's capabilities first, then split when prompt logic, tool overlap, or routing ambiguity starts hurting reliability. In practice, that means you should not create five agents just because the problem has five nouns in it.

Step 1: Pick an orchestrator

Choose the right collaboration pattern

  • Use agents as tools when one manager should keep control, combine outputs, and enforce shared rules.
  • Use handoffs when a triage agent should route to a specialist that owns the next reply.
  • Mix both patterns when the system needs coarse routing first, then bounded sub-tasks inside the specialist.

For most engineering workflows, start with a manager pattern. It is easier to test because one agent owns the final answer, and you can inspect whether the manager used the right tools in the right order.

mkdir agentic-demo
cd agentic-demo
python -m venv .venv
source .venv/bin/activate
pip install openai-agents
export OPENAI_API_KEY='sk-...'

The SDK quickstart uses pip install openai-agents, then runs agents through Runner.run(). That gives you a stable baseline before you add sessions, guardrails, or human review.
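
If you want to confirm the environment before wiring up specialists, a single-agent smoke test is enough. This is a minimal sketch, assuming OPENAI_API_KEY is already set; the agent name and prompt are placeholders, and Runner.run_sync is simply the synchronous counterpart of Runner.run.

from agents import Agent, Runner

# Placeholder agent used only to verify that the SDK and API key are wired up.
smoke_test = Agent(
    name='Smoke Test',
    instructions='Answer in one short sentence.'
)

result = Runner.run_sync(smoke_test, 'Confirm you can respond to requests.')
print(result.final_output)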

Step 2: Build specialists

Keep each agent painfully narrow

The easiest way to create a brittle system is to make every specialist half-overlap with every other specialist. Define roles by outcome, not by org chart. A researcher gathers facts. A writer produces a draft. A reviewer checks for gaps and risky claims.

import asyncio
from agents import Agent, Runner

researcher = Agent(
    name='Researcher',
    instructions='Collect facts, constraints, and unknowns. Return concise bullet points.'
)

writer = Agent(
    name='Writer',
    instructions='Turn the provided notes into a clear engineering brief with steps and caveats.'
)

reviewer = Agent(
    name='Reviewer',
    instructions='Check the draft for unsupported claims, missing steps, and operational risk.'
)

orchestrator = Agent(
    name='Orchestrator',
    instructions=(
        'Own the final answer. Break the task into research, drafting, and review. '
        'Use the specialist tools when needed, then return one cohesive response.'
    ),
    tools=[
        researcher.as_tool(
            tool_name='research_facts',
            tool_description='Collect facts and constraints for the task.'
        ),
        writer.as_tool(
            tool_name='draft_brief',
            tool_description='Convert notes into a draft.'
        ),
        reviewer.as_tool(
            tool_name='review_output',
            tool_description='Check the draft for gaps, unsupported claims, and risk.'
        ),
    ],
)

async def main():
    result = await Runner.run(
        orchestrator,
        'Design a release-note workflow for a Kubernetes operator project.'
    )
    print(result.final_output)

if __name__ == '__main__':
    asyncio.run(main())

This pattern matches the SDK's documented Agent.as_tool() workflow: the manager keeps the conversation, while specialists handle bounded sub-tasks. That is usually what you want in internal engineering systems, because output format and policy stay centralized.

When to switch to handoffs

If your triage layer should route the user directly to a domain specialist, use handoffs instead of tool calls.

from agents import Agent

billing_agent = Agent(name='Billing Agent', instructions='Handle billing issues only.')
security_agent = Agent(name='Security Agent', instructions='Handle security issues only.')

triage_agent = Agent(
    name='Triage Agent',
    instructions='Route each request to the right specialist.',
    handoffs=[billing_agent, security_agent],
)

That design is simpler than a manager when the chosen specialist should speak directly and there is little value in a central narrator.

Step 3: Add verification and parallelism

Parallelize only independent work

The orchestration guide explicitly calls out asyncio.gather for parallel agent execution. Use it for branches that do not depend on one another, then merge those branches through one final reviewer or manager.

import asyncio
from agents import Runner

# researcher, writer, and reviewer are the specialist agents defined in Step 2.

async def run_branch(agent, prompt):
    result = await Runner.run(agent, prompt)
    return result.final_output

async def build_plan(user_request):
    facts_task = run_branch(researcher, f'List implementation facts for: {user_request}')
    risks_task = run_branch(reviewer, f'List operational risks for: {user_request}')

    facts, risks = await asyncio.gather(facts_task, risks_task)

    final = await Runner.run(
        writer,
        f'Use these facts:\n{facts}\n\nUse these risks:\n{risks}\n\nWrite an implementation plan.'
    )
    return final.final_output
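
To exercise the function end to end, you can drive it the same way as the earlier example; the request string below is only a placeholder.

async def main():
    plan = await build_plan('Add blue-green deploys to a Kubernetes operator project.')
    print(plan)

if __name__ == '__main__':
    asyncio.run(main())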

Add a verifier before you trust the output

  • Check whether every required step was covered.
  • Check whether the output cites assumptions instead of inventing facts.
  • Check whether sensitive values leaked into prompts, traces, or examples.
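
One lightweight way to encode those checks is a dedicated verifier agent that returns a structured verdict the orchestrator can act on. The sketch below is an assumption built on the SDK's output_type option, not a built-in feature; VerificationReport and its fields are illustrative names.

from pydantic import BaseModel
from agents import Agent, Runner

class VerificationReport(BaseModel):
    covered_all_steps: bool
    unsupported_claims: list[str]
    possible_sensitive_values: list[str]

verifier = Agent(
    name='Verifier',
    instructions=(
        'Given a task and a draft answer, report whether every required step is covered, '
        'list claims that are not supported by the provided notes, and flag values that look sensitive.'
    ),
    output_type=VerificationReport,
)

async def verify(task, draft):
    # Returns a VerificationReport instance the orchestrator can branch on.
    result = await Runner.run(verifier, f'Task:\n{task}\n\nDraft:\n{draft}')
    return result.final_output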

This is also where your eval set matters. Run the same 10 to 20 tasks every time you change prompts, tools, or routing logic. If quality improves on one demo prompt but regresses on the rest, the workflow did not actually improve.
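
A fixed eval set does not need heavy tooling. A small script that replays the same prompts and records the outputs is enough to compare runs before and after a change; the task strings and file path below are placeholders.

import asyncio
import json
from agents import Runner

EVAL_TASKS = [
    'Design a release-note workflow for a Kubernetes operator project.',
    'Draft an incident-review checklist for a payments service.',
    # ...add the rest of your 10 to 20 real tasks here.
]

async def run_evals(agent, path='eval_results.json'):
    results = []
    for task in EVAL_TASKS:
        run = await Runner.run(agent, task)
        results.append({'task': task, 'output': run.final_output})
    with open(path, 'w') as f:
        json.dump(results, f, indent=2)
    return results

# asyncio.run(run_evals(orchestrator))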

Pro tip: Keep specialist outputs machine-friendly. Short bullets, structured fields, and explicit assumptions are easier for downstream agents to combine and verify.
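
If you want a specialist to emit structured fields instead of free text, the same output_type option shown above works here as well; ResearchNotes and its fields are illustrative, not part of the SDK.

from pydantic import BaseModel
from agents import Agent

class ResearchNotes(BaseModel):
    facts: list[str]
    constraints: list[str]
    assumptions: list[str]

# A structured variant of the earlier Researcher agent.
researcher = Agent(
    name='Researcher',
    instructions='Collect facts, constraints, and explicit assumptions for the task.',
    output_type=ResearchNotes,
)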

Verify and troubleshoot

Expected output

A healthy run should produce one final answer, plus evidence that the right specialists were used. In the SDK quickstart, you can also inspect which agent finished the run through result.last_agent.name.

async def main():
    result = await Runner.run(triage_agent, 'Review our refund workflow for security gaps')
    print(result.final_output)
    print(f'Answered by: {result.last_agent.name}')

asyncio.run(main())

  • The final output should read like one coherent answer, not a dump of tool-call fragments.
  • The trace should show a small number of deliberate steps, not repetitive looping.
  • If you used handoffs, the finishing agent should match the specialist you expected.

Troubleshooting: top 3 failures

  1. Overlapping specialists: If research, writing, and review all perform the same work, the manager will make noisy choices. Rewrite agent instructions so each tool has one obvious job.
  2. Unstable routing: If the wrong agent is selected, improve tool names and descriptions before adding more prompt text. Routing quality often improves more from cleaner tool contracts than from longer instructions.
  3. Sensitive data in traces: If logs contain customer data, redact inputs before runs and before human review. For quick cleanup of examples and fixtures, use the Data Masking Tool.
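
A redaction pass does not need an agent at all. A small pre-processing step that masks obvious secrets and contact details before anything reaches Runner.run is a reasonable first layer; the patterns below are illustrative, not exhaustive.

import re

REDACTION_PATTERNS = [
    (re.compile(r'sk-[A-Za-z0-9_-]{10,}'), '[REDACTED_API_KEY]'),
    (re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+'), '[REDACTED_EMAIL]'),
    (re.compile(r'\b\d{13,19}\b'), '[REDACTED_CARD_NUMBER]'),
]

def redact(text):
    # Mask obvious secrets and contact details before they enter prompts or traces.
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

# result = await Runner.run(orchestrator, redact(user_request))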

Watch out: Do not parallelize dependent tasks just because you can. If one branch needs the output of another, forced parallelism adds cost and usually lowers quality.

What's next

  • Add sessions or server-managed conversation state when the workflow spans multiple turns (see the sketch after this list).
  • Add guardrails or human review for high-risk actions such as code execution, deployments, or data access.
  • Promote your fixed task list into a lightweight eval harness so every prompt change is measurable.
  • Standardize generated snippets before publishing internal docs with the Code Formatter.
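
For the sessions item above, the SDK ships session support for carrying conversation state across runs. The sketch below assumes the SQLiteSession class from the agents package and reuses the orchestrator from Step 2; the session id and prompts are placeholders, so check the current sessions documentation before relying on the exact API.

import asyncio
from agents import Runner, SQLiteSession

async def main():
    # Persist conversation state across turns; the id below is a placeholder.
    session = SQLiteSession('release-notes-thread')

    first = await Runner.run(
        orchestrator,
        'Draft release notes for v1.4 of the operator.',
        session=session,
    )
    print(first.final_output)

    followup = await Runner.run(
        orchestrator,
        'Shorten that to five bullet points.',
        session=session,
    )
    print(followup.final_output)

if __name__ == '__main__':
    asyncio.run(main())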

The core lesson is simple: orchestration is a systems design problem. When ownership is explicit, specialists are narrow, and every run is observable, multi-agent AI stops feeling magical and starts behaving like software you can debug.

Frequently Asked Questions

When should I split one agent into multiple agents?
Split only after a single agent starts failing because prompt logic is too conditional, tool choices overlap, or evaluation results plateau. In other words, use multiple agents to reduce ambiguity, not to make the diagram look more advanced.
What is the difference between handoffs and agents as tools?
With agents as tools, a manager stays in control and calls specialists for bounded work. With handoffs, a triage agent transfers control so the specialist becomes the active agent for that part of the interaction.
How do I test a multi-agent workflow before production?
Start with a fixed eval set of real prompts and compare results after every routing or prompt change. Then inspect traces to confirm the system used the right specialists in the right order instead of just landing on a good answer by chance.
How do I keep agents from leaking sensitive data in logs or traces?
Redact secrets, personal data, and internal identifiers before they enter prompts, eval fixtures, or review queues. Treat traces as production artifacts, and sanitize them with policy-based filters or a redaction step such as a data masking pass.
