Chaos Engineering for Autonomous Agent Swarms [2026]
Bottom Line
Treat every shared tool and handoff path in your swarm as a failure domain, then break those paths deliberately behind a proxy. If the swarm cannot degrade gracefully under latency, timeout, and hard-down faults, it is not production-ready.
Key Takeaways
- Proxy every shared tool call first; otherwise your chaos test only validates the happy path
- Use latency, timeout, and hard-down faults before exotic scenarios
- Track bounded retries, fallback mode, and final-agent completion as pass/fail signals
- OpenAI Agents SDK tracing is enabled by default, so compare clean and degraded runs visually
- Scrub traces before sharing them outside engineering if prompts may contain sensitive data
Autonomous agent swarms fail differently from single-model apps. A slow shared tool can ripple across handoffs, retries can multiply cost, and one bad dependency can stall the whole workflow. Chaos engineering gives you a controlled way to break those paths before production does. In this tutorial, you will wire a small Python swarm to a fault proxy, inject deterministic failures, and verify that the system degrades instead of cascading.
Prerequisites
- Python 3.11+ and an OpenAI API key.
- The OpenAI Agents SDK installed with pip install openai-agents, per the official quickstart.
- Docker available to run ghcr.io/shopify/toxiproxy:2.12.0, the current published container image from Shopify's registry.
- OpenTelemetry Python packages installed with pip install opentelemetry-api and pip install opentelemetry-sdk.
- A willingness to define one steady-state contract before testing anything: completion, latency budget, and acceptable fallback behavior.
If you plan to paste traces into tickets or incident docs, scrub secrets and customer identifiers first with TechBytes' Data Masking Tool.
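Before moving on, a quick self-check can catch missing pieces early. The sketch below is an optional convenience, not part of the tutorial's stack: it only verifies the interpreter version, the API key environment variable, and that the import names used later in this tutorial (`agents` for openai-agents, plus `opentelemetry`) resolve.

```python
# env_check.py -- sanity-check the prerequisites before any chaos run
import importlib.util
import os
import sys

def check_environment() -> list[str]:
    """Return a list of problems; an empty list means the basics look good."""
    problems = []
    if sys.version_info < (3, 11):
        problems.append(f"Python 3.11+ required, found {sys.version.split()[0]}")
    if not os.environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # Import names for openai-agents and the OpenTelemetry packages
    for module in ("agents", "opentelemetry"):
        if importlib.util.find_spec(module) is None:
            problems.append(f"missing package: {module}")
    return problems

if __name__ == "__main__":
    issues = check_environment()
    print("ok" if not issues else "\n".join(issues))
```

Run it once per machine; anything it prints other than "ok" is worth fixing before the first experiment.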
Build a controllable swarm
Step 1: Put every shared tool behind a proxy
The first mistake teams make is attacking the model instead of the dependency graph. Start by forcing a shared tool through Toxiproxy. The project exposes a JSON HTTP API on port 8474, supports toxics like latency and timeout, and can disable a proxy by setting enabled to false. Shopify's docs also note that when you use Toxiproxy from the host, you should run the container with --net=host.
# terminal 1: tiny HTTP tool your agents depend on
python tool_server.py
# terminal 2: fault proxy
docker run --rm --net=host ghcr.io/shopify/toxiproxy:2.12.0
# terminal 3: create a proxy from :8666 to the real service on :9000
curl -sS -X POST http://127.0.0.1:8474/populate \
-H "Content-Type: application/json" \
-d '[{"name":"policy_api","listen":"127.0.0.1:8666","upstream":"127.0.0.1:9000"}]'
# tool_server.py
from http.server import BaseHTTPRequestHandler, HTTPServer
import json
import time

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/policy":
            self.send_response(404)
            self.end_headers()
            return
        time.sleep(0.05)  # simulate a little real-world latency
        body = json.dumps({
            "policy": "approve-low-risk",
            "ttl_seconds": 30
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("127.0.0.1", 9000), Handler).serve_forever()
This setup matters because you never want agents talking directly to a shared service during a chaos run. If traffic can bypass the proxy, your experiment is invalid.
Step 2: Build the swarm around handoffs and graceful fallback
The OpenAI Agents SDK supports handoffs, and its docs describe them as tools exposed to the model. That makes them a natural choke point for swarm testing: a planner can delegate, call a tool, and still finish the run when the tool path degrades. The code below adds a proxied tool, a fallback mode, and a visible completion signal with Runner.run and result.last_agent.name.
# swarm.py
import asyncio
import json
from urllib.request import urlopen

from agents import Agent, Runner, function_tool
from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# OpenTelemetry setup: console exporters keep the example self-contained.
trace_provider = TracerProvider()
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(trace_provider)
tracer = trace.get_tracer("swarm.chaos")

metric_reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))
meter = metrics.get_meter("swarm.chaos")
degraded_counter = meter.create_counter("swarm.policy.degraded", unit="1")

LAST_POLICY_MODE = "unknown"

@function_tool
def fetch_policy() -> str:
    """Fetch the live policy through the Toxiproxy listener, or fall back."""
    global LAST_POLICY_MODE
    with tracer.start_as_current_span("fetch_policy") as span:
        try:
            with urlopen("http://127.0.0.1:8666/policy", timeout=1.2) as resp:
                body = json.loads(resp.read().decode())
            LAST_POLICY_MODE = "live"
            span.set_attribute("policy.mode", "live")
            return json.dumps({"mode": "live", **body})
        except Exception as exc:
            # Any failure on the proxied path degrades to an explicit payload.
            LAST_POLICY_MODE = "fallback"
            span.set_attribute("policy.mode", "fallback")
            span.record_exception(exc)
            degraded_counter.add(1, {"tool": "policy_api"})
            return json.dumps({
                "mode": "fallback",
                "policy": "manual-review",
                "ttl_seconds": 5,
                "reason": "policy_api_unavailable",
            })

planner = Agent(
    name="Planner",
    model="gpt-4.1",
    handoff_description="Creates the plan and checks policy before risky actions.",
    instructions=(
        "Break work into 2 or 3 actions. Always call fetch_policy before "
        "proposing risky automation. If policy mode is fallback, reduce scope "
        "and say the swarm degraded gracefully."
    ),
    tools=[fetch_policy],
)

verifier = Agent(
    name="Verifier",
    model="gpt-4.1",
    handoff_description="Final safety and completion check.",
    instructions="Confirm the answer is bounded, explicit about fallback, and safe to ship.",
)

triage = Agent(
    name="Triage",
    instructions="Route planning work to Planner and final review to Verifier.",
    handoffs=[planner, verifier],
)

async def main():
    result = await Runner.run(
        triage,
        "Plan a maintenance action for 200 agents and summarize the safe path."
    )
    print(f"answered_by={result.last_agent.name}")
    print(f"policy_mode={LAST_POLICY_MODE}")
    print(result.final_output)

if __name__ == "__main__":
    asyncio.run(main())
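Note that fetch_policy gives up after a single attempt. If you allow retries instead, bound them explicitly so a dead dependency can never trap an agent in a loop. The sketch below is SDK-independent and the names (`call_with_bounded_retry`, `fetch`, `max_attempts`) are illustrative, not part of any library.

```python
# bounded_retry.py -- never let an agent tool loop on a dead dependency
import time

def call_with_bounded_retry(fetch, max_attempts=3, backoff_s=0.2):
    """Try `fetch` at most `max_attempts` times, then signal fallback.

    Returns (value, attempts) on success, or (None, attempts) once the
    budget is spent -- the caller must then switch to its fallback payload.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(), attempt
        except Exception:
            if attempt == max_attempts:
                return None, attempt         # budget exhausted: fall back
            time.sleep(backoff_s * attempt)  # linear backoff between tries

if __name__ == "__main__":
    calls = {"n": 0}
    def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("policy_api down")
        return {"policy": "approve-low-risk"}
    value, attempts = call_with_bounded_retry(flaky)
    print(value, attempts)  # succeeds on the third attempt
```

Returning a sentinel instead of raising keeps the degraded path explicit, which is exactly what the chaos runs below will check for.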
Instrument and inject faults
Step 3: Define a steady-state contract before the blast radius changes
Chaos engineering is not random failure theater. Decide what a successful degraded run looks like, then test only against that contract. For this swarm, a solid first contract is:
- The run still completes with a final answer from the swarm.
- The policy tool either returns live data or an explicit fallback payload.
- Retries are bounded; no loop should keep calling the same dead tool indefinitely.
- The response tells the operator when degraded mode changed the recommendation.
Those checks are more useful than a raw success rate because they validate control, not just survival.
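The contract above is easiest to enforce if it lives as code. A minimal sketch, assuming a per-run record with illustrative field names (`completed`, `policy_mode`, `tool_attempts`, `final_output`) that you would populate from the swarm's console output or traces:

```python
# contract.py -- the steady-state contract from the bullets above as code
def contract_violations(run: dict) -> list[str]:
    """Check one swarm run against the degraded-run contract.

    `run` is an illustrative record: completed (bool), policy_mode
    ("live" or "fallback"), tool_attempts (int), final_output (str).
    """
    violations = []
    if not run.get("completed"):
        violations.append("run did not produce a final answer")
    if run.get("policy_mode") not in ("live", "fallback"):
        violations.append("tool returned neither live data nor an explicit fallback")
    if run.get("tool_attempts", 0) > 3:
        violations.append("retries were not bounded")
    if (run.get("policy_mode") == "fallback"
            and "degraded" not in run.get("final_output", "").lower()):
        violations.append("degraded mode was not surfaced to the operator")
    return violations

if __name__ == "__main__":
    healthy = {"completed": True, "policy_mode": "live",
               "tool_attempts": 1, "final_output": "Plan approved."}
    print(contract_violations(healthy))  # -> []
```

An empty violation list is the pass signal; anything else names exactly which part of the contract broke, which is far more actionable than a raw success rate.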
Step 4: Inject latency, timeout, and hard-down faults
Toxiproxy supports JSON endpoints for creating toxics on a proxy and a POST /reset endpoint to clear them. Start with one fault at a time so you can attribute behavior cleanly.
# baseline
python swarm.py
# add 1200ms downstream latency with jitter
curl -sS -X POST http://127.0.0.1:8474/proxies/policy_api/toxics \
-H "Content-Type: application/json" \
-d '{"name":"latency_downstream","type":"latency","stream":"downstream","attributes":{"latency":1200,"jitter":200}}'
python swarm.py
# replace with a timeout fault
curl -sS -X POST http://127.0.0.1:8474/reset
curl -sS -X POST http://127.0.0.1:8474/proxies/policy_api/toxics \
-H "Content-Type: application/json" \
-d '{"name":"blackhole_downstream","type":"timeout","stream":"downstream","attributes":{"timeout":0}}'
python swarm.py
# hard-down the dependency
curl -sS -X POST http://127.0.0.1:8474/proxies/policy_api \
-H "Content-Type: application/json" \
-d '{"enabled":false}'
python swarm.py
# clean slate between experiments
curl -sS -X POST http://127.0.0.1:8474/reset
Latency tells you whether handoffs compound slowness. Timeout tells you whether the swarm waits forever. Hard-down tells you whether your fallback path is real or only theoretical.
Verification and expected output
Run the baseline first, then compare each faulted run against the same contract. In the OpenAI trace viewer, you should see the same high-level workflow with different tool outcomes, not a completely different control path.
- Baseline run: policy_mode=live and the final plan uses the live policy.
- Latency run: completion still succeeds, but the tool span is visibly slower.
- Timeout or hard-down run: policy_mode=fallback and the answer becomes more conservative.
- Any run that loops, hangs, or hides degraded mode is a failed experiment.
# representative console output
answered_by=Planner
policy_mode=fallback
The swarm degraded gracefully because the policy service was unavailable.
Proceed with a reduced-scope maintenance window, require manual review,
and avoid autonomous rollout to all 200 agents in one batch.
That output is what you want: the swarm still finishes, narrows scope, and tells the operator exactly why.
Troubleshooting
- The proxy exists, but nothing changes during faults. Your app is probably bypassing the proxy. Confirm every agent tool call points at 127.0.0.1:8666, not the upstream service on 127.0.0.1:9000.
- The Toxiproxy container starts, but the host app cannot reach it. Recheck your networking mode. Shopify's docs explicitly call out --net=host for host-based usage; if your environment cannot use host networking, move both the app and proxy into the same container network.
- The swarm completes, but hides the failure. That is still a bug. Fallback must be observable in output, metrics, or traces; otherwise on-call engineers cannot distinguish graceful degradation from silent correctness loss.
What's next
- Add one more proxied dependency, such as retrieval or memory storage, and repeat the same experiments independently.
- Turn your steady-state contract into CI assertions so every merge exercises at least one degraded path.
- Schedule a monthly game day where you combine one network fault with one orchestration fault, such as a delayed handoff or a verifier failure.
- Once the workflow stabilizes, format your experiment snippets and runbooks with a consistent house style using TechBytes' Code Formatter.
The core pattern scales: proxy the dependency, define the contract, inject one fault, and require the swarm to stay bounded. If you do only that, your agents will already be more production-ready than most so-called autonomous systems shipping today.
Frequently Asked Questions
How is chaos engineering for agent swarms different from normal microservice chaos testing?
Swarms add handoffs and model-driven retries on top of the usual network paths, so a single slow tool can ripple across agents and multiply cost. You inject the same fault types, but the pass/fail contract covers orchestration behavior, not just service uptime.

What should I break first in an autonomous agent swarm?
The shared tool dependencies behind the handoff paths. Start with latency, timeout, and hard-down faults on one proxied tool, one at a time, before combining faults or touching orchestration itself.

Do I need to put every tool behind Toxiproxy?
Every shared or critical tool, yes; if traffic can bypass the proxy, the experiment is invalid. Purely local, per-agent utilities can wait until the critical paths are covered.

How do I know an agent swarm passed a chaos experiment?
It met the steady-state contract: the run completed, the tool returned live data or an explicit fallback, retries stayed bounded, and the output told the operator that degraded mode changed the recommendation.