AI Engineering

Chaos Engineering for Autonomous Agent Swarms [2026]

Dillip Chowdary
Tech Entrepreneur & Innovator · May 10, 2026 · 8 min read

Bottom Line

Treat every shared tool and handoff path in your swarm as a failure domain, then break those paths deliberately behind a proxy. If the swarm cannot degrade gracefully under latency, timeout, and hard-down faults, it is not production-ready.

Key Takeaways

  • Proxy every shared tool call first; otherwise your chaos test only validates the happy path
  • Use latency, timeout, and hard-down faults before exotic scenarios
  • Track bounded retries, fallback mode, and final-agent completion as pass/fail signals
  • OpenAI Agents SDK tracing is enabled by default, so compare clean vs degraded runs visually
  • Scrub traces before sharing them outside engineering if prompts may contain sensitive data

Autonomous agent swarms fail differently from single-model apps. A slow shared tool can ripple across handoffs, retries can multiply cost, and one bad dependency can stall the whole workflow. Chaos engineering gives you a controlled way to break those paths before production does. In this tutorial, you will wire a small Python swarm to a fault proxy, inject deterministic failures, and verify that the system degrades instead of cascading.

Bottom Line

The fastest way to harden an agent swarm is to force every critical tool call through a controllable proxy, then prove the swarm still finishes with bounded retries and explicit fallback behavior.

Prerequisites

  • Python 3.11+ and an OpenAI API key.
  • The OpenAI Agents SDK installed with pip install openai-agents, per the official quickstart.
  • Docker available to run ghcr.io/shopify/toxiproxy:2.12.0, the current published container image from Shopify's registry.
  • OpenTelemetry Python packages installed with pip install opentelemetry-api opentelemetry-sdk.
  • A willingness to define one steady-state contract before testing anything: completion, latency budget, and acceptable fallback behavior.

If you plan to paste traces into tickets or incident docs, scrub secrets and customer identifiers first with TechBytes' Data Masking Tool.

Build a controllable swarm

Step 1: Put every shared tool behind a proxy

The first mistake teams make is attacking the model instead of the dependency graph. Start by forcing a shared tool through Toxiproxy. The project exposes a JSON HTTP API on port 8474, supports toxics like latency and timeout, and can disable a proxy by setting enabled to false. Shopify's docs also note that when you use Toxiproxy from the host, you should run the container with --net=host.

# terminal 1: tiny HTTP tool your agents depend on
python tool_server.py

# terminal 2: fault proxy
docker run --rm --net=host ghcr.io/shopify/toxiproxy:2.12.0

# terminal 3: create a proxy from :8666 to the real service on :9000
curl -sS -X POST http://127.0.0.1:8474/populate \
  -H "Content-Type: application/json" \
  -d '[{"name":"policy_api","listen":"127.0.0.1:8666","upstream":"127.0.0.1:9000"}]'

# tool_server.py
from http.server import BaseHTTPRequestHandler, HTTPServer
import json
import time

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/policy":
            self.send_response(404)
            self.end_headers()
            return

        # simulate ~50 ms of upstream work
        time.sleep(0.05)
        body = json.dumps({
            "policy": "approve-low-risk",
            "ttl_seconds": 30
        }).encode()

        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("127.0.0.1", 9000), Handler).serve_forever()

This setup matters because you never want agents talking directly to a shared service during a chaos run. If traffic can bypass the proxy, your experiment is invalid.
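
A quick preflight check keeps the experiment honest. The sketch below is a minimal example under the assumptions of Step 1: it asks the Toxiproxy API for the policy_api proxy and fails fast if the tool URL your agents use does not point at the proxy's listen port.

# preflight.py: minimal proxy sanity check (assumes the names and ports from Step 1)
import json
import sys
from urllib.request import urlopen

TOOL_URL = "http://127.0.0.1:8666/policy"                   # what the agents call
TOXIPROXY_API = "http://127.0.0.1:8474/proxies/policy_api"  # Toxiproxy admin API

# the proxy must exist and be enabled before any run
with urlopen(TOXIPROXY_API, timeout=2) as resp:
    proxy = json.loads(resp.read().decode())

# the tool URL must target the proxy's listen address, never the upstream;
# otherwise every injected fault is silently bypassed
listen_port = proxy["listen"].rsplit(":", 1)[1]
if f":{listen_port}/" not in TOOL_URL:
    sys.exit(f"tool URL {TOOL_URL} bypasses proxy listen address {proxy['listen']}")

print(f"ok: proxy {proxy['name']} enabled={proxy['enabled']} listen={proxy['listen']}")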

Step 2: Build the swarm around handoffs and graceful fallback

The OpenAI Agents SDK supports handoffs, and its docs describe them as tools exposed to the model. That makes them a natural choke point for swarm testing: a planner can delegate, call a tool, and still finish the run when the tool path degrades. The code below adds a proxied tool, a fallback mode, and a visible completion signal with Runner.run and result.last_agent.name.

# swarm.py
import asyncio
import json
from urllib.request import urlopen

from agents import Agent, Runner, function_tool
from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace_provider = TracerProvider()
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(trace_provider)
tracer = trace.get_tracer("swarm.chaos")

metric_reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))
meter = metrics.get_meter("swarm.chaos")
degraded_counter = meter.create_counter("swarm.policy.degraded", unit="1")

LAST_POLICY_MODE = "unknown"

@function_tool
def fetch_policy() -> str:
    global LAST_POLICY_MODE
    with tracer.start_as_current_span("fetch_policy") as span:
        try:
            with urlopen("http://127.0.0.1:8666/policy", timeout=1.2) as resp:
                body = json.loads(resp.read().decode())
                LAST_POLICY_MODE = "live"
                span.set_attribute("policy.mode", "live")
                return json.dumps({"mode": "live", **body})
        except Exception as exc:
            LAST_POLICY_MODE = "fallback"
            span.set_attribute("policy.mode", "fallback")
            span.record_exception(exc)
            degraded_counter.add(1, {"tool": "policy_api"})
            return json.dumps({
                "mode": "fallback",
                "policy": "manual-review",
                "ttl_seconds": 5,
                "reason": "policy_api_unavailable"
            })

planner = Agent(
    name="Planner",
    model="gpt-4.1",
    handoff_description="Creates the plan and checks policy before risky actions.",
    instructions=(
        "Break work into 2 or 3 actions. Always call fetch_policy before "
        "proposing risky automation. If policy mode is fallback, reduce scope "
        "and say the swarm degraded gracefully."
    ),
    tools=[fetch_policy],
)

verifier = Agent(
    name="Verifier",
    model="gpt-4.1",
    handoff_description="Final safety and completion check.",
    instructions="Confirm the answer is bounded, explicit about fallback, and safe to ship."
)

triage = Agent(
    name="Triage",
    instructions="Route planning work to Planner and final review to Verifier.",
    handoffs=[planner, verifier],
)

async def main():
    result = await Runner.run(
        triage,
        "Plan a maintenance action for 200 agents and summarize the safe path."
    )
    print(f"answered_by={result.last_agent.name}")
    print(f"policy_mode={LAST_POLICY_MODE}")
    print(result.final_output)

if __name__ == "__main__":
    asyncio.run(main())

Pro tip: OpenAI Agents SDK tracing is enabled by default, so you already have run-level visibility for generations, tool calls, and handoffs. Use the SDK trace view for workflow shape, and use OpenTelemetry for your own counters and SLO signals.
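
To make clean and degraded runs easy to compare side by side in the trace viewer, you can also group each run under a named workflow trace. A small sketch, assuming the trace context manager exported by the Agents SDK and a run_label string you choose per experiment:

# label_runs.py: name each chaos run so traces are easy to compare (a sketch)
import asyncio

from agents import Runner, trace

from swarm import triage  # the triage agent defined above

async def labeled_run(run_label: str) -> None:
    # every generation, tool call, and handoff inside this block
    # is grouped under one named workflow trace
    with trace(f"chaos::{run_label}"):
        result = await Runner.run(
            triage,
            "Plan a maintenance action for 200 agents and summarize the safe path."
        )
        print(run_label, result.last_agent.name)

if __name__ == "__main__":
    asyncio.run(labeled_run("baseline"))  # rerun as "latency", "timeout", "hard-down"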

Instrument and inject faults

Step 3: Define a steady-state contract before the blast radius changes

Chaos engineering is not random failure theater. Decide what a successful degraded run looks like, then test only against that contract. For this swarm, a solid first contract is:

  1. The run still completes with a final answer from the swarm.
  2. The policy tool either returns live data or an explicit fallback payload.
  3. Retries are bounded; no loop should keep calling the same dead tool indefinitely.
  4. The response tells the operator when degraded mode changed the recommendation.

Those checks are more useful than a raw success rate because they validate control, not just survival.
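
Making the contract executable keeps it from rotting in a doc. Here is a minimal sketch that reuses the swarm module from Step 2; checks 1, 2, and 4 map directly onto the run result, while check 3 would need a call counter inside fetch_policy that the Step 2 code does not include:

# contract_check.py: steady-state contract as assertions (a sketch)
import asyncio

from agents import Runner

import swarm  # reuses triage and LAST_POLICY_MODE from Step 2

async def check_contract() -> None:
    result = await Runner.run(
        swarm.triage,
        "Plan a maintenance action for 200 agents and summarize the safe path."
    )

    # 1. the run completes with a final answer
    assert result.final_output, "no final answer produced"

    # 2. the policy tool resolved to an explicit mode, live or fallback
    assert swarm.LAST_POLICY_MODE in {"live", "fallback"}, "policy mode never set"

    # 3. bounded retries would be asserted here with a call counter (not shown)

    # 4. degraded mode must be visible to the operator in the output itself
    if swarm.LAST_POLICY_MODE == "fallback":
        assert "degraded" in result.final_output.lower(), "fallback hidden from operator"

    print("contract holds, policy_mode =", swarm.LAST_POLICY_MODE)

asyncio.run(check_contract())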

Step 4: Inject latency, timeout, and hard-down faults

Toxiproxy supports JSON endpoints for creating toxics on a proxy and a POST /reset endpoint to clear them. Start with one fault at a time so you can attribute behavior cleanly.

# baseline
python swarm.py

# add 1200ms downstream latency with jitter
curl -sS -X POST http://127.0.0.1:8474/proxies/policy_api/toxics \
  -H "Content-Type: application/json" \
  -d '{"name":"latency_downstream","type":"latency","stream":"downstream","attributes":{"latency":1200,"jitter":200}}'
python swarm.py

# replace with a timeout fault
curl -sS -X POST http://127.0.0.1:8474/reset
curl -sS -X POST http://127.0.0.1:8474/proxies/policy_api/toxics \
  -H "Content-Type: application/json" \
  -d '{"name":"blackhole_downstream","type":"timeout","stream":"downstream","attributes":{"timeout":0}}'
python swarm.py

# hard-down the dependency
curl -sS -X POST http://127.0.0.1:8474/proxies/policy_api \
  -H "Content-Type: application/json" \
  -d '{"enabled":false}'
python swarm.py

# clean slate between experiments
curl -sS -X POST http://127.0.0.1:8474/reset

Latency tells you whether handoffs compound slowness. Timeout tells you whether the swarm waits forever. Hard-down tells you whether your fallback path is real or only theoretical.

Watch out: If your agents retry inside the tool layer and again inside the orchestration layer, one dead service can become a retry storm. Cap the retry budget in one place and emit a metric every time degraded mode triggers.
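
One way to enforce that single budget is a small wrapper at the tool layer, so the orchestration layer never retries on top of it. A sketch, where MAX_ATTEMPTS and the counter name are illustrative choices, not SDK settings:

# bounded_fetch.py: one capped retry budget at the tool layer (a sketch)
import json
import time
from urllib.request import urlopen

from opentelemetry import metrics

meter = metrics.get_meter("swarm.chaos")
retry_counter = meter.create_counter("swarm.policy.retries", unit="1")

MAX_ATTEMPTS = 3  # the single retry budget; orchestration must not add its own

def fetch_policy_bounded() -> dict:
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            with urlopen("http://127.0.0.1:8666/policy", timeout=1.2) as resp:
                return {"mode": "live", **json.loads(resp.read().decode())}
        except Exception:
            retry_counter.add(1, {"tool": "policy_api", "attempt": str(attempt)})
            time.sleep(0.2 * attempt)  # small linear backoff between attempts

    # budget exhausted: return the explicit fallback instead of retrying forever
    return {"mode": "fallback", "policy": "manual-review", "reason": "retry_budget_exhausted"}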

Verification and expected output

Run the baseline first, then compare each faulted run against the same contract. In the OpenAI trace viewer, you should see the same high-level workflow with different tool outcomes, not a completely different control path.

  • Baseline run: policy_mode=live and the final plan uses the live policy.
  • Latency run: completion still succeeds, but the tool span is visibly slower.
  • Timeout or hard-down run: policy_mode=fallback and the answer becomes more conservative.
  • Any run that loops, hangs, or hides degraded mode is a failed experiment.

# representative console output
answered_by=Planner
policy_mode=fallback
The swarm degraded gracefully because the policy service was unavailable.
Proceed with a reduced-scope maintenance window, require manual review,
and avoid autonomous rollout to all 200 agents in one batch.

That output is what you want: the swarm still finishes, narrows scope, and tells the operator exactly why.

Troubleshooting

  1. The proxy exists, but nothing changes during faults. Your app is probably bypassing the proxy. Confirm every agent tool call points at 127.0.0.1:8666, not the upstream service on 127.0.0.1:9000.
  2. The Toxiproxy container starts, but the host app cannot reach it. Recheck your networking mode. Shopify's docs explicitly call out --net=host for host-based usage; if your environment cannot use host networking, move both the app and proxy into the same container network.
  3. The swarm completes, but hides the failure. That is still a bug. Fallback must be observable in output, metrics, or traces; otherwise on-call engineers cannot distinguish graceful degradation from silent correctness loss.

What's next

  • Add one more proxied dependency, such as retrieval or memory storage, and repeat the same experiments independently.
  • Turn your steady-state contract into CI assertions so every merge exercises at least one degraded path; see the sketch after this list.
  • Schedule a monthly game day where you combine one network fault with one orchestration fault, such as a delayed handoff or a verifier failure.
  • Once the workflow stabilizes, format your experiment snippets and runbooks with a consistent house style using TechBytes' Code Formatter.
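
For the CI idea above, one degraded path can run as an ordinary test. A sketch, assuming Toxiproxy and the tool server from Steps 1 and 2 are available in the CI environment:

# test_degraded_path.py: CI check for one degraded path (a sketch)
import asyncio
import json
from urllib.request import Request, urlopen

from agents import Runner

import swarm

def _add_timeout_toxic():
    # black-hole the policy dependency for the duration of this test
    req = Request(
        "http://127.0.0.1:8474/proxies/policy_api/toxics",
        data=json.dumps({"name": "blackhole", "type": "timeout",
                         "stream": "downstream",
                         "attributes": {"timeout": 0}}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urlopen(req).close()

def _reset():
    # POST /reset re-enables all proxies and removes all toxics
    urlopen(Request("http://127.0.0.1:8474/reset", data=b"")).close()

def test_swarm_degrades_gracefully():
    _add_timeout_toxic()
    try:
        result = asyncio.run(Runner.run(
            swarm.triage,
            "Plan a maintenance action for 200 agents and summarize the safe path."
        ))
        assert swarm.LAST_POLICY_MODE == "fallback"
        assert "degraded" in result.final_output.lower()
    finally:
        _reset()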

The core pattern scales: proxy the dependency, define the contract, inject one fault, and require the swarm to stay bounded. If you do only that, your agents will already be more production-ready than most so-called autonomous systems shipping today.

Frequently Asked Questions

How is chaos engineering for agent swarms different from normal microservice chaos testing?
Agent swarms add two failure amplifiers: handoffs and model-driven retries. A slow or dead tool does not just fail one request; it can distort routing, inflate token spend, and create hidden loops across multiple agents.
What should I break first in an autonomous agent swarm?
Start with the shared dependencies every agent touches: retrieval, policy checks, memory stores, and outbound APIs. Test latency, timeout, and hard-down faults first because they reveal most control-plane bugs quickly.
Do I need to put every tool behind Toxiproxy?
Not every tool on day one, but every critical shared tool should eventually be proxied in test or staging. If a dependency can change the swarm's plan, block execution, or trigger retries, it belongs in the fault harness.
How do I know an agent swarm passed a chaos experiment?
A pass means the swarm still finishes, stays within a bounded retry budget, and makes degraded mode explicit in output or telemetry. If it hangs, loops, silently changes correctness, or bypasses the proxy, the experiment failed.
