When one agent isn't enough — and the hazards of adding more.
A multi-agent system is a collection of agents that coordinate to solve a task that a single agent could not, or could not cleanly. Think specialists, not clones.
In plain English. A multi-agent system is the difference between a generalist intern and a small startup with founders, an engineer, a designer, and an ops person. Each role does less, but the team does more.
flowchart TB
subgraph A1[Single agent]
SA[Agent]
end
subgraph A2[Orchestrator + workers]
O[Orchestrator] --> W1[SQL]
O --> W2[Code]
O --> W3[Search]
end
subgraph A3[Hierarchical]
P[Planner] --> M1[Manager A] --> E1[Executor]
P --> M2[Manager B] --> E2[Executor]
end
subgraph A4[Critic / debate]
G[Generator] --> C[Critic]
C --> G
C --> J[Judge]
end
subgraph A5[Swarm / blackboard]
B[(Shared blackboard)]
Ag1[Agent] <--> B
Ag2[Agent] <--> B
Ag3[Agent] <--> B
end
In production, you mostly see the single-agent and orchestrator + workers patterns. The other patterns are real, but they pay off only on hard or open-ended tasks.
The most important thing about multi-agent systems is knowing when not to build one. We'll start there.
Every additional agent adds latency, token cost, new failure modes, and another surface to debug and evaluate.
Anthropic's rule, from Building Effective Agents, deserves to be tattooed on every AI engineer:
Use the simplest thing that works.
If a single agent with clear tools can do the job — even if the prompt is long — prefer that. Graduate to multi-agent only when there's a real reason.
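Before reaching for more agents, it's worth seeing how little a single agent with tools actually requires. A minimal sketch, where `call_llm` is a hypothetical stand-in for your model client (stubbed here so the loop is runnable), and the tool registry is the only structure involved:

```python
# Minimal single-agent tool loop. `call_llm` is a stand-in for a real
# model client; here it's stubbed so the control flow is runnable.

def get_time(_: str) -> str:
    """Example tool: return a fixed timestamp (stub for illustration)."""
    return "2026-01-01T00:00:00Z"

TOOLS = {"get_time": get_time}

def call_llm(messages):
    # Stub: a real implementation calls your model provider here.
    # We pretend the model asks for the tool once, then answers.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_time", "args": ""}
    return {"answer": "It is 2026-01-01T00:00:00Z."}

def run_agent(user_msg: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        reply = call_llm(messages)
        if "answer" in reply:                         # model is done
            return reply["answer"]
        result = TOOLS[reply["tool"]](reply["args"])  # run requested tool
        messages.append({"role": "tool", "content": result})
    return "gave up"
```

The whole "agent" is one loop and one dict of tools. That's the baseline any multi-agent design has to beat.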
When splitting does help, it tends to take one of five recurring shapes:
flowchart TB
subgraph pat1[Pattern 1: Router]
R1[Router] --> S1[Specialist A]
R1 --> S2[Specialist B]
R1 --> S3[Specialist C]
end
subgraph pat2[Pattern 2: Orchestrator-Worker]
O[Orchestrator] --> W1[Worker 1]
O --> W2[Worker 2]
O --> W3[Worker 3]
W1 --> O
W2 --> O
W3 --> O
end
subgraph pat3[Pattern 3: Planner-Executor-Reviewer]
P[Planner] --> E[Executor]
E --> V[Reviewer / Critic]
V -->|retry| E
V -->|ok| OUT[Done]
end
subgraph pat4[Pattern 4: Debate]
A1[Agent A] --- A2[Agent B]
A1 --> J[Judge]
A2 --> J
end
subgraph pat5[Pattern 5: Hierarchical]
H[Top planner] --> M1[Manager]
M1 --> EX1[Executor]
M1 --> EX2[Executor]
end
Pattern 1: Router. A thin LLM call selects which downstream agent / prompt / model / workflow to use. Everything else is vanilla. Most common pattern in production.
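A router can be sketched in a few lines. `classify` stands in for the thin LLM call (an assumption, not a real API); here it's a keyword stub so the shape is runnable:

```python
# Router sketch: one cheap classification picks a downstream specialist.
# The specialists themselves are stubs standing in for full agents.

SPECIALISTS = {
    "sql": lambda q: f"[sql agent] {q}",
    "code": lambda q: f"[code agent] {q}",
    "search": lambda q: f"[search agent] {q}",
}

def classify(query: str) -> str:
    # In production this is a small, fast model returning one label.
    if "SELECT" in query.upper():
        return "sql"
    if "def " in query or "bug" in query:
        return "code"
    return "search"

def route(query: str) -> str:
    label = classify(query)
    return SPECIALISTS[label](query)
```

Note that the router itself carries no state and no tools; its only job is to emit a label.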
Pattern 2: Orchestrator-worker. An orchestrator agent decomposes a task into parallel subtasks, dispatches them to workers, and synthesizes the results. Anthropic's own research system uses this shape — one "lead" agent spawning search subagents.
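The decompose / fan-out / synthesize cycle can be sketched with a thread pool. `plan` and `worker` are illustrative stubs, not a real agent API; in production both would be LLM calls:

```python
# Orchestrator-worker sketch: decompose, fan out in parallel, synthesize.
from concurrent.futures import ThreadPoolExecutor

def plan(task: str) -> list[str]:
    # A real orchestrator uses an LLM call to decompose the task.
    return [f"{task}: subtopic {i}" for i in range(3)]

def worker(subtask: str) -> str:
    # Each worker is its own agent with its own tools and context.
    return f"findings for ({subtask})"

def orchestrate(task: str) -> str:
    subtasks = plan(task)
    with ThreadPoolExecutor() as pool:   # dispatch workers in parallel
        results = list(pool.map(worker, subtasks))
    return "\n".join(results)            # synthesize into one answer
```

The parallelism is the point: if the subtasks aren't independent, this pattern buys you nothing over a single agent.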
Pattern 3: Planner-executor-reviewer. One agent plans, one executes, one critiques, looping until the reviewer approves. Used in coding agents, writing pipelines, and scientific-style workflows.
Pattern 4: Debate. Two agents argue different sides; a third judges. Expensive but useful for high-stakes reasoning (constitutional AI research, hard math, adversarial evaluation).
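Structurally, debate is just two generators and a gate. A sketch with stub agents (each stands in for a separate LLM call; the length-based judge is purely illustrative):

```python
# Debate sketch: two agents produce opposing answers; a judge picks one.

def agent_a(question: str) -> str:
    return f"A argues yes to: {question}"

def agent_b(question: str) -> str:
    return f"B argues no to: {question}"

def judge(pro: str, con: str) -> str:
    # A real judge is a third LLM scoring both transcripts against
    # criteria; comparing lengths here just makes the stub deterministic.
    return pro if len(pro) >= len(con) else con

def debate(question: str) -> str:
    pro, con = agent_a(question), agent_b(question)
    return judge(pro, con)
```

Note the cost profile: every question pays for three full model calls before you see an answer, which is why this stays reserved for high-stakes work.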
Pattern 5: Hierarchical. Planner → managers → executors. Scales to long, tree-shaped tasks. Watch out for compounding latency.
Anthropic's public write-up of their research agent gives a clean shape:
flowchart TB
U[User asks complex question] --> L[Lead agent]
L --> P[Plan: decompose into subqueries]
P --> S1[Subagent: topic A]
P --> S2[Subagent: topic B]
P --> S3[Subagent: topic C]
S1 --> SUM[Summaries + citations]
S2 --> SUM
S3 --> SUM
SUM --> L
L --> W["Writer agent<br/>compose final answer"]
W --> C[Citation checker agent]
C --> ANS[Final answer with sources]
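The shape above can be sketched end to end: fan out subqueries, collect summaries with citations, compose, then gate on a citation check. Everything here is an illustrative stub (the function names and example.com URLs are assumptions, not Anthropic's implementation):

```python
# Research-agent sketch: subagents in parallel, then write, then check.
from concurrent.futures import ThreadPoolExecutor

def search_subagent(subquery: str) -> dict:
    # Stub: a real subagent searches and summarizes with its own tools.
    return {"summary": f"notes on {subquery}",
            "source": f"https://example.com/{subquery}"}

def writer(findings: list[dict]) -> str:
    # Stub for the writer agent: compose from summaries, keep sources.
    return "\n".join(f"{f['summary']} [{f['source']}]" for f in findings)

def citation_checker(answer: str) -> bool:
    # Stub: every line of the draft must carry a bracketed source.
    return all("[" in line and "]" in line for line in answer.splitlines())

def research(question: str, topics: list[str]) -> str:
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(search_subagent, topics))
    answer = writer(findings)
    assert citation_checker(answer), "draft missing citations"
    return answer
```

The citation checker as a separate, final gate is the detail worth copying: it fails loudly instead of shipping an unsourced answer.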
Key choices:
That architecture took Anthropic from 45% to 90%+ on their internal research benchmark.
Multi-agent systems need shared state: some substrate on which agents agree about what has happened so far. Options:
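One such option, the blackboard from the swarm diagram earlier, can be sketched in a few lines. This is a minimal in-process version (a real deployment would back it with a database or queue):

```python
# Blackboard sketch: a concurrency-safe store every agent reads and writes.
import threading

class Blackboard:
    def __init__(self):
        self._lock = threading.Lock()
        self._entries: list[dict] = []

    def post(self, agent: str, content: str) -> None:
        with self._lock:                 # serialize concurrent writers
            self._entries.append({"agent": agent, "content": content})

    def read(self) -> list[dict]:
        with self._lock:
            return list(self._entries)   # snapshot, not a live view
```

Agents never talk to each other directly; they only post to and read from the board, which keeps the coordination surface small and inspectable.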
For backend engineers, Temporal + LLMs is a particularly good fit: you get retries, timeouts, visibility, and replay for free, and the LLM call is just another activity.
flowchart LR
subgraph Temporal workflow
S[Start] --> A1["Activity: plan<br/>LLM call"]
A1 --> A2[Activity: execute step 1]
A2 --> A3[Activity: execute step 2]
A3 --> A4["Activity: review<br/>LLM call"]
A4 -->|fail| A2
A4 -->|ok| E[End]
end
A multi-agent system is a tree of LLM calls. Logs won't save you. Traces will.
Emit OpenTelemetry spans for every agent turn and tool call. Hosted options:
Minimum fields in each span: model, prompt hash, input tokens, output tokens, latency, cost, tool name (if applicable), error.
If you're a backend engineer, LangGraph and Temporal are the two I'd start with.
Research and demos showcase swarms of 100+ agents self-organizing to accomplish goals. As of 2026, these are mostly fascinating science projects; production versions remain rare outside narrow domains. The bottlenecks are evaluation, debugging, and cost. Watch the space; don't bet your roadmap on it yet.
Single agent + tools [default]
+ router [if many paths]
+ orchestrator + workers [if parallelizable]
+ planner / executor / reviewer [if quality-critical]
+ hierarchical [if genuinely tree-shaped]
Swarm [research]
Climb the ladder only as the task forces you to.