AI & Cloud Infrastructure

Multi-Agent Orchestration Patterns That Actually Ship

By Technspire Team, April 28, 2026

The 2024 multi-agent demos showed swarms of agents collaborating on impressive tasks. The 2026 production picture is more disciplined. The patterns that survive contact with real workloads are fewer in number, and each is structured around the specific failure modes it is designed to prevent. Four patterns matter most: planner-worker, supervisor-team, critic-reviser, and hierarchical. Each fits a specific shape of problem; each fails in characteristic ways when applied to the wrong shape.

Why Multi-Agent at All

The default reaction to a complex task is "use multiple agents." That instinct is usually wrong. A single well-prompted agent with a coherent tool set handles most tasks better than two agents trying to coordinate. Multi-agent earns its complexity in three specific situations:

  • Specialisation matters. Different agents need different system prompts, different model tiers, or different tool sets. A research agent reading documents has different needs from a writing agent composing the report.
  • Parallelism speeds up the task. Several independent subtasks run concurrently. One planner decomposes; many workers execute in parallel.
  • Quality requires adversarial review. A separate critic agent applies a different prompt and produces an evaluation the original cannot. Useful where the task has a clear quality bar.

Outside these cases, multi-agent adds latency, cost, and coordination failure modes for negligible benefit.

Pattern 1: Planner-Worker

A planner agent decomposes the task into independent subtasks. Worker agents execute the subtasks. A composer agent (often the planner returning) assembles the results. The shape that fits this pattern is one where the user request can be reduced to "do these N independent things, then combine."

// Planner-worker, conceptual
async function plannerWorker(userTask: string) {
  const plan = await planner.run({
    system: PLANNER_PROMPT,
    user: userTask,
  });
  // plan.subtasks: [{ id, description, requires }]

  const workerResults = await Promise.all(
    plan.subtasks.map(async (subtask) => {
      const result = await worker.run({
        system: WORKER_PROMPT,
        user: subtask.description,
        tools: subtask.requires,
      });
      return { id: subtask.id, output: result };
    }),
  );

  return await composer.run({
    system: COMPOSER_PROMPT,
    user: `Original task: ${userTask}\n\nSubtask results:\n${JSON.stringify(workerResults)}`,
  });
}

Strengths: clear cost accounting (you can see which subtask burned tokens), straightforward parallelism, easy to debug (each worker's input and output are isolated). Weaknesses: the planner's plan must be good. If the decomposition is wrong, every worker gets the wrong assignment. If the subtasks have hidden dependencies, parallel execution produces inconsistent state.

Pattern 2: Supervisor-Team

A supervisor agent receives the user task and decides which specialist to invoke. The specialists each have narrow expertise. The supervisor maintains the conversation, accepts results, and decides whether to ask another specialist, return the answer, or revise.

This pattern fits when the task is exploratory and the right specialist depends on intermediate findings. "Help me investigate this customer issue" might involve a logs specialist, a database specialist, and a documentation specialist, with the supervisor deciding which to call as the picture clarifies.

Strengths: handles open-ended tasks well, adapts to unexpected directions. Weaknesses: the supervisor is a bottleneck; the conversation is sequential. Communication overhead grows. The supervisor's prompt has to encode when to use each specialist, which becomes long and brittle.
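The supervisor loop can be sketched in the same conceptual style as the planner-worker code below. The `Agent` interface, `Decision` shape, and prompt strings here are hypothetical stand-ins, not a real framework API; the point is the shape of the loop: decide, route, record, repeat.

```typescript
// Supervisor-team, conceptual. Agent interfaces are illustrative stand-ins.
type Decision =
  | { action: "ask"; specialistId: string; question: string }
  | { action: "answer"; answer: string };

interface Agent<T> {
  run(input: { system: string; user: string }): Promise<T>;
}

async function supervisorTeam(
  userTask: string,
  supervisor: Agent<Decision>,
  specialists: Record<string, Agent<string>>,
  maxTurns = 8,
): Promise<string> {
  // The transcript is the supervisor's working memory: the original task
  // plus each specialist's reply, appended in order.
  const transcript: string[] = [`User task: ${userTask}`];

  for (let turn = 0; turn < maxTurns; turn++) {
    const decision = await supervisor.run({
      system: "SUPERVISOR_PROMPT",
      user: transcript.join("\n"),
    });

    if (decision.action === "answer") return decision.answer;

    // Route the question to the chosen specialist and record the result.
    const result = await specialists[decision.specialistId].run({
      system: `SPECIALIST:${decision.specialistId}`,
      user: decision.question,
    });
    transcript.push(`[${decision.specialistId}] ${result}`);
  }
  throw new Error("Supervisor exceeded turn budget without answering");
}
```

The turn budget is the important detail: an open-ended supervisor with no cap on specialist calls is the most common source of runaway cost in this pattern.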

Pattern 3: Critic-Reviser

An author agent produces a draft. A critic agent evaluates it against criteria the system prompt specifies. The author revises in response to the critique. This loops until the critic accepts or a maximum iteration count is reached.

// Critic-reviser, conceptual
async function criticReviser(task: string, maxIterations = 3) {
  let draft = await author.run({ system: AUTHOR_PROMPT, user: task });

  for (let i = 0; i < maxIterations; i++) {
    const critique = await critic.run({
      system: CRITIC_PROMPT,
      user: `Task: ${task}\n\nDraft:\n${draft}\n\nIs this acceptable? List specific issues if not.`,
    });
    // critique: { accept: boolean, issues: string }

    if (critique.accept) return draft;

    draft = await author.run({
      system: AUTHOR_PROMPT,
      user: `Original task: ${task}\n\nPrevious draft:\n${draft}\n\nCritic feedback:\n${critique.issues}\n\nProduce a revised version.`,
    });
  }

  return draft;
}

Strengths: improves quality on tasks where evaluation is easier than generation. Translation review, code review, content moderation, summary refinement. Weaknesses: cost doubles or triples per task. The critic can converge on its own preferences, making the system stylistically narrow over time.

Pattern 4: Hierarchical

A top-level agent acts as a manager. Below it sit team-level agents, each one a manager of its own specialists. The hierarchy can extend three or four levels in theory; in practice, two or three levels is the maximum that produces predictable behaviour.

This pattern fits genuinely large tasks: writing a book, conducting deep research over weeks, managing a long-running project. The cost is enormous (each level multiplies token use), and failure modes compound up the hierarchy. Most teams that try hierarchical end up flattening to two levels or moving to a different pattern.

Communication Protocols

Multi-agent systems need a communication protocol. Three options work in production:

  • Structured tool calls. The supervisor calls "ask_specialist(specialist_id, question)" as a tool. The framework routes the call and returns the result. Clean, typed, debuggable.
  • Message passing via queue. Agents emit messages onto a shared queue tagged with recipient. Useful for asynchronous workflows. Adds operational complexity.
  • Shared state object. Agents read and write fields on a shared blackboard. Powerful but produces invisible coupling; debugging gets hard.

The structured-tool-call pattern is the default that works for most production systems. The other two earn their place when the workflow shape genuinely requires them.
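A structured tool call amounts to a typed schema plus a framework-side handler. The schema below is an assumption, loosely following the JSON-Schema-style tool definitions most agent frameworks use; the specialist names are the examples from the supervisor-team section.

```typescript
// Structured tool-call routing, conceptual.
const askSpecialistTool = {
  name: "ask_specialist",
  description: "Route a question to a named specialist agent",
  parameters: {
    type: "object",
    properties: {
      specialist_id: { type: "string", enum: ["logs", "database", "docs"] },
      question: { type: "string" },
    },
    required: ["specialist_id", "question"],
  },
};

// Framework-side handler: typed, routed, and loggable per call.
async function handleAskSpecialist(
  args: { specialist_id: string; question: string },
  registry: Record<string, (q: string) => Promise<string>>,
): Promise<string> {
  const specialist = registry[args.specialist_id];
  if (!specialist) throw new Error(`Unknown specialist: ${args.specialist_id}`);
  return specialist(args.question);
}
```

Because every cross-agent exchange passes through one typed handler, you get validation, logging, and per-call cost attribution for free, which is exactly what the queue and blackboard options make harder.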

Identity Propagation

When agent A calls agent B, whose identity does B act under? If A acts on behalf of user U, B should also act on behalf of U with U's permissions, not under a shared service account that aggregates the union of all user permissions. Microsoft Entra Agent ID and similar identity systems handle this propagation through the OAuth on-behalf-of flow. Without identity propagation, multi-agent systems silently break least-privilege at scale.
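The shape of the propagation can be sketched as follows. This does not use a real OAuth library; `UserContext` and `exchangeOnBehalfOf` are hypothetical stand-ins for the token exchange an identity provider performs in the on-behalf-of flow.

```typescript
// Identity propagation, conceptual. Stand-in for an OAuth on-behalf-of
// token exchange; no real identity provider is called here.
interface UserContext {
  userId: string;
  scopes: string[];
  token: string;
}

async function exchangeOnBehalfOf(
  ctx: UserContext,
  audience: string,
): Promise<UserContext> {
  // A real implementation would call the identity provider here; this stub
  // just narrows the context to the downstream agent's audience.
  return { ...ctx, token: `obo:${ctx.userId}:${audience}` };
}

async function callAgentB(ctx: UserContext, task: string): Promise<string> {
  // Agent B receives the user's identity, not a shared service account,
  // so its tool calls are checked against the user's own permissions.
  const downstream = await exchangeOnBehalfOf(ctx, "agent-b");
  return `agent-b acting as ${downstream.userId} on: ${task}`;
}
```

The key property is that the downstream token carries the user's identity and only the downstream agent's audience, rather than an aggregated service-account credential.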

Cost Compounding

A user request that costs C tokens with a single agent costs roughly:

  • Planner-worker: 2C–5C (planner overhead plus parallel workers).
  • Supervisor-team: 3C–10C (sequential sub-conversations; supervisor accumulates context).
  • Critic-reviser: 2C–4C per iteration.
  • Hierarchical (2 levels): 5C–20C.
  • Hierarchical (3 levels): 20C–100C.

The implication is direct: the answer to "should we use multi-agent?" is rarely yes for cost-sensitive workloads. Justify the cost increase against measurable quality or capability gain.
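The arithmetic is simple enough to put in a budget check. The multipliers below are the ballpark ranges from the list above, not measured constants; the function names are illustrative.

```typescript
// Rough token-cost estimator using the multiplier ranges above.
const MULTIPLIERS: Record<string, [number, number]> = {
  "planner-worker": [2, 5],
  "supervisor-team": [3, 10],
  "critic-reviser-per-iteration": [2, 4],
  "hierarchical-2-levels": [5, 20],
  "hierarchical-3-levels": [20, 100],
};

// Returns [low, high] estimated tokens for a pattern, given the
// single-agent baseline C.
function estimateTokens(
  pattern: string,
  singleAgentTokens: number,
): [number, number] {
  const range = MULTIPLIERS[pattern];
  if (!range) throw new Error(`Unknown pattern: ${pattern}`);
  return [range[0] * singleAgentTokens, range[1] * singleAgentTokens];
}
```

For example, a task that costs 10,000 tokens single-agent lands somewhere between 30,000 and 100,000 tokens as supervisor-team: a gap wide enough that it should be priced before the architecture is chosen.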

Failure Modes

  • Cascading errors. An early agent's mistake propagates through every downstream step. Defence: validate intermediate outputs against expected shapes; fail fast.
  • Goal drift. Each agent reinterprets the task slightly. By the third or fourth hop, the system is solving something subtly different. Defence: pass the original user task through every level; do not paraphrase.
  • Coordination deadlock. Two agents waiting on each other in a circular dependency. Defence: timeouts on every cross-agent call; supervisor steps in.
  • Memory contamination. Persisted state from one task influences unrelated later tasks. Defence: scope memory to the current task; clear at task boundaries.
  • Confused supervisor. Long context plus many specialists confuses the supervisor about who to call next. Defence: keep the specialist roster small (under 10), keep the supervisor's context tight, summarise rather than accumulate.
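The first defence, validating intermediate outputs against expected shapes, can be as plain as a hand-rolled type guard on the planner's output before any worker runs. The `Subtask` shape matches the planner-worker sketch above; the specific checks are illustrative.

```typescript
// Fail-fast validation of a planner's output, conceptual.
interface Subtask {
  id: string;
  description: string;
  requires: string[];
}

function validatePlan(raw: unknown): Subtask[] {
  if (!Array.isArray(raw) || raw.length === 0) {
    throw new Error("Plan rejected: expected a non-empty subtask array");
  }
  return raw.map((s, i) => {
    if (
      typeof s?.id !== "string" ||
      typeof s?.description !== "string" ||
      !Array.isArray(s?.requires)
    ) {
      // Fail fast: rejecting one bad plan here is far cheaper than
      // handing a malformed assignment to N parallel workers.
      throw new Error(`Plan rejected: subtask ${i} has the wrong shape`);
    }
    return s as Subtask;
  });
}
```

Schema libraries do the same job with less boilerplate, but even this minimal guard turns a cascading error into a single, attributable failure at the plan boundary.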

When to Stay Single-Agent

The most common multi-agent mistake is reaching for the pattern when a single well-tooled agent would suffice. Signs that you should stay single-agent:

  • The task is short (under 10 steps).
  • All capabilities the task needs can be exposed as tools to one agent.
  • No genuine specialisation justifies separate prompts.
  • Cost-sensitivity is high.
  • Latency-sensitivity is high.

When Multi-Agent Earns Its Cost

  • The task naturally decomposes into independent subtasks (planner-worker).
  • Quality matters more than cost and a separate critique pass measurably improves output (critic-reviser).
  • The task is open-ended and the right specialist depends on intermediate findings (supervisor-team).
  • The task spans days or weeks and a manager-of-managers structure provides genuine coordination value (hierarchical).

The Recommendation

Default to single-agent. Reach for planner-worker when subtasks are genuinely independent. Reach for critic-reviser when quality matters and the critic can apply a clearer rubric than the author. Reach for supervisor-team when the task is exploratory. Reach for hierarchical only when scale absolutely demands it, and even then, flatten where possible. The pattern that ships is the one whose complexity is justified by a specific failure mode the simpler pattern would hit. Anything else is architecture as decoration.