Multi-Agent Design | Anthropic's Publicly Disclosed 15x Token Method

Many engineers feel hesitant to adopt multi-agent architectures due to cost concerns. Anthropic has published a detailed breakdown of the design behind their internal research system — a setup that uses 15x more tokens yet achieves a 90% performance gain. This article summarizes the full picture of that decision.

Conclusion

The internal research system Anthropic published on June 12, 2026 uses an architecture where a lead agent analyzes a question and launches multiple sub-agents in parallel. Each sub-agent operates within its own independent context, preventing interference, and long-horizon tasks that a single agent cannot complete can be realistically finished through parallel decomposition.

The decision to accept 15x token usage in exchange for a 90% performance improvement rests on one question: how many hours of human work can a single task run replace? Parallelization costs are well justified for long-running research and multi-perspective analysis, but the official documentation also makes clear that applying this approach to simple Q&A tasks sharply reduces cost-effectiveness.

For production deployment, the three key design factors are long-term state management, error recovery, and observability. The difference between a practical system and an experimental prototype lies in designing sub-agents so that individual failures do not halt the whole, and building a logging system that lets humans trace what ran and what failed.


What Is Anthropic's Internal "Multi-Agent Research System"?

On June 12, 2026, Anthropic published an engineering blog post titled "Designing a Multi-Agent Research System" (source: https://www.anthropic.com/engineering/multi-agent-research-system). This is the first instance of the company systematically sharing design insights from a system it actually runs internally, and it has generated significant interest in the engineering community.

The problem this system set out to solve is straightforward: automatically generating answers to complex research questions at a level of accuracy that a single agent cannot achieve. For example, the system was designed to automatically produce researcher-grade responses to questions like "compare the performance characteristics of competing models across all dimensions, cross-referencing papers and experimental results."

The Fundamental Difference Between Single-Agent and Multi-Agent Systems

The limitations of a single agent come down to three points. First, when handling more information than fits in a single context window, important context gets lost. Second, the inability to conduct "parallel investigation" — exploring multiple angles simultaneously — means everything must be examined sequentially. Third, the longer a task runs, the harder it becomes to recover from mid-task errors, and the cost of starting over from scratch grows.

Multi-agent architectures structurally overcome all three barriers. The information problem is solved by splitting context across agents, each handling their own portion. The parallelization problem is solved by launching multiple agents simultaneously. And the reliability problem for long-horizon tasks is solved by narrowing each agent's scope of responsibility so that individual failures do not cascade into the whole.

The Connection to Claude Code v2.1.172

Two days before the blog post, on June 10, 2026, Claude Code v2.1.172 shipped with support for five levels of sub-agent nesting (source: https://github.com/anthropics/claude-code/releases/tag/v2.1.172). Read against that backdrop, the blog post looks like part of a broader move: Anthropic is opening up the design patterns it uses internally to developers through Claude Code. The main value of the publication is that it shares not just how to use the tools, but the reasoning behind why they are designed the way they are.

The Core of the Lead Agent + Parallel Sub-Agent Architecture

The architecture Anthropic's system adopts is a two-tier structure: a lead agent plus parallel sub-agents. Tracing how this works in practice reveals the following flow:

  1. The lead agent receives the input research question and analyzes "what needs to be investigated, from which angles, and in what order"
  2. Based on this analysis, multiple sub-agents are launched in parallel (each agent handles an independent task)
  3. Each sub-agent operates within its own context without interfering with the others
  4. Results from all sub-agents are aggregated by the lead agent and synthesized into the final answer

The key point in this flow is that the three stages (analysis, execution, and integration) are clearly separated. Because each stage functions independently, a failure in part of the execution stage (one or more sub-agents) does not take down analysis or integration: the lead agent can still integrate whatever results did come back.
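The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `run_subagent`, `analyze`, `integrate`, and `lead_agent` are hypothetical names, and the sub-agent is a stub where a real system would call a model.

```python
import asyncio


async def run_subagent(subtask: str) -> str:
    """Hypothetical stand-in for a sub-agent; a real one would call a model."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"findings for: {subtask}"


def analyze(question: str) -> list[str]:
    """Analysis stage: split the question into independent sub-tasks."""
    angles = ["benchmarks", "papers", "pricing"]  # illustrative angles only
    return [f"{question} ({angle})" for angle in angles]


def integrate(question: str, findings: list[str]) -> str:
    """Integration stage: merge sub-agent findings into one answer."""
    body = "\n".join(f"- {finding}" for finding in findings)
    return f"Answer to '{question}':\n{body}"


async def lead_agent(question: str) -> str:
    subtasks = analyze(question)
    # Execution stage: each sub-agent receives only its own sub-task.
    findings = await asyncio.gather(*(run_subagent(s) for s in subtasks))
    return integrate(question, list(findings))


report = asyncio.run(lead_agent("compare models"))
print(report)
```

Each sub-agent sees only the string passed to it, which is the code-level equivalent of an independent context.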

The Advantages of Operating with Independent Contexts

The greatest benefit of each sub-agent having its own independent context is that they do not interfere with one another. Even if one agent produces a long investigation result, it does not compress the context available to other agents. Since each agent only needs to be aware of its own assigned scope, output quality remains stable even as tasks grow longer.

Additionally, if some sub-agents fail, the others continue running unaffected. This "resilience to partial failure" is an essential property for running long-horizon tasks in production. A parallel architecture fundamentally eliminates the single-agent weakness where one failure halts everything.

A Minimal Architecture Concept Engineers Can Replicate

When trying the same design pattern in your own projects, it helps to think in terms of the following three tiers:

  1. Task analysis tier (equivalent to the lead agent): An agent that receives input and decomposes it into sub-tasks. Functions as the coordinator that determines "what should be processed in parallel"
  2. Execution tier (equivalent to sub-agents): Multiple agents that independently execute each sub-task. Non-interference is maintained because they operate without knowledge of each other's results
  3. Integration tier (the downstream portion of the lead agent): Takes results from each execution and assembles the final output. This is also where decisions are made about how to handle failed sub-tasks

Since Claude Code v2.1.172, the sub-agent nesting functionality needed to actually build this three-tier structure has been in place. Multi-agent architectures that were previously only theoretical are now at a stage where developers can try them out immediately.

Why the Decision to Pay 15x More Tokens for 90% Better Performance Was Rational

Anthropic's official blog explicitly states that their system uses 15x more tokens than a single-agent approach while achieving 90% better performance on research tasks (source: https://www.anthropic.com/engineering/multi-agent-research-system). Understanding why this "expensive choice" becomes rational requires changing the framework for calculating cost.

Thinking in "Time Replacement Cost," Not "API Cost"

The conventional approach calculates cost as "price per token × token count." But when using multi-agent systems for complex research tasks, the more useful comparisons are:

  • Cost of running multi-agents vs. the labor cost and time cost of a human doing the same task
  • The quality difference between multi-agent output and single-agent output

For instance, a single agent tasked with "a comprehensive comparative research report on the latest developments across 10 competitors" tends to produce lower-quality results. If a lead agent with parallel sub-agents can deliver that report at a researcher's level of quality within a few hours, instead of the one to two days a human researcher would need, then 15x the token cost is easily recouped. The question to ask is not "how many times more tokens?" but "how many hours of human work does this replace?"
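The comparison can be made concrete with a small calculation. The sketch below assumes made-up figures (a 2 USD single-agent baseline and a 60 USD hourly rate); only the 15x multiplier comes from the article, and all function names are hypothetical.

```python
def multi_agent_cost(baseline_cost: float, token_multiplier: float = 15.0) -> float:
    """Cost of one multi-agent run, given the cost of the single-agent baseline."""
    return baseline_cost * token_multiplier


def human_cost(hours: float, hourly_rate: float) -> float:
    """Labor cost of a person doing the same task by hand."""
    return hours * hourly_rate


def break_even_hours(run_cost: float, hourly_rate: float) -> float:
    """Hours of human work a run must replace before it pays for itself."""
    return run_cost / hourly_rate


# All figures below are made-up inputs for illustration.
baseline = 2.00   # USD for one single-agent run (assumption)
rate = 60.00      # USD per researcher hour (assumption)

run_cost = multi_agent_cost(baseline)
print("multi-agent run:", run_cost)
print("researcher, 12 hours:", human_cost(12, rate))
print("break-even hours:", break_even_hours(run_cost, rate))
```

With these inputs the run pays for itself once it replaces half an hour of researcher time. The same arithmetic shows why a task that saves only a few minutes does not justify the multiplier.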

The Conditions Under Which Parallelization "Pays Off"

Based on the insights Anthropic has published, multi-agent architectures deliver favorable cost-effectiveness for tasks with the following characteristics:

  1. Long-horizon, multi-step tasks: Tasks that only become meaningful by integrating multiple investigations rather than answering a single question — ones where a single run can replace hours to days of human work
  2. Tasks requiring verification from multiple perspectives: Designs where multiple agents verify a single answer from different angles are especially powerful. Cross-verification reduces the blind spots that tend to occur with a single agent
  3. Tasks involving parallel processing of large volumes of data: Scenarios like having 10 agents read and classify 100 documents in parallel. These complete significantly faster than sequential processing
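The third case maps directly onto a worker pool. In this sketch, `classify` is a hypothetical placeholder for a sub-agent that reads one document, and the 100 documents are synthetic strings generated for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def classify(doc: str) -> str:
    """Hypothetical classifier; stands in for one sub-agent reading one document."""
    return "long" if len(doc.split()) > 5 else "short"


# 100 synthetic documents of varying length (illustration only).
documents = [f"document {i} " + "word " * (i % 10) for i in range(100)]

# Ten workers process the documents concurrently; map() keeps input order.
with ThreadPoolExecutor(max_workers=10) as pool:
    labels = list(pool.map(classify, documents))

print(len(labels), labels.count("long"), labels.count("short"))
```

Threads suit this pattern when each worker mostly waits on network calls, which is the typical situation when the real work happens in a model API.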

Identifying Tasks That Are Not Worth It

Conversely, applying a multi-agent architecture to the following types of tasks tends to result in poor cost-effectiveness: simple Q&A, summarization tasks with small context that a single agent can handle adequately, and cases where real-time requirements make the overhead of launching parallel agents unacceptable.

Anthropic is explicit about this point, rejecting the notion that "every task should be made multi-agent." Design should start from the question: "Does this task truly require parallelization?" Judging up front whether a task is complex and large enough to justify the cost is how you avoid unnecessary expense.

Practical Implementation Patterns for Long-Term State Management, Error Recovery, and Observability

For a research system to go beyond being a mere prototype and into production operation, three design challenges must be overcome. Anthropic's blog explicitly states that "the biggest challenges in the production environment were error recovery and observability."

Long-Term State Management: Designing for When Tasks Stall Midway

In a system where multiple sub-agents run in parallel over extended periods, a mechanism for always knowing "which agent has finished what" is indispensable. Simply waiting for all agents to complete is not enough; three strategies for handling failures need to be designed in advance:

  1. Proceed with partial success: A strategy of continuing integration processing using successful results even when some sub-agents fail. Suited for non-critical failures
  2. Retry: A strategy of restarting only the failed sub-agents without re-running those that already succeeded. Suited for tasks where idempotency can be guaranteed
  3. Skip: A strategy of generating the final output without the failed portion when a failed sub-task is of low importance. Suited for supplementary information-gathering tasks

Which strategy to choose depends on the nature of the task, but without a state management mechanism that always knows "which agent is currently in what state," none of these strategies can be executed. State management cannot be addressed retroactively — it must be built in from the initial design phase.
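One way to make the three strategies executable is to track an explicit state per sub-task. The following is a minimal sketch: `State`, `SubTask`, `run_round`, `finalize`, and the flaky worker are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class State(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    SKIPPED = "skipped"


@dataclass
class SubTask:
    name: str
    critical: bool = True
    state: State = State.PENDING
    attempts: int = 0
    result: Optional[str] = None


def run_round(tasks: list[SubTask], worker: Callable[[str], str]) -> None:
    """Run every task that has not succeeded yet; succeeded ones are left alone."""
    for task in tasks:
        if task.state is State.SUCCEEDED:
            continue
        task.attempts += 1
        try:
            task.result = worker(task.name)
            task.state = State.SUCCEEDED
        except Exception:
            task.state = State.FAILED


def finalize(tasks: list[SubTask]) -> list[str]:
    """Skip non-critical failures and return the results usable for integration."""
    for task in tasks:
        if task.state is State.FAILED and not task.critical:
            task.state = State.SKIPPED
    return [t.result for t in tasks if t.state is State.SUCCEEDED]


calls: dict[str, int] = {}


def flaky_worker(name: str) -> str:
    """Simulated sub-agent: 'pricing' fails once, 'trivia' always fails."""
    calls[name] = calls.get(name, 0) + 1
    if name == "trivia" or (name == "pricing" and calls[name] == 1):
        raise RuntimeError(f"{name} failed")
    return f"{name}: ok"


tasks = [SubTask("benchmarks"), SubTask("pricing"), SubTask("trivia", critical=False)]
run_round(tasks, flaky_worker)  # first pass
run_round(tasks, flaky_worker)  # retry: only the failed tasks run again
usable = finalize(tasks)        # partial success: integrate what succeeded
print(usable, [t.state.value for t in tasks])
```

The second `run_round` call never touches `benchmarks`, which is the property the retry strategy depends on.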

Error Recovery: A Structure That Does Not Let One Failure Stop Everything

In a single-agent setup, "if the agent fails, start over from the beginning" is the common outcome. In a multi-agent setup, the foundational design is "even if 1 out of 10 fails, the results of the other 9 are still usable."

To achieve this, the execution of each sub-agent must be kept completely independent from the others, with no failure allowed to cascade. Concretely, the recommended design avoids direct writes to shared state and treats each agent's results as independent objects. If one agent corrupts its internal state, the data held by other agents is unaffected.
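A sketch of this isolation, assuming Python's `asyncio`: `asyncio.gather` with `return_exceptions=True` collects exceptions as values instead of propagating the first one, and each outcome is wrapped in its own immutable object. `AgentResult`, `subagent`, and `run_isolated` are hypothetical names, and the failure of agent 3 is simulated.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentResult:
    """One sub-agent's outcome, held as an independent immutable object."""
    agent_id: int
    output: Optional[str] = None
    error: Optional[str] = None


async def subagent(agent_id: int) -> str:
    await asyncio.sleep(0)
    if agent_id == 3:  # simulated failure of a single agent
        raise ValueError("simulated failure")
    return f"result from agent {agent_id}"


async def run_isolated(n: int) -> list[AgentResult]:
    # return_exceptions=True keeps one failure from cancelling the batch.
    raw = await asyncio.gather(
        *(subagent(i) for i in range(n)), return_exceptions=True
    )
    results = []
    for agent_id, item in enumerate(raw):
        if isinstance(item, Exception):
            results.append(AgentResult(agent_id, error=repr(item)))
        else:
            results.append(AgentResult(agent_id, output=item))
    return results


results = asyncio.run(run_isolated(10))
usable = [r for r in results if r.error is None]
print(f"{len(usable)} of {len(results)} results usable")
```

No agent writes into a shared structure here; the only shared step is the final list assembly, which happens after all agents have finished.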

Observability: Can You Quickly Identify What Went Wrong?

The most critical thing in a production environment is that humans can monitor in real time "what this system is currently doing, and which agents are failing." Long-horizon tasks without observability take too long to diagnose when problems occur.

The minimum log design elements to have in place are the following four:

  1. Timestamps for sub-agent launch, completion, and failure
  2. A summary of each agent's input and output (a digest rather than the full text is sufficient)
  3. Retry counts and error details (ideally categorized by error type)
  4. The number of inputs into the final integration step (how many out of the total completed successfully)

Having all of these in place dramatically speeds up root cause identification when problems occur. Patterns such as sub-agents failing at high rates for specific types of input also become visible through log aggregation.
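The four elements fit naturally into structured log records. The sketch below uses only the standard library; the field names (`ts`, `event`, `retries`, `error_type`, `inputs_used`, `inputs_total`) and the helpers `digest` and `log_event` are assumptions made for this example, not a prescribed schema.

```python
import hashlib
import json
from collections import Counter
from datetime import datetime, timezone


def digest(text: str, length: int = 40) -> str:
    """Short preview plus a hash, so the full text need not be logged."""
    short_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
    return f"{text[:length]} [{short_hash}]"


def log_event(log: list, agent: str, event: str, **fields) -> dict:
    """Append one timestamped record and emit it as a JSON line."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "event": event,
        **fields,
    }
    log.append(record)
    print(json.dumps(record))
    return record


log: list = []
log_event(log, "sub-1", "launched", input=digest("Compare latency of model A and B"))
log_event(log, "sub-1", "completed", output=digest("Model A is faster"), retries=0)
log_event(log, "sub-2", "launched", input=digest("Collect pricing data"))
log_event(log, "sub-2", "failed", retries=2, error_type="TimeoutError")

# Element 4: how many of the launched agents feed the integration step.
launched = sum(1 for r in log if r["event"] == "launched")
completed = sum(1 for r in log if r["event"] == "completed")
log_event(log, "lead", "integration", inputs_used=completed, inputs_total=launched)

# Aggregating failures by type surfaces recurring patterns.
failures_by_type = Counter(r["error_type"] for r in log if r["event"] == "failed")
print(dict(failures_by_type))
```

Because every record is a flat JSON object, the same lines can be fed into whatever log aggregation tool is already in use.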

Early Validation with a Small Evaluation Set — How to Eliminate Failures Before Production

A common trap in multi-agent system development is the approach of "build it at scale, run it, then fix problems." Among the design insights Anthropic has published, the methodology of "early validation with a small evaluation set" is especially valuable in practice.

What Is an Evaluation Set, and What Goes in It?

An evaluation set is a collection of sample tasks representative of what is likely to be input in production, paired with a "standard for expected output" for each task. It functions as the material for repeatedly validating whether the system works correctly before going to production.

A good evaluation set contains three elements. First, representative task samples that cover typical use cases (5 to 20 samples as a guideline). Second, exceptional edge cases such as extremely long inputs, mixed-language content, or incomplete data. Third, criteria for what constitutes "success" in the returned output. Rather than requiring exact matches, it is more practical to design these as quality checkpoints.
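As a rough illustration, an evaluation case can be modeled as a prompt plus a set of named quality checkpoints. `EvalCase`, `score`, the prompts, and the checkpoint rules below are all hypothetical; real checkpoints would be specific to the task.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    # Quality checkpoints: label -> predicate over the system's output.
    checks: dict[str, Callable[[str], bool]]
    edge_case: bool = False


def score(case: EvalCase, output: str) -> dict[str, bool]:
    """Evaluate every checkpoint instead of demanding an exact match."""
    return {label: check(output) for label, check in case.checks.items()}


eval_set = [
    EvalCase(
        name="typical-comparison",
        prompt="Compare model A and model B on latency.",
        checks={
            "mentions both": lambda o: "model A" in o and "model B" in o,
            "cites a source": lambda o: "http" in o,
        },
    ),
    EvalCase(
        name="incomplete-data",
        prompt="Compare model A and <missing>.",
        checks={"flags the gap": lambda o: "missing" in o.lower()},
        edge_case=True,
    ),
]

sample_output = "model A is faster than model B (see http://example.com/bench)."
print(score(eval_set[0], sample_output))
```

Keeping checkpoints as small named predicates makes it obvious which expectation an output missed, which is harder to see with a single pass/fail score.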

The Minimal Validation Steps to Prove "This System Works" Before Production

The flow for conducting early validation with an evaluation set before production looks like this:

  1. Run the system on 5 to 10 samples from the evaluation set and review the outputs
  2. List quality issues in the output (missing information, errors, structural mismatches with expectations)
  3. Identify the root cause of each issue (is it the lead agent's analysis accuracy, a specific sub-agent, or the integration tier?)
  4. After fixing, re-run the same evaluation set to confirm the issues are resolved
  5. Once edge case samples also pass without issues, make the decision to go to production

By iterating through this validation cycle at small scale, you can get ahead of the problems users would encounter in production. According to Anthropic's published insights, validating with an evaluation set of around 10 samples before production was enough to cover the vast majority of actual production issues (source: https://www.anthropic.com/engineering/multi-agent-research-system).
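The run, fix, and re-run cycle can be sketched as a function that reports which checkpoints failed per case. Everything here is hypothetical: `system_v1` and `system_v2` are toy stand-ins for the system before and after a fix, and the cases are invented for the example.

```python
def run_eval(system, cases):
    """Return {case name: failed checkpoint labels}; an empty list means pass."""
    failures = {}
    for name, prompt, checks in cases:
        output = system(prompt)
        failures[name] = [label for label, check in checks if not check(output)]
    return failures


def ready_for_production(failures) -> bool:
    """Go only when every case, edge cases included, has no failed checkpoint."""
    return all(not failed for failed in failures.values())


cases = [
    ("typical", "latency of model A", [
        ("has answer", lambda o: "model A" in o),
        ("has source", lambda o: "source:" in o),
    ]),
    ("edge-empty", "", [
        ("handles empty input", lambda o: "no question" in o),
    ]),
]


def system_v1(prompt: str) -> str:
    """Toy system before the fix: no sources, no empty-input handling."""
    return f"Summary for {prompt}"


def system_v2(prompt: str) -> str:
    """Toy system after the fix."""
    if not prompt.strip():
        return "no question provided"
    return f"Summary for {prompt} (source: internal notes)"


before = run_eval(system_v1, cases)
after = run_eval(system_v2, cases)
print(before, ready_for_production(before))
print(after, ready_for_production(after))
```

Re-running the identical case list after each fix is what turns the evaluation set into a regression check rather than a one-off review.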

What an Evaluation Loop Looks Like in Claude Code

Since Claude Code v2.1.172, five levels of sub-agent nesting have been available (source: https://github.com/anthropics/claude-code/releases/tag/v2.1.172). When building an evaluation loop using this feature, giving the lead agent the role of "run each task in the evaluation set in parallel and check the output against the success criteria" enables simultaneous evaluation of multiple tasks.

Additionally, Claude Code v2.1.173, released on June 11, 2026, includes improvements to sub-agent stability (source: https://github.com/anthropics/claude-code/releases/tag/v2.1.173), increasing the reliability of error recovery in long-horizon tasks. Building evaluation loops on this version or later yields a more stable validation environment. With the design patterns and the tools Anthropic uses internally now both available, the barrier to trying a multi-agent architecture is lower than it has been.


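Clauder Navi Editorial Team
@clauder_navi

Daily coverage of the latest developments and practical know-how for engineers in Japan, centered on Anthropic's Claude and Claude Code. For our editorial policy, see the About page.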