What Are Claude Managed Agents | Dreaming, Outcomes, and Feature Overview
This article covers the three features added to Anthropic's Managed Agents (Dreaming, Outcomes, and multi-agent orchestration), focusing on the criteria developers need when deciding whether and how to adopt them. It covers how to approach the research preview, rubric design principles, and the auditability of parallel processing.
Each of the three Managed Agents features is at a different stage of maturity. Dreaming is still in research preview, so for now the sensible approach is to avoid building production dependencies on it and to establish your evaluation cycle instead, so that you can integrate it quickly once it becomes generally available.
Outcomes works best when you define rubrics with 3–5 weighted axes. In test environments, task success rates improved by up to 10 percentage points. It can be applied immediately to RAG hallucination detection and QA regression testing, enabling low-cost automation of quality assurance loops.
Multi-agent orchestration boosts throughput through parallel execution and model specialization, and each sub-agent can be individually audited in the Console — making it an accessible design choice even in regulated industries. Note that the figures cited are from test environments; validating against your own team's use cases is essential.
Table of Contents
- What Is Dreaming — How Agents Learn from Past Sessions and Improve on Their Own
- Outcomes — Simply Describing Success Criteria with a Rubric Can Raise Task Success Rates by Up to 10 Points
- Rubric Design Principles — Graduated Evaluation Axes Improve Accuracy
- Application Scenarios for RAG Pipelines and QA
- Multi-Agent Orchestration — A Lead Agent Divides the Work, Specialists Execute in Parallel
- Auditability of Each Sub-Agent Process in the Console
- Practical Use Cases — Code Review Assistance, Incident Investigation, and Document Generation
- Code Review Assistance — Parallel Review by 3 Specialist Agents
- Incident Investigation — Parallel Analysis of Multiple Log Sources and Aggregated Report
- Document Generation — Continuous Improvement Loop Combining Dreaming and Outcomes
- Clarifying What You Can and Cannot Do Right Now — Dreaming Is in Research Preview
- Sources
What Is Dreaming — How Agents Learn from Past Sessions and Improve on Their Own
Dreaming is a feature that enables an agent to autonomously analyze its past session histories and memory stores, extract patterns, and improve its own performance. True to its name, it draws inspiration from the process by which humans consolidate learning during sleep by organizing memories.
A scheduled batch process runs periodically, reading through past agent session interactions and the contents of memory stores. It then statistically analyzes and curates patterns — determining which instructions led to success and which patterns tend to result in failure — and runs a long-term self-improvement cycle.
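As a rough illustration of the kind of pattern analysis described above, the sketch below tallies success rates per instruction pattern across past sessions. The record shape (a pattern label plus a success flag) and the function name are assumptions made for illustration; the source does not describe Dreaming's internal data model or algorithm.

```python
from collections import defaultdict

def success_rate_by_pattern(sessions):
    """Return {pattern: success_rate} from (pattern, succeeded) records."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for pattern, succeeded in sessions:
        totals[pattern] += 1
        if succeeded:
            wins[pattern] += 1
    return {pattern: wins[pattern] / totals[pattern] for pattern in totals}

# Hypothetical session history: (instruction pattern, did the task succeed?)
history = [
    ("explicit-output-format", True),
    ("explicit-output-format", True),
    ("vague-goal", False),
    ("vague-goal", True),
]
rates = success_rate_by_pattern(history)
```

Patterns with consistently high rates would be candidates to keep, and those with low rates candidates to revise.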
Developers can choose between "auto-update mode" and "update mode with a confirmation flow." The former applies patterns extracted by the agent directly, while the latter allows developers to review the diff before applying changes. Teams that prioritize production stability are advised to use the confirmation flow.
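The confirmation flow can be pictured as a diff review followed by an explicit approval step. The following is a minimal sketch using Python's standard `difflib`; the function names and the idea that the change is a prompt text are assumptions for illustration, not the Managed Agents API.

```python
import difflib

def render_prompt_diff(current: str, proposed: str) -> str:
    """Produce a unified diff a reviewer can read before approving."""
    diff = difflib.unified_diff(
        current.splitlines(),
        proposed.splitlines(),
        fromfile="current_prompt",
        tofile="proposed_prompt",
        lineterm="",
    )
    return "\n".join(diff)

def apply_if_approved(current: str, proposed: str, approved: bool) -> str:
    """Keep the current prompt unless the reviewer approved the change."""
    return proposed if approved else current

current_prompt = "Review the pull request.\nBe concise."
proposed_prompt = "Review the pull request.\nBe concise.\nFlag security issues first."
diff_text = render_prompt_diff(current_prompt, proposed_prompt)
```

In auto-update mode the approval step would be skipped, which is why the confirmation flow suits teams that prioritize stability.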
An important caveat: Dreaming is currently in the research preview stage (source: 9to5Mac — Anthropic Updates Claude Managed Agents With Three New Features). It is not a feature ready for production use, and its design principles and API specifications are subject to change. With this in mind, avoid building excessive architectural dependencies on it.
Automating the manual trial and error of prompt tuning is now a realistic prospect. No timeline for Dreaming's general availability has been announced, but understanding the concept now will speed up adoption decisions after release. The prudent approach is to design your evaluation cycle now so that you can integrate Dreaming smoothly once it becomes generally available.
Outcomes — Simply Describing Success Criteria with a Rubric Can Raise Task Success Rates by Up to 10 Points
Outcomes is a system where an independent grader evaluates the results produced by an agent and returns quantitative feedback. Developers simply describe "what success looks like" as a rubric, and the grader scores the final output independently from the agent's reasoning process.
In test environments among early adopters, agents using Outcomes showed task success rate improvements of up to 10 percentage points compared to standard prompts (source: 9to5Mac, cited above). However, these figures come from specific tasks in test environments and do not guarantee equivalent improvements for all use cases. How much improvement you can actually achieve for your team's use case depends heavily on the precision of your rubric design and the nature of the target task.
Rubric Design Principles — Graduated Evaluation Axes Improve Accuracy
A rubric is more effective when it scores several perspectives on a graduated scale than when it makes a single pass/fail judgment. For a "code review assistance" task, defining independent evaluation axes such as "does it flag security issues," "does it include performance improvement suggestions," and "does it comment on code style consistency" improves the grader's accuracy.
Weighting the evaluation axes is also recommended. Setting business-critical dimensions to higher weights makes the success rate figure more practically meaningful. A realistic approach is to start with around 3–5 axes in the initial design and refine them gradually while observing actual outputs.
Below is a sample rubric definition for a "code review assistance" task. Use it as a reference for the structure of describing axis names, criteria, and weights together.
{
  "task": "Code Review Assistance",
  "axes": [
    {
      "name": "Security",
      "description": "Does it flag issues with authentication, input validation, and key management?",
      "scale": "0–3",
      "weight": 0.4
    },
    {
      "name": "Performance",
      "description": "Does it flag obvious computational complexity and memory inefficiencies?",
      "scale": "0–3",
      "weight": 0.3
    },
    {
      "name": "Style Consistency",
      "description": "Does it comment on adherence to coding conventions and naming rules?",
      "scale": "0–3",
      "weight": 0.2
    },
    {
      "name": "Test Coverage",
      "description": "Does it flag missing tests for edge cases?",
      "scale": "0–3",
      "weight": 0.1
    }
  ],
  "pass_threshold": 2.0
}
In this example, "Security (weight 0.4)" is given the highest priority, and each axis uses a graduated 0–3 point scale. For code review assistance, this can also be combined with model specialization — assigning style checks to a lightweight model (such as Haiku) and security checks to a high-accuracy model (such as Opus).
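To make the role of the weights concrete, here is a minimal sketch of how per-axis scores could be combined into one number. The axis names and weights mirror the sample rubric; the scoring rule itself (a weighted average on the 0–3 scale compared against `pass_threshold`) is an assumption, since the source does not specify how the grader aggregates axes.

```python
# Axis names and weights mirror the sample rubric. The aggregation rule
# (weighted average compared with pass_threshold) is an assumption.
RUBRIC = {
    "axes": [
        {"name": "Security", "weight": 0.4},
        {"name": "Performance", "weight": 0.3},
        {"name": "Style Consistency", "weight": 0.2},
        {"name": "Test Coverage", "weight": 0.1},
    ],
    "pass_threshold": 2.0,
}

def weighted_score(rubric, axis_scores):
    """Weighted average of per-axis scores, each on the 0-3 scale."""
    total_weight = sum(axis["weight"] for axis in rubric["axes"])
    weighted_sum = sum(
        axis["weight"] * axis_scores[axis["name"]] for axis in rubric["axes"]
    )
    return weighted_sum / total_weight

def passes(rubric, axis_scores):
    """True when the weighted score reaches the rubric's threshold."""
    return weighted_score(rubric, axis_scores) >= rubric["pass_threshold"]

strong_review = {
    "Security": 3, "Performance": 2, "Style Consistency": 1, "Test Coverage": 2,
}
missed_security = {
    "Security": 0, "Performance": 3, "Style Consistency": 3, "Test Coverage": 3,
}
```

Under this assumed rule, a review that misses security entirely fails even with full marks on the other three axes, which is the effect the 0.4 weight is meant to have.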
Application Scenarios for RAG Pipelines and QA
In RAG (Retrieval-Augmented Generation) pipelines, embedding citation accuracy and answer grounding into the rubric enables automated hallucination detection. Simply setting evaluation axes such as "does the cited source exist" and "does the answer stay within the bounds of the cited source" is enough to build a quality assurance loop.
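The two axes mentioned above can be approximated locally to see what a grader would be checking. This is a deliberately crude sketch: the function names are hypothetical, and the word-overlap ratio is only a stand-in for grounding, not how the Outcomes grader evaluates answers.

```python
def citation_exists(cited_ids, retrieved_ids):
    """Axis 1: every cited source must be among the retrieved documents."""
    return bool(cited_ids) and all(cid in retrieved_ids for cid in cited_ids)

def _terms(text):
    """Lowercased words with trailing punctuation stripped."""
    return {word.strip(".,;:!?").lower() for word in text.split()} - {""}

def grounding_ratio(answer, source_text):
    """Axis 2 (crude proxy): share of answer terms that appear in the source."""
    answer_terms = _terms(answer)
    if not answer_terms:
        return 0.0
    return len(answer_terms & _terms(source_text)) / len(answer_terms)

source = "The cache expires after 60 seconds by default."
grounded = grounding_ratio("The cache expires after 60 seconds.", source)
ungrounded = grounding_ratio("The cache never expires.", source)
```

A lower ratio signals that the answer introduces terms the cited source does not contain, which is a cue for closer review rather than proof of hallucination.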
For QA use cases, defining expected output patterns for test inputs as a rubric makes it possible to build an evaluation loop similar to regression testing. Since Outcomes returns a quantitative score each time the agent is refined, quality regressions can be detected early. Because Outcomes can be used in production today, it offers a way to keep raising output quality while substantially reducing the cost of building evaluation infrastructure.
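A regression gate over those scores can be as simple as comparing averages between two agent versions. The sketch below assumes you have collected rubric scores for the same test inputs before and after a change; the function name and the default tolerance of 0.1 are illustrative choices, not values from the source.

```python
from statistics import mean

def detect_regression(baseline_scores, candidate_scores, tolerance=0.1):
    """Flag a regression when the mean score drops by more than `tolerance`."""
    return mean(candidate_scores) < mean(baseline_scores) - tolerance

# Hypothetical rubric scores on the 0-3 scale for the same three test inputs.
baseline = [2.4, 2.6, 2.5]
worse_candidate = [2.0, 2.1, 2.2]
similar_candidate = [2.5, 2.4, 2.5]
```

The tolerance absorbs ordinary run-to-run variation in scores; tune it against the spread you observe in your own runs.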
Multi-Agent Orchestration — A Lead Agent Divides the Work, Specialists Execute in Parallel
In multi-agent orchestration, a lead agent splits a received job into multiple subtasks and delegates each to specialist sub-agents that hold their own models, prompts, and tools. The sub-agents run in parallel on a shared file system, and once processing is complete, the lead agent aggregates the results.
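The fan-out and fan-in flow described above can be sketched with `asyncio`. Everything here is a local stand-in: `run_subagent` simulates a specialist with a short sleep, and the subtask names are hypothetical. It shows the control flow only, not the Managed Agents API.

```python
import asyncio

async def run_subagent(name: str, subtask: str) -> dict:
    """Stand-in for a specialist sub-agent; the sleep simulates model and tool calls."""
    await asyncio.sleep(0.01)
    return {"agent": name, "result": f"{name} finished: {subtask}"}

async def lead_agent(job: str) -> list:
    """Split the job, run the sub-agents concurrently, then collect the results."""
    subtasks = {
        "researcher": f"gather background for {job}",
        "analyst": f"analyze data for {job}",
        "writer": f"draft summary for {job}",
    }
    # gather() runs the coroutines concurrently and returns results in input order.
    results = await asyncio.gather(
        *(run_subagent(name, task) for name, task in subtasks.items())
    )
    return list(results)

report = asyncio.run(lead_agent("quarterly report"))
```

Because the three sub-agents run concurrently, the elapsed time is close to that of one simulated call rather than three.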
The practical advantage of this design is higher processing throughput. Compared with handling a complex task sequentially, dividing the work among specialists and running them in parallel can substantially reduce total wait time, because the elapsed time is bounded by the slowest subtask plus aggregation rather than by the sum of all subtasks. Batch jobs that once kept engineers waiting for long periods can approach interactive speeds.
It is also important that each sub-agent can select a model optimized for its assigned task. Optimizations such as assigning lightweight models to cost-efficiency-focused subtasks and high-performance models to accuracy-critical subtasks become achievable.
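One way to express this per-subtask model choice is a small routing table. The model identifiers below are placeholders and the subtask names are hypothetical; substitute whatever models and task names your setup actually uses.

```python
# Placeholder model identifiers, not real model names.
MODEL_FOR_PRIORITY = {
    "cost": "lightweight-model",
    "accuracy": "high-accuracy-model",
}

# Hypothetical subtasks mapped to what matters most for each.
SUBTASK_PRIORITY = {
    "style_check": "cost",
    "security_check": "accuracy",
}

def pick_model(subtask: str) -> str:
    """Choose a model for a subtask; unknown subtasks default to accuracy."""
    priority = SUBTASK_PRIORITY.get(subtask, "accuracy")
    return MODEL_FOR_PRIORITY[priority]
```

Defaulting unknown subtasks to the high-accuracy model trades cost for safety; a cost-sensitive team might choose the opposite default.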
Auditability of Each Sub-Agent Process in the Console
The Managed Agents Console allows each sub-agent's process to be monitored individually. The design makes it transparently clear which sub-agent is executing which step, what the intermediate outputs are, and at which stage an error occurred (source: 9to5Mac, cited above).
This auditability gives engineering teams a strong argument for adoption. Because multi-agent processes do not become a black box, diagnosing quality issues, debugging, and responding to audits become realistic undertakings. Particularly in regulated industries or environments requiring internal controls, process transparency can be the deciding factor in an adoption decision. It is one of the first points any team considering multi-agent orchestration should evaluate.
Practical Use Cases — Code Review Assistance, Incident Investigation, and Document Generation
Here are three concrete configuration examples for combining the above three features in a production environment. All are designed around multi-agent orchestration as the core, with Outcomes integrated for quality evaluation as the basic pattern.
Code Review Assistance — Parallel Review by 3 Specialist Agents
A lead agent receives a pull request diff and assigns it to three sub-agents: "security check," "performance check," and "style check." Each specialist has prompts and analysis tools tailored to its assigned perspective and runs its review in parallel. The lead agent aggregates the results and outputs them as review comments.
Combining this with Outcomes allows you to define in the rubric whether "HIGH-severity findings were not overlooked," enabling quantitative evaluation of review coverage. The larger the development team, the higher the risk of oversights in manual reviews. This architecture is especially effective for teams with broad codebases.
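The "HIGH-severity findings were not overlooked" criterion amounts to a recall measurement against a labeled set. The sketch below assumes each finding is a dict with `id` and `severity` keys and that you maintain a set of known issues for test pull requests; both are illustrative assumptions.

```python
def high_severity_recall(expected_findings, reported_findings):
    """Share of known HIGH-severity issues that the review actually reported."""
    expected_high = {f["id"] for f in expected_findings if f["severity"] == "HIGH"}
    if not expected_high:
        return 1.0  # nothing to miss
    reported_ids = {f["id"] for f in reported_findings}
    return len(expected_high & reported_ids) / len(expected_high)

# Hypothetical labeled issues for one test pull request.
expected = [
    {"id": "sql-injection", "severity": "HIGH"},
    {"id": "hardcoded-secret", "severity": "HIGH"},
    {"id": "naming-style", "severity": "LOW"},
]
reported = [{"id": "sql-injection", "severity": "HIGH"}]
recall = high_severity_recall(expected, reported)
```

A rubric axis could then require this recall to reach 1.0, so that any missed HIGH-severity finding lowers the score.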
Incident Investigation — Parallel Analysis of Multiple Log Sources and Aggregated Report
Three sources — application logs, infrastructure logs, and monitoring alerts — are received, and a dedicated sub-agent analyzes each in parallel. The lead agent generates an aggregated report containing "time of occurrence," "scope of impact," "estimated cause," and "recommended actions."
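The aggregation step can be sketched as a merge of per-source findings into the four report fields named above. The `Finding` structure and the merge rules (earliest timestamp, union of affected services) are assumptions for illustration; a real lead agent would reason over richer sub-agent output.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Finding:
    """Hypothetical summary returned by one log-analysis sub-agent."""
    source: str
    first_seen: datetime
    affected: set
    suspected_cause: str
    action: str

def aggregate(findings):
    """Merge per-source findings into the four fields of the incident report."""
    return {
        "time_of_occurrence": min(f.first_seen for f in findings),
        "scope_of_impact": sorted(set().union(*(f.affected for f in findings))),
        "estimated_cause": [f"{f.source}: {f.suspected_cause}" for f in findings],
        "recommended_actions": sorted({f.action for f in findings}),
    }

findings = [
    Finding("application", datetime(2024, 1, 1, 10, 5), {"checkout"},
            "timeout calling payments", "raise client timeout"),
    Finding("infrastructure", datetime(2024, 1, 1, 10, 2), {"payments"},
            "node memory pressure", "restart affected node"),
    Finding("monitoring", datetime(2024, 1, 1, 10, 6), {"checkout", "payments"},
            "error-rate alert fired", "restart affected node"),
]
incident_report = aggregate(findings)
```

Taking the earliest timestamp across sources matters because the first symptom often appears in a different log than the one that triggered the alert.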
Work that previously required several engineers to read through logs together by hand can be compressed substantially through parallel processing. Because it directly helps reduce MTTR (mean time to recovery), this use case delivers high value for on-call teams.
Document Generation — Continuous Improvement Loop Combining Dreaming and Outcomes
Dreaming learns quality patterns from past document generation sessions, and Outcomes is configured with a rubric whose evaluation axes are "clarity for the reader," "technical accuracy," and "comprehensiveness." Together they form a continuous improvement loop in which Dreaming refines the prompts and Outcomes provides feedback with each release.
However, as noted above, Dreaming is currently in the research preview stage. The realistic approach is to design this as an experimental combination in anticipation of the production transition, and to prepare a foundation that allows for smooth production deployment once Dreaming becomes generally available.
Clarifying What You Can and Cannot Do Right Now — Dreaming Is in Research Preview
Clearly distinguishing between features that can be used in production right now and those that are still in a preparatory stage is important for preventing design failures.
| Feature | Current Status | Production Use |
|---|---|---|
| Outcomes | Generally Available | ✅ Available |
| Multi-Agent Orchestration | Generally Available | ✅ Available |
| Dreaming | Research Preview | ❌ Not available for production |
Outcomes and multi-agent orchestration can be piloted on Managed Agents starting today. The safe approach is to start with a small-scale, single use case (e.g., part of code review assistance), validate rubric design precision and actual success rate improvements, and then expand to a broader scope once you have accumulated real-world data from production.
For Dreaming, while it is published as a research preview, integrating it into production systems is not recommended at this time (source: 9to5Mac, cited above). Since no timeline for production availability has been announced, it is appropriate to keep your involvement to understanding the concepts and mechanics without building architectural dependencies on it. That said, you can begin designing evaluation criteria (rubric structures) and multi-agent orchestration architectures now. Preparing a solid foundation so you can integrate Dreaming quickly once it is officially released will translate into a medium-to-long-term competitive advantage.
Note that the foundational architecture of Managed Agents is covered in detail at scaling-managed-agents. The three features discussed in this article are positioned as an application layer built on top of that architecture. For implementing the agent foundation, refer to claude-agent-sdk; for practical examples of running sub-agents in action, see claude-code-subagent.