Claude Managed Agents: A Deep Dive into the Three New Features


🎯 Key Takeaways

1. Dreaming: Agents That Improve While They "Sleep"

Problem: The memory store accumulates duplicate, stale, and contradictory entries, and the agent cannot detect this on its own
Solution: A scheduled asynchronous job reads up to 100 sessions plus the memory store and generates a new memory store
Three operations: Merge duplicates, replace stale entries, surface higher-level patterns
Impact: Harvey's legal AI saw a 6x improvement in task completion rate

2. Outcomes: An Independent Grader

Problem: Agents tend to believe they are finished, so self-assessed quality is not objective
Solution: An independent evaluator (grader) scores the output in its own context
Impact: +10% task success rate, +8.4% docx accuracy, +10.1% pptx accuracy

3. Multi-Agent: Lead-Sub Architecture

Architecture: Lead Agent + Specialist Sub-Agents
Features: Parallel work, shared filesystem, full traceability
Case: Moon landing simulation improved from 67% to 100%

I. Technical Background & Problem Definition

1.1 Where Claude Managed Agents Fits

Claude Managed Agents is a managed agent platform that Anthropic launched on April 8, 2026 to help developers build and deploy AI agents quickly. Its headline claim is "10x acceleration": modular agent templates, integrated memory storage, and an enhanced orchestration framework mean developers do not have to build the infrastructure themselves.

Dimension	Traditional	Managed Agents
Dev cycle	Weeks	Minutes
Infrastructure	Self-built	Managed service
Memory	Manual implementation	Memory Store

1.2 Three Core Pain Points

Pain Point 1: Cross-session Memory Decay

Any long-running agent eventually runs into context bloat:

The same project structure is relearned 50 times across different sessions
An API spec was updated a month ago, but the old version is still in memory
Implicit patterns that span projects are never consolidated

Pain Point 2: Unstable Output Quality

LLM agents have a built-in bias: they tend to believe they are finished. Self-assessment inside the same session amounts to "letting the defendant be the judge."

Pain Point 3: Complex Tasks Exceed Single-Agent Capability

When a task spans several knowledge domains or needs parallel processing, the limits of a single agent become obvious.

II. Dreaming: Offline Self-Improvement Mechanism

2.1 Core Principles

Dreaming is inspired by human REM sleep. During the day the brain takes in raw information and stores it as short-term memory. During REM sleep at night it replays the day's experiences, strengthens valuable connections, discards useless information, and consolidates the rest into long-term memory.

Dreaming surfaces patterns that a single agent can't see on its own, including recurring mistakes, workflows that agents converge on, and preferences shared across a team.


2.2 Workflow

Trigger: A schedule or a session-count threshold
Read: The existing Memory Store plus up to 100 past sessions
Execute:
- Merge duplicates
- Replace stale entries
- Surface higher-level patterns
Output: A new Memory Store (the original store is not modified)
Review: Developer review or automatic apply
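
To make the first two operations in the workflow above concrete, here is a minimal local sketch of merging duplicates and replacing stale entries. It is not how Dreaming is implemented: the `MemoryEntry` shape and the `consolidate` helper are hypothetical, and pattern discovery, which the real feature delegates to a model, is left out.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MemoryEntry:
    """Hypothetical memory record; the real Memory Store schema is not shown here."""
    key: str       # what the entry is about, e.g. "api-spec"
    text: str      # the remembered content
    updated: date  # when the entry was last written


def consolidate(entries: list[MemoryEntry]) -> list[MemoryEntry]:
    """Return a new list with one entry per key, keeping the most recent one.

    Exact duplicates collapse into a single entry, and an older entry with the
    same key is treated as stale. The input list is never mutated, mirroring
    the "original store is not modified" rule.
    """
    latest: dict[str, MemoryEntry] = {}
    for entry in entries:
        current = latest.get(entry.key)
        if current is None or entry.updated > current.updated:
            latest[entry.key] = entry
    return sorted(latest.values(), key=lambda e: e.key)


# Illustrative data only.
old_store = [
    MemoryEntry("project-layout", "src/ holds services, tests/ mirrors it", date(2026, 3, 1)),
    MemoryEntry("project-layout", "src/ holds services, tests/ mirrors it", date(2026, 3, 1)),
    MemoryEntry("api-spec", "v1: POST /orders returns 200", date(2026, 3, 5)),
    MemoryEntry("api-spec", "v2: POST /orders returns 201", date(2026, 4, 2)),
]
new_store = consolidate(old_store)
for entry in new_store:
    print(entry.key, "->", entry.text)
```

Returning a fresh list instead of editing `old_store` in place is what makes a later review step possible: both versions still exist and can be compared.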

2.3 Technical Implementation

import time

from anthropic import Anthropic

client = Anthropic()

# Start a Dreaming run
dream = client.beta.dreams.create(
    model="claude-sonnet-4-6",  # or claude-opus-4-7
    memory_store="ms_abc123",
    instructions="Focus on patterns related to tool calls",
)

# Poll until the run leaves the pending/running states, pausing between requests
while dream.status in ("pending", "running"):
    time.sleep(5)
    dream = client.beta.dreams.retrieve(dream.id)

# ID of the new Memory Store produced by the run
new_store_id = dream.outputs[0].id

2.4 Key Design Principles

1. Original Data Is Never Modified

Dreaming never modifies the memory store it was given as input. It produces a brand-new Memory Store instance, so developers can preview the differences between old and new and then decide whether to apply or discard the result.
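
One way to picture the preview step is a plain text diff between the two stores. The sketch below uses Python's standard `difflib` on two lists of strings. The entries and the `preview_diff` helper are made up for illustration, and the platform's own diff view may look quite different.

```python
import difflib


def preview_diff(old_lines: list[str], new_lines: list[str]) -> list[str]:
    """Return unified-diff lines describing how the new store differs from the old."""
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile="memory_store (input)",
            tofile="memory_store (dream output)",
            lineterm="",
        )
    )


# Illustrative entries only; real memory stores are not plain string lists.
old_memory = [
    "project-layout: src/ holds services",
    "api-spec: v1, POST /orders returns 200",
]
new_memory = [
    "project-layout: src/ holds services",
    "api-spec: v2, POST /orders returns 201",
    "pattern: agents retry failed tool calls once before escalating",
]

diff = preview_diff(old_memory, new_memory)
print("\n".join(diff))
```

Lines starting with `-` exist only in the input store and lines starting with `+` exist only in the output, which is the information a reviewer needs before choosing apply or discard.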

2. Real-time Monitoring

Once a Dream task enters the running state, developers can subscribe to its event stream and watch in real time which memory the model is reading and which new entries it is writing. If something looks wrong, they can "wake it up" (cancel the run) at any point.

"You lean over the AI's bedside and watch it dream."

2.5 Real Case: Moon Landing Mission

Initial scenario: 6 candidate landing sites; 2 of them crashed in the first run

Trigger Dreaming: Select the Opus 4.7 model and click "Start dreaming"

Result:

Duration: 8 minutes
Processed: 5.3 million tokens of session history
Output: A 98-line "Lumara Descent Commander's Playbook"

Day 2 verification: Both previously failing sites were fixed, and the overall safety score rose from 67% to 100%
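
The 67% figure is consistent with the site counts above if the safety score is simply the share of candidate sites that landed safely. That reading is an assumption, since the score is not defined here, but under it the arithmetic works out as follows:

```python
# Assumption: safety score = sites landed safely / candidate sites.
candidate_sites = 6
crashed_first_run = 2   # from the first run described above
crashed_second_run = 0  # both failing sites were fixed after Dreaming


def safety_score(total: int, crashed: int) -> float:
    """Fraction of candidate sites that did not crash."""
    return (total - crashed) / total


first_run = safety_score(candidate_sites, crashed_first_run)
second_run = safety_score(candidate_sites, crashed_second_run)
print(f"{first_run:.0%} -> {second_run:.0%}")
```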

III. Outcomes: Independent Evaluator Mechanism

3.1 Core Problem: Self-Evaluation Bias

LLM agents have a built-in bias: they tend to believe they are finished. Traditional evaluation falls short because the agent judges its work inside its own context, constrained by the same reasoning that produced the work.

In effect, the defendant is acting as the judge.

3.2 Solution: Independent Evaluator

Outcomes addresses this by assigning an independent Evaluator Agent to act as the judge.

The developer defines a scoring rubric
The system assigns an independent evaluator that works in its own context
The evaluator scores item by item and sends failing work back for revision
The loop repeats until the work passes or the iteration limit is reached (default 3, maximum 20)
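
The loop described above can be sketched in a few lines. Everything here is a stand-in: `grade`, `run_with_outcomes`, and `toy_worker` are hypothetical names, the "agent" is a plain function, and the rubric is a dictionary of predicates instead of a model-based grader. The point it illustrates is that the grader sees only the draft and the rubric, never the worker's own reasoning.

```python
from typing import Callable

DEFAULT_MAX_ITERATIONS = 3  # default stated above
HARD_LIMIT = 20             # maximum stated above

Rubric = dict[str, Callable[[str], bool]]


def grade(draft: str, rubric: Rubric) -> list[str]:
    """Return the names of rubric items the draft fails."""
    return [name for name, check in rubric.items() if not check(draft)]


def run_with_outcomes(
    produce: Callable[[list[str]], str],
    rubric: Rubric,
    max_iterations: int = DEFAULT_MAX_ITERATIONS,
) -> tuple[str, int, bool]:
    """Alternate between producing a draft and grading it.

    Returns (final draft, attempts used, whether the rubric passed).
    """
    if not 1 <= max_iterations <= HARD_LIMIT:
        raise ValueError("max_iterations must be between 1 and 20")
    feedback: list[str] = []
    draft = ""
    for attempt in range(1, max_iterations + 1):
        draft = produce(feedback)        # worker revises using the last feedback
        feedback = grade(draft, rubric)  # grader sees only the draft and rubric
        if not feedback:
            return draft, attempt, True
    return draft, max_iterations, False


def toy_worker(feedback: list[str]) -> str:
    """Pretend agent: adds a summary only after being told it is missing."""
    draft = "Quarterly report"
    if "has_summary" in feedback:
        draft += "\nSummary: revenue grew."
    return draft


rubric: Rubric = {
    "has_title": lambda d: d.startswith("Quarterly"),
    "has_summary": lambda d: "Summary:" in d,
}
draft, attempts, passed = run_with_outcomes(toy_worker, rubric)
print(attempts, passed)
```

The toy worker fails the first pass because its draft has no summary, receives that item as feedback, and passes on the second attempt.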

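3.3 Rubric Example

# Document Generation Rubric

## Functional completeness
- [ ] CSV file contains a price column
- [ ] price column values are numeric
- [ ] At least 100 data rows

## Format
- [ ] File encoding is UTF-8
- [ ] Column header names are correct
- [ ] No blank lines

## Quality
- [ ] No duplicate rows
- [ ] No missing values
- [ ] Values fall within a reasonable range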
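
Several items in this rubric are mechanical enough to check without a model. The sketch below is a deterministic stand-in for a grader, not the Outcomes evaluator itself. The `check_csv` helper is hypothetical, and it deliberately skips the two items that need domain knowledge (correct header names and reasonable value ranges).

```python
import csv
import io


def is_number(value) -> bool:
    """True if the cell can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def check_csv(data: bytes, min_rows: int = 100) -> dict[str, bool]:
    """Evaluate the mechanical rubric items against raw CSV bytes."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return {"encoding is UTF-8": False}

    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    fields = reader.fieldnames or []
    has_price = "price" in fields

    return {
        "encoding is UTF-8": True,
        "has price column": has_price,
        "price values are numeric": has_price and all(is_number(r["price"]) for r in rows),
        "enough data rows": len(rows) >= min_rows,
        "no blank lines": all(line.strip() for line in text.splitlines()),
        "no duplicate rows": len({tuple(map(str, r.values())) for r in rows}) == len(rows),
        "no missing values": all(v not in (None, "") for r in rows for v in r.values()),
    }


# Tiny illustrative file; min_rows is lowered so the sample stays short.
sample = "name,price\nwidget,9.99\ngadget,abc\n".encode("utf-8")
report = check_csv(sample, min_rows=2)
for item, ok in report.items():
    print("[x]" if ok else "[ ]", item)
```

In this sample only the numeric check fails, because `abc` is not a number. An item-by-item report like this is the kind of feedback that gets returned to the agent for revision.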

3.4 Impact Data

Metric	Improvement
Overall task success rate	+10%
docx generation accuracy	+8.4%
pptx generation accuracy	+10.1%

The harder the problem, the larger the gain: complex tasks are more prone to omissions and errors, so the evaluator has more to catch.

IV. Multi-Agent Orchestration

4.1 Architecture Design

Multi-Agent orchestration uses a lead-sub architecture:

Lead Agent
├── Task decomposition
├── Sub-agent dispatch
├── Context integration
└── Final delivery

Sub-Agents
├── Specialist A (domain A)
├── Specialist B (domain B)
└── ...

Each sub-agent:

Runs in its own thread
Has its own context
Can be configured with a different model
Can be configured with a different system prompt
Can be configured with its own tool set
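
The lead-sub pattern above can be sketched locally with threads. This is an illustration only, not the platform's orchestration API. `SubAgent` and `lead_agent` are hypothetical, each sub-agent's `handle` function stands in for a model call, and the model names are placeholder labels.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class SubAgent:
    """Hypothetical specialist with its own model label, prompt, and handler."""
    name: str
    model: str                    # placeholder label, not a real model ID
    system_prompt: str
    handle: Callable[[str], str]  # stands in for the actual model call


def lead_agent(task: str, specialists: dict[str, SubAgent]) -> str:
    """Decompose a task, dispatch subtasks in parallel, and integrate results."""
    # 1. Task decomposition (trivial here: one subtask per specialist).
    subtasks = {name: f"{task} [{name} part]" for name in specialists}

    # 2. Sub-agent dispatch: each specialist runs in its own thread.
    with ThreadPoolExecutor(max_workers=len(specialists)) as pool:
        futures = {
            name: pool.submit(specialists[name].handle, subtask)
            for name, subtask in subtasks.items()
        }
        results = {name: future.result() for name, future in futures.items()}

    # 3. Context integration and final delivery.
    return "\n".join(f"{name}: {results[name]}" for name in sorted(results))


specialists = {
    "research": SubAgent(
        "research", "model-a", "You gather facts.", lambda s: f"notes for {s}"
    ),
    "writing": SubAgent(
        "writing", "model-b", "You write prose.", lambda s: f"draft for {s}"
    ),
}
report = lead_agent("launch memo", specialists)
print(report)
```

Because each handler receives only its own subtask string, the sub-agents share no context with one another. The lead is the only place where results are combined.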

4.2 Real Cases

Harvey (legal AI): 6x improvement in task completion rate

Netflix log analysis: Processes hundreds of build sources in parallel and surfaces only the patterns that recur across multiple applications

Spiral writing agent (by Every):

The lead agent runs on Haiku: it receives requests, asks clarifying questions, and coordinates the writing
The sub-agents run on Opus: they do the actual writing
Outcomes is used to make sure every draft meets editorial standards

V. Combined Impact of the Three Features

5.1 Complementary Relationship

Feature	Problem solved	Status
Dreaming	Cross-session memory decay	Research preview
Outcomes	Unstable output quality	Public beta
Multi-Agent	Complex task handling	Public beta

5.2 Value Evolution

Phase	Capability
Phase 1	Runtime hosting
Phase 2	Memory Store
Phase 3	Three features: self-improvement, quality assurance, complex task handling

VI. Summary

Key Points

Anthropic has given AI its own REM sleep: agents can use their "off" time to organize memories, discover patterns, and improve themselves. Combined with an independent grader and multi-agent collaboration, this moves AI agents from "it runs" to "it is usable."

Implications for KanBao AI

Memory evolution: The current knowledge base is static and needs a dynamic optimization mechanism
Quality assurance: Review is currently manual; consider establishing automated scoring rubrics
Multi-agent collaboration: The system is currently single-agent; consider introducing specialist agents in the future