🎯 Key Takeaways
1. Dreaming: Letting Agents Improve While They "Sleep"
- Problem: The memory store accumulates duplicate, stale, and contradictory entries, and the agent has no way to notice
- Solution: A scheduled asynchronous job reads the memory store plus up to 100 past sessions and generates a new memory store
- Three operations: merge duplicates, replace stale entries, surface higher-level patterns
- Impact: Harvey's legal AI saw a 6x improvement in task completion rate
2. Outcomes: An Independent Grader
- Problem: Agents tend to believe they have finished, so their own quality assessment is not objective
- Solution: An independent evaluator (Grader) scores the work in its own separate context
- Impact: +10% task success rate, +8.4% docx accuracy, +10.1% pptx accuracy
3. Multi-Agent: Lead/Sub-Agent Collaboration
- Architecture: Lead Agent + Specialist Sub-Agents
- Features: parallel work, a shared filesystem, end-to-end traceability
- Case: the moon landing simulation improved from 67% to 100%
I. Technical Background & Problem Definition
1.1 Positioning of Claude Managed Agents
Claude Managed Agents is a managed agent platform that Anthropic launched on April 8, 2026, to help developers build and deploy AI agents quickly. Its headline claim is "10x acceleration": modular agent templates, an integrated memory store, and an enhanced orchestration framework mean developers do not have to build the infrastructure themselves.
| Dimension | Traditional approach | Managed Agents |
|---|---|---|
| Development cycle | Weeks | Minutes |
| Infrastructure | Self-built | Managed service |
| Memory management | Implemented manually | Memory Store |
1.2 Three Core Pain Points
Pain Point 1: Cross-Session Memory Decay
Any long-running agent eventually runs into context bloat:
- The same project structure is relearned 50 times across different sessions
- An API spec set a month ago has since been updated, but memory still holds the old version
- Implicit patterns that span projects are never consolidated
Pain Point 2: Unstable Output Quality
LLM agents have a built-in bias: they tend to believe they have finished. In-session self-assessment amounts to "letting the defendant be the judge."
Pain Point 3: Complex Tasks Exceed a Single Agent's Capability
When a task spans several knowledge domains or needs parallel processing, the limits of a single agent become obvious.
II. Dreaming: Offline Self-Improvement Mechanism
2.1 Core Principles
Dreaming is inspired by human REM sleep. During the day the brain takes in raw information and stores it as short-term memory; during REM sleep at night it replays the day's experiences, strengthens valuable connections, discards useless information, and consolidates the rest into long-term memory.
Dreaming surfaces patterns that a single agent can't see on its own, including recurring mistakes, workflows that agents converge on, and preferences shared across a team.
2.2 Workflow
- Trigger: on a schedule, or when a session-count threshold is reached
- Read: the existing Memory Store plus up to 100 past sessions
- Execute:
  - Merge duplicates
  - Replace stale entries
  - Surface higher-level patterns
- Output: a new Memory Store (the original store is not modified)
- Review: the developer reviews the result, or it is applied automatically
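The merge and replace steps above can be illustrated with a toy, rule-based sketch. A real dream run is carried out by a model rather than by key matching, and pattern discovery is left out here because it has no simple rule-based equivalent. The `MemoryEntry` type, its `key`, `text`, and `updated` fields, and the `consolidate` function are hypothetical names introduced only for this illustration. The sketch shows the data flow: duplicates collapse, the newest entry per key wins, and the input list is left untouched.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MemoryEntry:
    key: str        # hypothetical topic identifier, e.g. "api.spec"
    text: str
    updated: date


def consolidate(entries: list[MemoryEntry]) -> list[MemoryEntry]:
    """Return a new list with duplicates merged and stale entries replaced.

    The input list is never mutated, mirroring the "original unchanged" rule.
    """
    newest: dict[str, MemoryEntry] = {}
    for entry in entries:
        current = newest.get(entry.key)
        # Keep the most recently updated entry per key; exact repeats collapse.
        if current is None or entry.updated > current.updated:
            newest[entry.key] = entry
    return sorted(newest.values(), key=lambda e: e.key)


store = [
    MemoryEntry("project.layout", "src/ holds services", date(2026, 3, 1)),
    MemoryEntry("project.layout", "src/ holds services", date(2026, 3, 1)),
    MemoryEntry("api.spec", "v1: POST /orders", date(2026, 3, 2)),
    MemoryEntry("api.spec", "v2: POST /v2/orders", date(2026, 4, 2)),
]
new_store = consolidate(store)
```

Returning a fresh list instead of editing in place is what makes a later diff between old and new stores possible.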
2.3 Technical Implementation
import time

from anthropic import Anthropic

client = Anthropic()

# Trigger a dream run
dream = client.beta.dreams.create(
    model="claude-sonnet-4-6",  # or claude-opus-4-7
    memory_store="ms_abc123",
    instructions="Focus on patterns related to tool calls",
)

# Poll until the run finishes, pausing between requests
while dream.status in ("pending", "running"):
    time.sleep(5)
    dream = client.beta.dreams.retrieve(dream.id)

# ID of the new Memory Store produced by the run
new_store_id = dream.outputs[0].id
2.4 Key Design Principles
1. Original Data Is Never Modified
Dreaming never modifies the input memory store. It produces a brand-new Memory Store instance; developers can preview the diff between old and new and then choose to apply or discard it.
2. Real-Time Monitoring
Once a dream task enters the running state, developers can subscribe to its event stream and watch in real time which memory the AI is reading and which new entries it is writing. If something looks wrong, they can "wake it up" (cancel the task) at any point.
"You're leaning over the AI's bedside, watching it dream."
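The watch-and-cancel idea can be sketched without touching any real API. Everything below is an assumption made for illustration: the event dictionaries, the `fake_dream_events` generator that stands in for a live stream, and the `watch` helper. The source does not describe the actual event schema or cancel call, so the sketch only shows the control flow of stopping at the first suspicious write.

```python
from typing import Callable, Iterable, Iterator


def fake_dream_events() -> Iterator[dict]:
    """Stand-in for a live event stream; the event shape is an assumption."""
    yield {"type": "memory_read", "entry": "api.spec v1"}
    yield {"type": "memory_write", "entry": "api.spec v2 supersedes v1"}
    yield {"type": "memory_write", "entry": "delete all deployment notes"}
    yield {"type": "memory_write", "entry": "never reached if cancelled"}


def watch(
    events: Iterable[dict],
    is_suspicious: Callable[[dict], bool],
) -> tuple[list[dict], bool]:
    """Consume events until one looks wrong; return (events seen, cancelled?)."""
    seen: list[dict] = []
    for event in events:
        seen.append(event)
        if is_suspicious(event):
            # A real client would invoke the platform's cancel operation here.
            return seen, True
    return seen, False


seen, cancelled = watch(
    fake_dream_events(),
    lambda e: e["type"] == "memory_write" and "delete all" in e["entry"],
)
```

Because the generator is consumed lazily, the fourth event is never produced once the watcher decides to cancel.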
2.5 Real Case: Moon Landing Mission
Initial scenario: 6 candidate landing sites; after the first run, landings at 2 of them had crashed
Triggering Dreaming: select the Opus 4.7 model and click "Start dreaming"
Result:
- Duration: 8 minutes
- Processed: 5.3 million tokens of session history
- Output: a 98-line "Lumara Descent Commander's Playbook"
Next-day verification: both previously failing sites were fixed, and the overall safety score rose from 67% to 100%
III. Outcomes: Independent Evaluator Mechanism
3.1 Core Problem: Self-Evaluation Bias
LLM agents have a built-in bias: they tend to believe they have finished. The limitation of traditional evaluation is that the agent grades its work inside its own context, so the verdict is constrained by the same reasoning that produced the work.
In essence, the defendant is acting as the judge.
3.2 Solution: Independent Evaluator
The Outcomes approach is to send in an independent Evaluator Agent to act as the judge.
- The developer defines the scoring criteria (rubric)
- The system assigns an independent evaluator that grades in its own context
- The evaluator scores item by item and sends failing work back for revision
- The loop repeats until the work passes or the iteration limit is reached (default 3, maximum 20)
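The grade-and-revise loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the platform's implementation: `run_with_grader`, the `Rubric` alias, and `toy_worker` are hypothetical names, and plain Python predicates stand in for a model-based grader. The only values taken from the text are the default of 3 iterations and the ceiling of 20.

```python
from typing import Callable

DEFAULT_MAX_ITERATIONS = 3   # default stated in the text
HARD_MAX_ITERATIONS = 20     # maximum stated in the text

Rubric = dict[str, Callable[[str], bool]]


def run_with_grader(
    produce: Callable[[list[str]], str],
    rubric: Rubric,
    max_iterations: int = DEFAULT_MAX_ITERATIONS,
) -> tuple[str, list[str], int]:
    """Produce a draft, grade it item by item, and revise until it passes.

    Returns (final draft, rubric items still failing, iterations used).
    """
    if not 1 <= max_iterations <= HARD_MAX_ITERATIONS:
        raise ValueError("max_iterations must be between 1 and 20")
    failures: list[str] = []
    draft = ""
    for iteration in range(1, max_iterations + 1):
        draft = produce(failures)  # the worker sees only the grader's feedback
        failures = [name for name, check in rubric.items() if not check(draft)]
        if not failures:
            return draft, [], iteration
    return draft, failures, max_iterations


def toy_worker(feedback: list[str]) -> str:
    """Toy worker that fixes whatever the grader flagged last round."""
    draft = "report"
    if "has title" in feedback:
        draft = "Title: " + draft
    return draft


rubric: Rubric = {
    "has title": lambda d: d.startswith("Title:"),
    "mentions report": lambda d: "report" in d,
}
final, remaining, used = run_with_grader(toy_worker, rubric)
```

The grader never sees the worker's reasoning, only the draft, which is the separation of contexts the section describes.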
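3.3 Rubric Example
# Document generation rubric
## Functional completeness
- [ ] The CSV file contains a price column
- [ ] Values in the price column are numeric
- [ ] There are at least 100 data rows
## Format
- [ ] The file is encoded as UTF-8
- [ ] Column header names are correct
- [ ] There are no blank lines
## Quality
- [ ] No duplicate rows
- [ ] No missing values
- [ ] Values fall within a reasonable range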
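Several of these rubric items are mechanical enough to check with ordinary code, which can help when drafting a rubric that a grader will later apply. The sketch below uses only the standard `csv` module; `grade_csv` and its result keys are hypothetical names, and the encoding, header-name, and value-range items are left out because the rubric does not specify the expected headers or bounds.

```python
import csv
import io


def grade_csv(text: str, min_rows: int = 100) -> dict[str, bool]:
    """Check CSV text against a subset of the rubric items."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    fields = reader.fieldnames or []
    has_price = "price" in fields

    def is_number(value: object) -> bool:
        try:
            float(value)  # type: ignore[arg-type]
            return True
        except (TypeError, ValueError):
            return False

    return {
        "has price column": has_price,
        "price is numeric": has_price and all(is_number(r["price"]) for r in rows),
        "at least min_rows rows": len(rows) >= min_rows,
        "no blank lines": all(line.strip() for line in text.splitlines()),
        "no duplicate rows": len({tuple(str(v) for v in r.values()) for r in rows}) == len(rows),
        "no missing values": all(v not in (None, "") for r in rows for v in r.values()),
    }


sample = "id,price\n" + "".join(f"{i},{i * 1.5}\n" for i in range(100))
report = grade_csv(sample)
bad_report = grade_csv("id,price\n1,abc\n1,abc\n")
```

Each key in the returned dictionary maps to one checkbox, so a failing item can be reported back by name.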
3.4 Impact Data
| Metric | Improvement |
|---|---|
| Overall task success rate | +10% |
| docx document generation accuracy | +8.4% |
| pptx slide generation accuracy | +10.1% |
The harder the problem, the larger the gain: complex tasks are more prone to omissions and errors, so the evaluator has more to catch.
IV. Multi-Agent Orchestration
4.1 Architecture Design
Multi-agent orchestration uses a lead/sub-agent architecture:
Lead Agent
├── Task decomposition
├── Sub-agent scheduling
├── Context integration
└── Final delivery
Sub-Agents
├── Specialist A (domain A)
├── Specialist B (domain B)
└── ...
Each sub-agent:
- Runs in its own thread
- Has its own context
- Can be configured with a different model
- Can be configured with a different system prompt
- Can be configured with a dedicated tool set
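The fan-out pattern described above can be sketched with Python's standard thread pool. This is an illustration only: `SubAgentConfig`, `run_sub_agent`, and `lead_agent` are hypothetical names, the model strings are placeholder labels rather than real model IDs, and a string-formatting stub stands in for the actual model call.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class SubAgentConfig:
    name: str
    model: str            # placeholder label, not a real model ID
    system_prompt: str
    tools: list[str] = field(default_factory=list)


def run_sub_agent(config: SubAgentConfig, subtask: str) -> str:
    """Stand-in for a model call; each call builds its own private context."""
    context = [config.system_prompt, subtask]  # not shared with other agents
    return f"{config.name} ({config.model}) handled: {context[-1]}"


def lead_agent(task: str, specialists: list[SubAgentConfig]) -> list[str]:
    """Split the task, fan out to sub-agents in threads, collect results in order."""
    subtasks = [f"{task} / {s.name} part" for s in specialists]
    with ThreadPoolExecutor(max_workers=len(specialists)) as pool:
        return list(pool.map(run_sub_agent, specialists, subtasks))


specialists = [
    SubAgentConfig("legal", "small-model", "You review contracts.", ["search"]),
    SubAgentConfig("finance", "large-model", "You audit figures.", ["calculator"]),
]
results = lead_agent("Assess merger", specialists)
```

`pool.map` returns results in submission order, so the lead agent can integrate them deterministically even though the sub-agents run concurrently.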
4.2 Real Cases
Harvey legal AI: 6x improvement in task completion rate
Netflix log analysis: processes logs from hundreds of build sources in parallel and surfaces only the patterns that recur across multiple applications
Spiral writing agent (by Every):
- The lead agent runs on Haiku: it receives requests, asks clarifying questions, and dispatches writing work
- Sub-agents run on Opus: they do the actual writing
- Outcomes is used to make sure every draft meets the editorial standard
V. Combined Impact of the Three Features
5.1 Complementary Relationship
| Feature | Problem solved | Status |
|---|---|---|
| Dreaming | Cross-session memory decay | Research preview |
| Outcomes | Unstable output quality | Public beta |
| Multi-Agent | Complex task handling | Public beta |
5.2 Value Evolution
| Phase | Capability |
|---|---|
| Phase 1 | Runtime hosting |
| Phase 2 | Memory Store |
| Phase 3 | The three features: self-improvement, quality assurance, complex task handling |
VI. Summary
Key Points
Anthropic has given AI its own REM sleep: agents can organize memories, discover patterns, and improve themselves during "off" hours. Combined with an independent grader and multi-agent collaboration, this moves AI agents from "it runs" to "it is usable."
Implications for KanBao AI
- Memory evolution: the current knowledge base is static and needs a dynamic optimization mechanism
- Quality assurance: we currently rely on manual review and should consider automated scoring rubrics
- Multi-agent collaboration: we currently run a single agent and may introduce collaborating expert agents in the future
🔗 Related Links
Official Resources
- New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration
- Code with Claude DevDay
- Claude Platform Documentation
Academic Background
- Sleep-time Compute: Beyond Inference Scaling at Test-time - UC Berkeley / Letta