DeepSeek-V4 Deep Dive
An Architecture Revolution for the Million-Token Era: Sparse Attention, Manifold Constraints, and Storage-Compute Separation
🎯 Key Takeaways
1. Million-Token Context as the Default
- The entire V4 series defaults to a 1M (one million token) context window
- Maximum output length is 384K tokens
- Enough to read the entire Three-Body Problem trilogy in one pass
2. A Win for Architectural Efficiency
- V4-Flash activates only 13B parameters, yet outperforms V3.2 with its 37B active parameters
- At 1M context, the KV cache drops to 7% of V3.2's
- The gains come from architectural innovation, not from stacking parameters
3. Three Original Technologies
- CSA+HCA: Hybrid Compressed Attention
- mHC: Manifold-Constrained Hyper-Connections
- Engram: Storage-Compute Separation Memory
📖 Main Content
I. Product Positioning: A Two-Tier Strategy
DeepSeek-V4 ships in two versions, following a Pro/Standard tiering similar to smartphone chips:
| Spec | V4-Pro | V4-Flash |
|---|---|---|
| Total params | 1.6T | 284B |
| Active params | 49B | 13B |
| Context | 1M | 1M |
| Max output | 384K | 384K |
| Output price | 24 yuan / 1M tokens | 2 yuan / 1M tokens |
Pricing strategy:
- V4-Flash: extreme cost-performance; on par with Pro on simple tasks
- V4-Pro: flagship performance, close to Opus 4.6 in non-thinking mode
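As a rough illustration of the price gap between the two tiers, the sketch below computes the cost of a maximum-length output on each one. The prices are the output prices listed for V4-Pro and V4-Flash; the function name `output_cost_yuan` is hypothetical, and reading "384K" as 384,000 tokens is an assumption.

```python
# Output prices in yuan per one million tokens, as listed for each tier.
PRICE_PER_MILLION_YUAN = {"V4-Pro": 24.0, "V4-Flash": 2.0}

# Assumption: "384K" is read as 384,000 tokens.
MAX_OUTPUT_TOKENS = 384_000


def output_cost_yuan(model: str, output_tokens: int) -> float:
    """Return the output cost in yuan for `output_tokens` generated tokens."""
    return PRICE_PER_MILLION_YUAN[model] * output_tokens / 1_000_000


if __name__ == "__main__":
    for model in PRICE_PER_MILLION_YUAN:
        cost = output_cost_yuan(model, MAX_OUTPUT_TOKENS)
        print(f"{model}: {cost:.3f} yuan for a maximum-length output")
```

At these list prices the Pro tier costs 12 times as much per output token as Flash, which is why routing simple tasks to Flash matters for budgets.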
II. Three Architectural Innovations
2.1 CSA + HCA: Hybrid Compressed Attention
CSA (Compressed Sparse Attention):
- Compresses the KV of every 4 tokens into 1 entry
- Uses the Lightning Indexer to sparsely select the top-k entries for full attention computation
- Keeps a 128-token sliding window to preserve local detail
HCA (Heavily Compressed Attention):
- More aggressive: compresses every 128 tokens into 1 entry
- No sparsification; attention stays dense
- Handles ultra-long-range global semantics
| Efficiency comparison | V4-Pro | V4-Flash |
|---|---|---|
| Inference FLOPs | 27% of V3.2 | 10% of V3.2 |
| KV cache | 10% of V3.2 | 7% of V3.2 |
| KV cache vs. GQA8 baseline | 2% of the baseline | |
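The CSA and HCA steps described in this subsection can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not DeepSeek's implementation: mean pooling stands in for the unspecified compressor, a plain dot product stands in for the Lightning Indexer, keys and values are merged into one array for brevity, and all function names and the `top_k=8` default are hypothetical.

```python
import numpy as np


def compress_kv(kv: np.ndarray, block: int) -> np.ndarray:
    """Pool every `block` consecutive token vectors into one entry.

    Assumption: mean pooling; the notes do not say how entries are compressed.
    Trailing tokens that do not fill a whole block are dropped in this sketch.
    """
    n_tokens, dim = kv.shape
    usable = (n_tokens // block) * block
    return kv[:usable].reshape(-1, block, dim).mean(axis=1)


def csa_select(query, kv, block=4, top_k=8, window=128):
    """CSA-style selection: 4-to-1 compression, top-k entries, local window."""
    entries = compress_kv(kv, block)
    scores = entries @ query  # stand-in scorer, NOT the real Lightning Indexer
    k = min(top_k, len(entries))
    top_idx = np.sort(np.argsort(scores)[-k:])  # k best entries, in position order
    local = kv[-window:]  # uncompressed sliding window over the latest tokens
    return entries[top_idx], local


def hca_entries(kv, block=128):
    """HCA-style compression: 128-to-1, every entry kept (dense attention)."""
    return compress_kv(kv, block)


rng = np.random.default_rng(0)
kv = rng.standard_normal((4096, 16))  # 4096 toy tokens, 16-dim vectors
query = rng.standard_normal(16)

selected, local = csa_select(query, kv)
dense = hca_entries(kv)
print(selected.shape, local.shape, dense.shape)
```

The design point the sketch tries to make visible is the division of labour: CSA keeps fine-grained entries but attends to only a few of them, while HCA keeps very few entries and attends to all of them.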
2.2 mHC: Manifold-Constrained Hyper-Connections
Problem solved: deep networks are prone to "signal explosion" during training, where abnormal gradients cause the run to collapse.
Core idea:
- Project using the Sinkhorn-Knopp algorithm
- Keep the spectral norm ≤ 1, which guarantees a non-expansive mapping
Results:
- Signal gain drops from 3000× to 1.6×
- Training stability improves substantially
- Training time increases by only 6.7%
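To make the link between Sinkhorn-Knopp and the spectral-norm bound concrete, here is a minimal NumPy sketch. It relies on two facts: Sinkhorn-Knopp normalisation drives a positive matrix toward a doubly stochastic one (all rows and columns sum to 1), and a doubly stochastic matrix has spectral norm 1. Treating the projected object as a small square mixing matrix is an assumption for illustration, and the function name and iteration count are hypothetical.

```python
import numpy as np


def sinkhorn_knopp(logits: np.ndarray, n_iters: int = 200) -> np.ndarray:
    """Alternately normalise rows and columns of a positive matrix."""
    m = np.exp(logits - logits.max())  # strictly positive entries
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)  # make every row sum to 1
        m = m / m.sum(axis=0, keepdims=True)  # make every column sum to 1
    return m


rng = np.random.default_rng(0)
raw = rng.standard_normal((4, 4))  # unconstrained parameters (toy example)
mixing = sinkhorn_knopp(raw)

# ord=2 asks NumPy for the largest singular value, i.e. the spectral norm.
spectral_norm = np.linalg.norm(mixing, ord=2)
print("row sums:", mixing.sum(axis=1))
print("col sums:", mixing.sum(axis=0))
print("spectral norm:", spectral_norm)
```

Because the norm cannot exceed 1, repeatedly applying such a matrix across many layers cannot amplify the signal, which is the non-expansive property named in the list above.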
2.3 Muon Optimizer
Replacement:
- Most modules switch from AdamW to Muon
- At its core, Muon uses Newton-Schulz iteration for matrix orthogonalization
Result: convergence is 20-30% faster
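The orthogonalization step can be illustrated with the classic cubic Newton-Schulz iteration, sketched below in NumPy. This is a textbook variant chosen for clarity, not necessarily the polynomial or step count used in V4's training; practical Muon implementations are generally reported to use a tuned higher-order polynomial with only a few steps. The function name and the iteration count are hypothetical.

```python
import numpy as np


def newton_schulz_orthogonalize(g: np.ndarray, n_iters: int = 40) -> np.ndarray:
    """Approximate the semi-orthogonal polar factor U @ V.T of g = U S V.T.

    Uses the cubic iteration X <- 1.5 X - 0.5 X X^T X, which converges when
    all singular values of the starting point lie in (0, sqrt(3)).
    """
    # Dividing by the Frobenius norm puts every singular value in (0, 1].
    x = g / (np.linalg.norm(g) + 1e-12)
    for _ in range(n_iters):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x


rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 8))  # toy stand-in for a gradient/momentum matrix
ortho = newton_schulz_orthogonalize(grad)

# Rows of the result should be orthonormal: ortho @ ortho.T is close to I.
print(np.round(ortho @ ortho.T, 4))
```

The iteration uses only matrix multiplications, which is what makes it attractive on accelerators compared with computing an SVD for every weight matrix at every step.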
III. Engram: A Storage-Compute Separation Memory Mechanism
📌 Breakthrough revealed in the February paper: DeepSeek published the Engram paper back in February; V4 is its first deployment in a flagship model
Traditional problem
Engram's solution
Results:
| Metric | Result | Change |
|---|---|---|
| Needle-in-a-haystack accuracy | 84.2% → 97% | +12.8 percentage points |
| Throughput loss | <3% | |
| GPU memory reduction | 90% | |
IV. Post-Training: OPD Multi-Teacher Distillation
Problem with the traditional approach: SFT+RLHF mixed training, where domains pull against each other
V4's two-stage approach:
- Stage 1: a dozen or so expert models are each trained to their limit with SFT+GRPO, independently and without interfering with one another
- Stage 2: OPD distills them into a single student model that merges the experts' capabilities
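The second stage can be sketched as a loss computation. The sketch below assumes a common formulation of on-policy distillation: the student samples its own sequences, the expert teacher for that sequence's domain scores the same positions, and the student minimises the per-token reverse KL divergence to that teacher. Whether V4 uses exactly this objective is not stated in these notes; the sampling step is omitted, and all function names are hypothetical.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def reverse_kl_per_token(student_logits, teacher_logits):
    """KL(student || teacher) at each position; inputs are (T, vocab) arrays."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return (p_s * (np.log(p_s) - np.log(p_t))).sum(axis=-1)


def multi_teacher_opd_loss(batch):
    """Average reverse KL over student-sampled sequences.

    `batch` is a list of (student_logits, teacher_logits) pairs, where the
    teacher logits come from the expert matching that sequence's domain
    (assumption: one teacher per sequence).
    """
    per_sequence = [reverse_kl_per_token(s, t).mean() for s, t in batch]
    return float(np.mean(per_sequence))


rng = np.random.default_rng(0)
student_code = rng.standard_normal((5, 10))  # 5 tokens, toy vocab of 10
student_math = rng.standard_normal((7, 10))
teacher_code = rng.standard_normal((5, 10))  # logits from a "code" expert
teacher_math = rng.standard_normal((7, 10))  # logits from a "math" expert

loss = multi_teacher_opd_loss(
    [(student_code, teacher_code), (student_math, teacher_math)]
)
print("distillation loss:", loss)
```

Because each sequence is only ever compared against its own domain's expert, the domains do not compete inside a single reward signal, which is the interference the two-stage design aims to avoid.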
V. Performance Evaluation
| Benchmark | V4-Pro | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|---|
| SimpleQA | 57.9% | 46.2% | 45.3% |
| Codeforces | 3206 | - | - |
| Agentic Coding | Best open source | | |
Internal test (30 real bugs):
- V4-Pro: 67%
- Sonnet 4.5: 47%
- Close to Opus 4.6 in non-thinking mode
VI. In-Depth Optimization of Agent Capabilities
Agent products with dedicated adaptation:
- Claude Code
- OpenClaw
- OpenCode
- CodeBuddy
Capability improvements:
- Improvements on both coding and document-generation tasks
- More stable multi-turn tool calling
- Better coherence on long-chain tasks
VII. Deep Adaptation for Huawei Ascend
🚀 An unprecedented partnership: for the first time, DeepSeek includes Huawei Ascend in its technical report
- Supported across the full Huawei Ascend supernode series
- Ascend 950 ships in volume in the second half of the year
- Day 0 adaptation (available on release day)
| Performance | V4-Pro | V4-Flash |
|---|---|---|
| Low latency | 20 ms | 10 ms |
| Estimated utilization | 85%+ | |
VIII. Industry Impact
1. A Milestone for Open-Source Models
- First time an open-source model leads closed-source models on knowledge QA
- Agent capabilities are the best among open-source models
- Reasoning performance on par with top closed-source models
2. A Cost Revolution
- V4-Flash output costs 2 yuan per million tokens
- Roughly 1/50 the price of GPT-5.5
- Long-context cost drops to one tenth
3. A New Paradigm for Architectural Innovation
- A shift from "brute-force parameter stacking" to "refined engineering"
- Triple innovation: mHC + Engram + MoE
- A new path for the post-Scaling Law era
💭 Reflections and Practice
Takeaways for 看宝AI
1. Knowledge Base Retrieval Optimization
- Borrow Engram's storage-compute separation idea
- Consider separating hot and cold data
- Keep hot knowledge in memory and cold knowledge on disk
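The hot/cold split above can be prototyped with the standard library alone. The sketch below is one possible design, not something taken from Engram: every entry is persisted as a JSON file (the cold tier), and a small least-recently-used cache keeps recently touched entries in memory (the hot tier). The class name `HotColdStore`, its methods, and the capacity value are all hypothetical, and keys are assumed to be safe to use as file names.

```python
import json
import tempfile
from collections import OrderedDict
from pathlib import Path


class HotColdStore:
    """Two-tier store: LRU cache in memory, JSON files on disk."""

    def __init__(self, cold_dir, hot_capacity=2):
        self.cold_dir = Path(cold_dir)
        self.cold_dir.mkdir(parents=True, exist_ok=True)
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # oldest entry first, newest last

    def _path(self, key):
        # Assumption: keys are plain strings that are valid file names.
        return self.cold_dir / f"{key}.json"

    def _promote(self, key, value):
        """Mark `key` as most recently used and evict the oldest if full."""
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)  # evicted items remain on disk

    def put(self, key, value):
        self._path(key).write_text(json.dumps(value), encoding="utf-8")
        self._promote(key, value)

    def get(self, key):
        if key in self.hot:  # hot hit: no disk access
            self.hot.move_to_end(key)
            return self.hot[key]
        value = json.loads(self._path(key).read_text(encoding="utf-8"))
        self._promote(key, value)  # cold hit: load, then keep it hot
        return value


with tempfile.TemporaryDirectory() as tmp:
    store = HotColdStore(tmp, hot_capacity=2)
    for name in ("a", "b", "c"):
        store.put(name, {"doc": name})
    evicted_before = "a" not in store.hot  # "a" was pushed out of memory
    value = store.get("a")  # served from disk, then promoted
    hot_after = list(store.hot)

print(evicted_before, value, hot_after)
```

Writing through to disk on every `put` keeps eviction trivial, at the cost of slower writes; a write-back variant would trade durability for speed.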
2. Long-Context Processing
- A 1M context makes full-corpus retrieval over a knowledge base feasible
- An Agent can be designed around "knowledge base as context"
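Before building a "knowledge base as context" Agent, it helps to check whether the corpus fits in the window at all. The sketch below is a rough budget check under loud assumptions: the 4-characters-per-token ratio is a crude heuristic for English text (Chinese text and real tokenizers will differ, so a real check should use the model's tokenizer), the reserved output budget is arbitrary, and it assumes output tokens share the 1M window. All names are hypothetical.

```python
CONTEXT_LIMIT = 1_000_000  # the 1M context window described in these notes


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; assumption: about 4 characters per token."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(docs, reserved_output: int = 8_000):
    """Return (fits, estimated_total_tokens) for a list of documents.

    Assumption: the reply shares the context window, so `reserved_output`
    tokens are held back for it.
    """
    total = sum(estimate_tokens(doc) for doc in docs)
    return total <= CONTEXT_LIMIT - reserved_output, total


small_kb = ["x" * 400] * 10  # ten short documents
large_kb = ["x" * 5_000_000]  # one very large document

print(fits_in_context(small_kb))
print(fits_in_context(large_kb))
```

When the check fails, the fallback is the usual retrieval step: select a subset of documents first and place only those in context.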
3. Cost Awareness
- DeepSeek cuts cost through architectural innovation
- A knowledge base can likewise be optimized with techniques such as caching and compression
Next Steps
- [ ] Try the V4-Flash API
- [ ] Test the 1M context capability
- [ ] Study the open-source Engram implementation
- [ ] Evaluate how much room there is to optimize the knowledge base architecture
🔗 Related Links
📚 Glossary
| Term | Full Name |
|---|---|
| CSA | Compressed Sparse Attention |
| HCA | Heavily Compressed Attention |
| mHC | Manifold-Constrained Hyper-Connections |
| OPD | On-Policy Distillation |
| MoE | Mixture of Experts |
| KV Cache | Key-Value Cache |