DeepSeek-V4 Deep Dive

An Architectural Revolution for the Million-Token Era: Sparse Attention, Manifold Constraints, and Storage-Compute Separation

DeepSeek-V4 · 1M Context · MoE · Sparse Attention · Architecture Innovation

🎯 Key Takeaways

1. Million-Token Context Becomes the Default

  • The entire V4 series defaults to a 1M-token context window
  • Maximum output length is 384K tokens
  • Enough to read the whole Three-Body Problem trilogy in a single pass

2. A Win for Architectural Efficiency

  • V4-Flash activates only 13B parameters yet outperforms V3.2, which activates 37B
  • At 1M context, V4-Flash's KV cache shrinks to 7% of V3.2's
  • The gains come from architectural innovation, not from piling on parameters

3. Three Original Techniques

  • CSA + HCA: hybrid compressed attention
  • mHC: Manifold-Constrained Hyper-Connections
  • Engram: a storage-compute separation memory mechanism

📖 Main Content

I. Product Positioning: A Two-Tier Strategy

DeepSeek-V4 ships in two versions, following a Pro/Standard split similar to the one used for smartphone chips:

Spec            | V4-Pro                 | V4-Flash
Total params    | 1.6T                   | 284B
Active params   | 49B                    | 13B
Context         | 1M                     | 1M
Max output      | 384K                   | 384K
Price (output)  | 24 yuan per 1M tokens  | 2 yuan per 1M tokens
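
To make the sparsity of the two MoE models concrete, the short sketch below computes what share of the parameters is active per token. It uses only the figures in the table above; reading "T" as 1e12 and "B" as 1e9 is an assumption, and the names `SPECS` and `active_fraction` are made up for this illustration.

```python
# Figures copied from the spec table; "T" and "B" are read as 1e12 and 1e9.
SPECS = {
    "V4-Pro": {"total": 1.6e12, "active": 49e9},
    "V4-Flash": {"total": 284e9, "active": 13e9},
}


def active_fraction(model: str) -> float:
    """Share of all parameters that is activated for each token."""
    spec = SPECS[model]
    return spec["active"] / spec["total"]


for name in SPECS:
    print(f"{name}: {active_fraction(name):.1%} of parameters active per token")
```

Under these assumptions, roughly 3.1% of V4-Pro's parameters and 4.6% of V4-Flash's are active for any given token.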

Pricing Strategy:

  • V4-Flash: maximum cost-effectiveness; on par with Pro on simple tasks
  • V4-Pro: flagship performance, close to Opus 4.6 in non-thinking mode
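
As a rough illustration of what the output prices mean in practice, the sketch below prices a single maximum-length response on each model. Only the output prices come from the table; input pricing is not given here, "384K" is read as 384,000 tokens (an assumption), and `output_cost_yuan` is a hypothetical helper.

```python
# Output prices from the table above, in yuan per one million output tokens.
PRICE_PER_MILLION = {"V4-Pro": 24.0, "V4-Flash": 2.0}


def output_cost_yuan(model: str, output_tokens: int) -> float:
    """Cost of the generated tokens only; input tokens are not priced here."""
    return PRICE_PER_MILLION[model] * output_tokens / 1_000_000


MAX_OUTPUT = 384_000  # assumption: "384K" read as 384,000 tokens

pro_cost = output_cost_yuan("V4-Pro", MAX_OUTPUT)
flash_cost = output_cost_yuan("V4-Flash", MAX_OUTPUT)
print(f"V4-Pro: {pro_cost:.3f} yuan, V4-Flash: {flash_cost:.3f} yuan")
print(f"Pro costs {pro_cost / flash_cost:.0f}x as much per output token")
```

On these numbers, a full-length response costs about 9.2 yuan on V4-Pro and about 0.77 yuan on V4-Flash, a 12x gap.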

II. Three Architectural Innovations

2.1 CSA + HCA: Hybrid Compressed Attention

The Traditional Attention Bottleneck
┌─────────────────────────────────────────────────────────┐
│ O(n²) complexity: doubling the sequence length          │
│ quadruples the attention computation.                   │
│                                                         │
│ 1M tokens cost 64× the computation of 128K tokens.      │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│ V4's answer: stop treating every token equally and      │
│ learn to focus on what matters.                         │
└─────────────────────────────────────────────────────────┘
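
The 4× and 64× figures follow directly from the quadratic cost of full self-attention. A minimal check, assuming binary units (1M = 1024K) and a hypothetical helper named `attention_cost_ratio`:

```python
def attention_cost_ratio(long_len: int, short_len: int) -> float:
    """Relative cost of full self-attention, which scales with n**2."""
    return (long_len / short_len) ** 2


K = 1024  # assumption: binary units, so 1M = 1024K

doubling = attention_cost_ratio(2 * K, K)                # length doubled
million_vs_128k = attention_cost_ratio(1024 * K, 128 * K)  # 1M versus 128K
print(doubling, million_vs_128k)
```

With decimal units instead (1,000,000 versus 128,000 tokens) the ratio comes out at about 61 rather than exactly 64.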

CSA (Compressed Sparse Attention):

  • Compresses the KV of every 4 tokens into 1 entry
  • Uses the Lightning Indexer to sparsely select the top-k entries for full attention computation
  • Keeps a 128-token sliding window to preserve local detail
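
The three CSA steps can be sketched for a single query as follows. This is a toy under loud assumptions, not DeepSeek's implementation: mean pooling stands in for the (unspecified) compression operator, a plain dot product stands in for the Lightning Indexer, and `csa_attend` with all its parameters is a hypothetical name. The toy also does not bother to exclude the window tokens from the compressed set.

```python
import numpy as np


def csa_attend(q, keys, values, block=4, top_k=8, window=128):
    """Toy single-query CSA: compress, pick top-k entries, add a local window."""
    n, d = keys.shape
    n_blocks = n // block
    # 1) Compress every `block` tokens into one entry (mean pooling is an assumption).
    k_comp = keys[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    v_comp = values[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    # 2) Stand-in indexer: score compressed entries cheaply and keep the top-k.
    index_scores = k_comp @ q
    k = min(top_k, n_blocks)
    picked = np.argsort(index_scores)[-k:]
    # 3) Sliding window: the most recent `window` tokens stay uncompressed.
    k_all = np.concatenate([k_comp[picked], keys[-window:]])
    v_all = np.concatenate([v_comp[picked], values[-window:]])
    # 4) Ordinary softmax attention over the small selected set.
    logits = k_all @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v_all, len(k_all)


rng = np.random.default_rng(0)
n_tokens, dim = 4096, 16
q = rng.normal(size=dim)
keys = rng.normal(size=(n_tokens, dim))
values = rng.normal(size=(n_tokens, dim))
out, attended = csa_attend(q, keys, values)
print(out.shape, attended)
```

In this toy the query attends to 136 entries (8 selected compressed entries plus the 128-token window) instead of all 4096 tokens, which is the point of the design: the cost of the expensive attention step no longer grows with the full sequence length.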

HCA (Heavily Compressed Attention):

  • More aggressive: compresses every 128 tokens into 1 entry
  • No sparse selection; attention over the compressed entries stays dense
  • Captures global semantics across very long distances

Efficiency comparison:

Metric           | V4-Pro       | V4-Flash
Inference FLOPs  | 27% of V3.2  | 10% of V3.2
KV cache         | 10% of V3.2  | 7% of V3.2

Versus a GQA8 baseline, the KV cache is only 2% of the baseline's size.
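
To put the CSA and HCA compression ratios in perspective, the sketch below counts how many KV entries each path has to consider at full context. Reading "1M" as 1,048,576 tokens is an assumption, and `compressed_entries` is a hypothetical helper. These counts do not by themselves reproduce the 10% and 7% KV cache figures, which also depend on per-entry sizes and the layer mix, neither of which is given here.

```python
CONTEXT = 1024 * 1024  # assumption: "1M" read as 1,048,576 tokens


def compressed_entries(n_tokens: int, tokens_per_entry: int) -> int:
    """Number of KV entries left after grouping tokens (ceiling division)."""
    return -(-n_tokens // tokens_per_entry)


csa_entries = compressed_entries(CONTEXT, 4)    # CSA: 4 tokens per entry
hca_entries = compressed_entries(CONTEXT, 128)  # HCA: 128 tokens per entry
print(csa_entries, hca_entries)
```

HCA ends up with a few thousand entries, small enough to attend to densely, while CSA's larger set is the one that needs top-k selection.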

2.2 mHC: Manifold-Constrained Hyper-Connections

mHC Core Idea
  Unconstrained residual mapping → the signal is amplified exponentially as it passes from layer to layer
  mHC → constrains the residual mapping matrix to the manifold of doubly stochastic matrices
  Result: spectral norm ≤ 1, which guarantees the mapping is non-expansive

Problem solved: deep networks are prone to "signal explosion" during training, where abnormal gradients cause the run to collapse.

Core Idea:

  • Projects the mapping onto the doubly stochastic manifold with the Sinkhorn-Knopp algorithm
  • Spectral norm ≤ 1, which guarantees non-expansion
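
The link between Sinkhorn-Knopp and the spectral norm bound can be checked numerically. The sketch below is a generic Sinkhorn-Knopp normalization, not the mHC implementation: exponentiating raw values to get positive entries, the 4×4 size, the iteration count, and the function name `sinkhorn_knopp` are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_knopp(logits, n_iters=200):
    """Alternately normalize rows and columns until the matrix is ~doubly stochastic."""
    m = np.exp(logits - logits.max())  # strictly positive entries (an assumption)
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)  # rows sum to 1
        m = m / m.sum(axis=0, keepdims=True)  # columns sum to 1
    return m


rng = np.random.default_rng(0)
h = sinkhorn_knopp(rng.normal(size=(4, 4)))
spectral_norm = np.linalg.norm(h, 2)  # largest singular value

# Stack 64 independent maps, as if the signal passed through 64 layers.
stacked = np.eye(4)
for _ in range(64):
    stacked = stacked @ sinkhorn_knopp(rng.normal(size=(4, 4)))
stacked_norm = np.linalg.norm(stacked, 2)
print(round(float(spectral_norm), 6), round(float(stacked_norm), 6))
```

Because a product of doubly stochastic matrices is itself doubly stochastic, the stacked map keeps a spectral norm of about 1 no matter how many layers are multiplied together, which is the non-expansion property the bullets describe.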

Result:

  • Signal gain drops from 3000× to 1.6×
  • Training stability improves substantially
  • Training time increases by only 6.7%

2.3 Muon Optimizer

Replacement:

  • Most modules switch from AdamW to Muon
  • Its core step is matrix orthogonalization via Newton-Schulz iteration

Result: convergence is 20-30% faster
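
A minimal sketch of matrix orthogonalization by Newton-Schulz iteration is shown below. It uses the classic cubic form for clarity; Muon implementations typically use a tuned higher-order polynomial and only a few steps, so this is an illustration of the idea rather than the optimizer's actual update rule. The function name, matrix shape, and iteration count are assumptions.

```python
import numpy as np


def newton_schulz_orthogonalize(g, n_iters=30):
    """Drive every singular value of `g` toward 1 without computing an SVD."""
    x = g / (np.linalg.norm(g) + 1e-12)  # Frobenius scaling keeps singular values <= 1
    for _ in range(n_iters):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)  # classic cubic Newton-Schulz step
    return x


rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 6))  # stand-in for a gradient or momentum matrix
update = newton_schulz_orthogonalize(grad)
print(np.round(update @ update.T, 3))  # close to the 4x4 identity
```

The iteration needs only matrix multiplications, which is why it suits accelerators better than an explicit SVD.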

III. Engram: A Storage-Compute Separation Memory Mechanism

📌 A breakthrough first revealed in a February paper: DeepSeek published the Engram paper back in February, and V4 is the first flagship model to put it into production

The Traditional Problem

The Traditional Transformer's "Memory" Dilemma
┌─────────────────────────────────────────────────────────┐
│ One model handles both "memory" and "computation".      │
│                                                         │
│ Problem: "Peng Yuyan" appears 88 times, so it is        │
│ computed 88 times.                                      │
│                                                         │
│ Every inference pass runs the full set of parameters.   │
└─────────────────────────────────────────────────────────┘

The Engram Solution

Engram Dual-Path Architecture
┌─────────────────────────────────────────────────────────┐
│ Path 1: static knowledge "dictionary" (Engram table)    │
│   → 25% of parameters stored in DRAM                    │
│   → O(1) hashed N-gram lookup                           │
├─────────────────────────────────────────────────────────┤
│ Path 2: dynamic reasoning "brain" (MoE expert network)  │
│   → 75% of parameters focused on complex reasoning      │
└─────────────────────────────────────────────────────────┘
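
The "O(1) hashed N-gram lookup" on Path 1 can be pictured as a plain hash-indexed embedding table. The sketch below is purely illustrative: how Engram actually hashes N-grams, handles collisions, and merges the retrieved vector back into the model is not described here, and `NGramMemory` with all its parameters is hypothetical.

```python
import numpy as np


class NGramMemory:
    """Toy static memory: one embedding row per hashed N-gram (illustrative only)."""

    def __init__(self, n_slots=4096, dim=8, n=2, seed=0):
        rng = np.random.default_rng(seed)
        # In the article's picture this table would sit in DRAM, not GPU memory.
        self.table = rng.normal(size=(n_slots, dim)).astype(np.float32)
        self.n_slots = n_slots
        self.n = n

    def slot(self, ngram):
        """Deterministic polynomial hash of a tuple of token ids."""
        h = 0
        for token_id in ngram:
            h = (h * 1_000_003 + token_id) % self.n_slots
        return h

    def lookup(self, token_ids, pos):
        """Constant-time retrieval for the N-gram ending at position `pos`."""
        ngram = tuple(token_ids[max(0, pos - self.n + 1) : pos + 1])
        return self.table[self.slot(ngram)]


memory = NGramMemory()
tokens = [5, 7, 9, 5, 7]  # the bigram (5, 7) occurs twice
first = memory.lookup(tokens, 1)
again = memory.lookup(tokens, 4)
print(np.array_equal(first, again))  # True: same N-gram, same row
```

A repeated phrase maps to the same table row every time, so it is fetched rather than recomputed. Different N-grams can collide in a table this simple, which a real design would have to address.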

Result:

Metric                         | Improvement
Needle-in-a-haystack accuracy  | 84.2% → 97% (+12.8 percentage points)
Throughput loss                | < 3%
GPU memory reduction           | 90%

IV. Post-Training: OPD Multi-Teacher Distillation

V4 Two-Stage Approach

Stage 1: Train specialists separately
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│ Code expert  │  │ Math expert  │  │Writing expert│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ↓
Stage 2: Distill and merge
        ┌──────────────────────────────────┐
        │   OPD (On-Policy Distillation)   │
        └────────────────┬─────────────────┘
                         ↓
                ┌─────────────────┐
                │ Unified student │
                └─────────────────┘

Problem with the traditional approach: mixed SFT + RLHF training lets different domains pull against each other

V4 Two-Stage Approach:

  • Stage 1: more than ten specialist models are each trained to their limit with SFT + GRPO, without interfering with one another
  • Stage 2: OPD distills them into a single student model that merges the specialists' capabilities
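
What "on-policy" means in Stage 2 can be shown on a single next-token distribution: the student samples its own tokens, and the teacher for that domain scores them. The sketch below uses a reverse KL objective, which is one common formulation of on-policy distillation; whether V4 uses exactly this loss is an assumption, and the teacher dictionary, vocabulary size, and function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()


def reverse_kl(student_logits, teacher_logits):
    """Exact KL(student || teacher) for one next-token distribution."""
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    return float(np.sum(p_s * (np.log(p_s) - np.log(p_t))))


def on_policy_estimate(student_logits, teacher_logits, rng, n_samples=2000):
    """Monte Carlo estimate using tokens sampled from the student itself."""
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    sampled = rng.choice(len(p_s), size=n_samples, p=p_s)  # on-policy samples
    return float(np.mean(np.log(p_s[sampled]) - np.log(p_t[sampled])))


rng = np.random.default_rng(0)
# Hypothetical per-domain teachers; the student is graded by the matching one.
teachers = {"code": rng.normal(size=10), "math": rng.normal(size=10)}
student = rng.normal(size=10)
for domain, teacher in teachers.items():
    exact = reverse_kl(student, teacher)
    estimate = on_policy_estimate(student, teacher, rng)
    print(domain, round(exact, 3), round(estimate, 3))
```

The loss is zero only when the student already matches the teacher on the tokens the student actually produces, so training pressure lands on the student's own mistakes rather than on teacher-written text.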

V. Performance Evaluation

Benchmark       | V4-Pro            | Claude Opus 4.6 | GPT-5.4
SimpleQA        | 57.9%             | 46.2%           | 45.3%
Codeforces      | 3206              | -               | -
Agentic Coding  | Best open source  |                 |

Internal Test (30 Real Bugs):

  • V4-Pro: 67%
  • Sonnet 4.5: 47%
  • V4-Pro comes close to Opus 4.6 in non-thinking mode

VI. Deep Optimization of Agent Capabilities

Agent products with dedicated adaptation:

  • Claude Code
  • OpenClaw
  • OpenCode
  • CodeBuddy

Capability Improvements:

  • Gains on both coding and document-generation tasks
  • More stable multi-turn tool calling
  • Better coherence on long-chain tasks

VII. Deep Adaptation for Huawei Ascend

🚀 An unprecedented partnership: for the first time, DeepSeek writes Huawei Ascend into its technical report

  • Supported across the full Huawei Ascend supernode lineup
  • Ascend 950 ships in volume in the second half of the year
  • Day 0 adaptation (available on release day)

Performance       | V4-Pro | V4-Flash
Low latency       | 20ms   | 10ms
Est. utilization  | 85%+

VIII. Industry Impact

1. A Milestone for Open-Source Models

  • First time an open-source model leads closed-source models on knowledge QA
  • Agent capabilities are the best among open-source models
  • Reasoning performance is on par with top closed-source models

2. A Cost Revolution

  • V4-Flash output costs 2 yuan per million tokens
  • Roughly 1/50 the price of GPT-5.5
  • Long-context cost drops to one tenth

3. A New Paradigm for Architectural Innovation

  • A shift from "brute-force parameter scaling" to "refined engineering"
  • Triple innovation: mHC + Engram + MoE
  • Opens a new path for the post-Scaling-Law era

💭 Reflections and Practice

Lessons for 看宝AI

1. Knowledge Base Retrieval Optimization

  • Borrow Engram's storage-compute separation idea
  • Consider separating hot and cold data
  • Keep hot knowledge in memory and cold knowledge on disk
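
One simple way to try the hot/cold idea is a small least-recently-used cache in front of the slower store. The sketch below is a minimal illustration: the class name `HotColdStore`, the capacity of 2, and the use of a plain dictionary as a stand-in for disk are all assumptions.

```python
from collections import OrderedDict


class HotColdStore:
    """Small in-memory LRU cache in front of a slower 'cold' store."""

    def __init__(self, cold, hot_capacity=2):
        self.cold = cold                  # stand-in for disk; any mapping works
        self.hot = OrderedDict()          # least recently used entry comes first
        self.hot_capacity = hot_capacity
        self.cold_reads = 0

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)     # mark as most recently used
            return self.hot[key]
        value = self.cold[key]            # slow path
        self.cold_reads += 1
        self.hot[key] = value
        if len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)  # evict the least recently used entry
        return value


store = HotColdStore({"a": "doc A", "b": "doc B", "c": "doc C"})
for key in ["a", "b", "a", "c"]:
    store.get(key)
print(list(store.hot), store.cold_reads)  # ['a', 'c'] 3
```

Frequently requested entries stay in memory, and only the first access to each entry pays the cold-read cost.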

2. Long-Context Processing

  • A 1M context makes it feasible to load an entire knowledge base for retrieval
  • This opens the door to a "knowledge base as context" Agent design
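
Before committing to a "knowledge base as context" design, it helps to check whether the documents fit at all. The sketch below is a simple budget check: reading "1M" as 1,000,000 tokens, the 50,000-token reserve for the prompt and answer, and the function name are assumptions, and real token counts would have to come from the model's own tokenizer.

```python
CONTEXT_LIMIT = 1_000_000  # assumption: "1M" read as 1,000,000 tokens


def fits_in_context(doc_token_counts, reserved_tokens=50_000, limit=CONTEXT_LIMIT):
    """Return (fits, tokens_left) after reserving room for the prompt and answer."""
    budget = limit - reserved_tokens
    used = sum(doc_token_counts)
    return used <= budget, budget - used


small_kb = fits_in_context([300_000, 400_000])
large_kb = fits_in_context([600_000, 500_000])
print(small_kb)  # (True, 250000)
print(large_kb)  # (False, -150000)
```

When the check fails, the knowledge base still needs a retrieval or compression step in front of the model.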

3. Cost Awareness

  • DeepSeek cuts cost through architectural innovation
  • A knowledge base can likewise be optimized with techniques such as caching and compression

Next Steps

  • [ ] Try the V4-Flash API
  • [ ] Test the 1M context capability
  • [ ] Study the open-source Engram implementation
  • [ ] Assess how much room there is to optimize the knowledge base architecture

📚 Glossary

Term      | Full Name
CSA       | Compressed Sparse Attention
HCA       | Heavily Compressed Attention
mHC       | Manifold-Constrained Hyper-Connections
OPD       | On-Policy Distillation
MoE       | Mixture of Experts
KV Cache  | Key-Value Cache