📅 2026-04-25 🔗 Technical Analysis 👤 常思杨

DeepSeek-V4 Deep Dive

Architecture Revolution in the Million-Token Era: Sparse Attention, Manifold Constraints, Storage-Compute Separation

DeepSeek-V4 1M Context MoE Sparse Attention Architectural Innovation

🎯 Key Takeaways

1. Million-Token Context Becomes Standard

The whole V4 series defaults to a 1M context (1 million tokens)
Maximum output of 384K tokens
Can read the entire Three-Body Problem trilogy in a single pass

2. A Victory for Architectural Efficiency

V4-Flash activates only 13B parameters yet outperforms V3.2, which activates 37B
At 1M context, the KV cache drops to 7% of V3.2's
The gain comes from architectural innovation, not from stacking parameters

3. Three Original Technologies

CSA+HCA: Hybrid Compressed Attention
mHC: Manifold-Constrained Hyper-Connections
Engram: Storage-Compute Separation Memory

📖 Main Content

I. Product Positioning: A Two-Tier Strategy

DeepSeek-V4 ships in two versions, a Pro/standard split similar to the tiering used for smartphone chips:

Spec	V4-Pro	V4-Flash
Total Params	1.6T	284B
Active Params	49B	13B
Context	1M	1M
Max Output	384K	384K
Price (Output)	24 yuan per million tokens	2 yuan per million tokens

Pricing Strategy:

V4-Flash: extreme cost-performance; on par with Pro on simple tasks
V4-Pro: flagship performance, close to Opus 4.6 in non-thinking mode
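To make the price gap concrete, the short sketch below works out the output cost of a maximum-length response on each tier. It uses only the two output prices from the spec table; the function name is made up for illustration, "384K" is assumed to mean 384,000 tokens, and input-token pricing is not covered because the article does not give it.

```python
# Output prices from the spec table, in yuan per million output tokens.
PRICE_PER_MILLION = {"V4-Pro": 24.0, "V4-Flash": 2.0}


def output_cost_yuan(model: str, output_tokens: int) -> float:
    """Cost in yuan of generating `output_tokens` tokens (input cost not included)."""
    return PRICE_PER_MILLION[model] * output_tokens / 1_000_000


MAX_OUTPUT = 384_000  # assumption: "384K" read as 384,000 tokens

for model in PRICE_PER_MILLION:
    cost = output_cost_yuan(model, MAX_OUTPUT)
    print(f"{model}: {cost:.3f} yuan for a maximum-length output")

# Ratio between the two tiers for the same amount of output.
print(PRICE_PER_MILLION["V4-Pro"] / PRICE_PER_MILLION["V4-Flash"])
```

At these list prices the Pro tier costs 12 times as much per output token as Flash, which is the trade-off the two-tier positioning is built around.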

II. Three Major Architectural Innovations

2.1 CSA + HCA: Hybrid Compressed Attention

Traditional Attention Bottleneck
┌──────────────────────────────────────────────────────────┐
│ O(n²) complexity: doubling the sequence length           │
│ quadruples the computation                               │
│                                                          │
│ Computation at 1M tokens is 64× that at 128K tokens      │
└──────────────────────────────────────────────────────────┘
                             ↓
┌──────────────────────────────────────────────────────────┐
│ V4 solution: stop attending to every token equally;      │
│ learn to pick out the important ones                     │
└──────────────────────────────────────────────────────────┘
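The 4× and 64× figures follow directly from the quadratic term. The sketch below reproduces them, assuming "1M" means 1024K tokens (with 1,000,000 versus 128,000 tokens the ratio would be about 61 rather than exactly 64). The helper name is illustrative only.

```python
def attention_cost_ratio(long_len: int, short_len: int) -> float:
    """Ratio of O(n^2) attention cost between two sequence lengths."""
    return (long_len / short_len) ** 2


K = 1024  # assumption: "K" and "M" are powers of two here

# Doubling the sequence length quadruples the cost.
print(attention_cost_ratio(2 * 128 * K, 128 * K))  # 4.0

# Going from 128K to 1M (1024K) is an 8x longer sequence, so 8^2 = 64x the cost.
print(attention_cost_ratio(1024 * K, 128 * K))  # 64.0
```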

CSA (Compressed Sparse Attention):

Compresses the KV of every 4 tokens into 1 entry
Uses the Lightning Indexer to sparsely select the top-k entries for full computation
Keeps a 128-token sliding window to preserve nearby detail

HCA (Heavily Compressed Attention):

More aggressive: compresses every 128 tokens into 1 entry
No sparsity; attention over the compressed entries stays dense
Handles global semantics over very long distances
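The two compression ratios and the top-k selection described above can be sketched in a few lines of NumPy. This is a toy illustration under loud assumptions, not DeepSeek's implementation: mean pooling stands in for the unspecified compression function, a plain dot-product score stands in for the Lightning Indexer, keys and values are merged into a single array, and all names and sizes are hypothetical.

```python
import numpy as np


def compress_kv(kv: np.ndarray, block: int) -> np.ndarray:
    """Mean-pool every `block` consecutive rows into one entry (pooling is an assumption)."""
    n, d = kv.shape
    usable = (n // block) * block  # drop a trailing partial block for simplicity
    return kv[:usable].reshape(-1, block, d).mean(axis=1)


def select_top_k(query: np.ndarray, entries: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest-scoring entries (dot product stands in for the indexer)."""
    scores = entries @ query
    k = min(k, len(entries))
    return np.sort(np.argpartition(scores, -k)[-k:])


rng = np.random.default_rng(0)
seq_len, dim = 1024, 16
kv = rng.standard_normal((seq_len, dim))
query = rng.standard_normal(dim)

csa_entries = compress_kv(kv, block=4)      # 1024 / 4   = 256 entries
hca_entries = compress_kv(kv, block=128)    # 1024 / 128 = 8 entries
csa_selected = select_top_k(query, csa_entries, k=32)  # sparse: attend to 32 of 256
local_window = kv[-128:]                    # last 128 raw tokens, uncompressed

print(csa_entries.shape, hca_entries.shape, csa_selected.shape, local_window.shape)
```

The point of the sketch is the bookkeeping: CSA shrinks the candidate set by 4× and then attends sparsely, HCA shrinks it by 128× and can afford to attend densely, and the sliding window keeps recent tokens at full resolution.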

Efficiency Comparison	V4-Pro	V4-Flash
Inference FLOPs	27% of V3.2	10% of V3.2
KV Cache	10% of V3.2	7% of V3.2
vs GQA8 Baseline	KV cache is only 2% of the baseline

2.2 mHC: Manifold-Constrained Hyper-Connections

mHC Core Idea
Traditional residual connections → the signal is amplified exponentially as it passes through the layers
mHC → constrains the residual mapping matrix to the manifold of doubly stochastic matrices
Result: spectral norm ≤ 1, which guarantees non-expansiveness

Problem solved: deep networks are prone to "signal explosion" during training, where abnormal gradients cause training to collapse.

Core Idea:

Project onto the manifold with the Sinkhorn-Knopp algorithm
Spectral norm ≤ 1, which guarantees non-expansiveness
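A minimal sketch of the Sinkhorn-Knopp step is shown below, to illustrate why a doubly stochastic mixing matrix cannot amplify the signal. It assumes a small square matrix of unconstrained logits and iterates to convergence; how mHC parameterizes the matrix and how many iterations it runs are not stated in this article, so those details are assumptions. The non-expansion property follows from the standard bound ‖A‖₂ ≤ sqrt(‖A‖₁ · ‖A‖∞), which equals 1 when every row and column sums to 1.

```python
import numpy as np


def sinkhorn_knopp(logits: np.ndarray, max_iters: int = 10_000, tol: float = 1e-12) -> np.ndarray:
    """Alternately normalize rows and columns of exp(logits) until it is doubly stochastic."""
    m = np.exp(logits - logits.max())  # strictly positive entries
    for _ in range(max_iters):
        m = m / m.sum(axis=1, keepdims=True)  # make rows sum to 1
        m = m / m.sum(axis=0, keepdims=True)  # make columns sum to 1
        if np.abs(m.sum(axis=1) - 1.0).max() < tol:
            break
    return m


rng = np.random.default_rng(0)
mix = sinkhorn_knopp(rng.standard_normal((4, 4)))

print(mix.sum(axis=0), mix.sum(axis=1))  # both close to all-ones
print(np.linalg.norm(mix, ord=2))        # spectral norm, close to 1

# Stacking many such mixing matrices (one per layer) still cannot blow up the signal.
deep = np.eye(4)
for _ in range(60):
    deep = sinkhorn_knopp(rng.standard_normal((4, 4))) @ deep
print(np.linalg.norm(deep, ord=2))
```

By contrast, an unconstrained mixing matrix with spectral norm even slightly above 1 would compound across 60 layers, which is the "signal explosion" this section describes.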

Result:

Signal gain drops from 3000× to 1.6×
Training stability improves substantially
Training time increases by only 6.7%

2.3 Muon Optimizer

Replacement:

Most modules switch from AdamW to Muon
The core is matrix orthogonalization via Newton-Schulz iteration
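The orthogonalization step can be illustrated with the classic cubic Newton-Schulz iteration, X ← 1.5·X − 0.5·X·Xᵀ·X. This is a sketch of the general technique only: Muon implementations typically use a tuned higher-order polynomial and just a few iterations, and the exact variant used in V4 is not given here, so the coefficients, iteration count, and example matrix below are assumptions.

```python
import numpy as np


def newton_schulz_orthogonalize(g: np.ndarray, n_iters: int = 30) -> np.ndarray:
    """Push all singular values of `g` toward 1 using only matrix multiplications."""
    # Dividing by the Frobenius norm bounds every singular value by 1,
    # which is inside the convergence region of the cubic iteration.
    x = g / np.linalg.norm(g)
    for _ in range(n_iters):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x


# A small full-rank "gradient" matrix, chosen by hand for the example.
grad = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 3.0],
    [0.0, 2.0, 1.0],
])

update = newton_schulz_orthogonalize(grad)
print(np.round(update.T @ update, 6))  # approximately the 3x3 identity
```

The result has the same singular vectors as the input but singular values near 1, so every direction of the update gets a comparable step size. Because the loop needs no SVD, only matrix products, it maps well onto accelerators.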

Result: convergence is 20-30% faster

III. Engram: Storage-Compute Separation Memory

📌 Breakthrough revealed in the February paper: DeepSeek published the Engram paper back in February, and V4 is the first flagship model to deploy it

The Traditional Problem

Traditional Transformer "Memory" Dilemma
┌──────────────────────────────────────────────────────────┐
│ The model does both "memory" and "computation"           │
│                                                          │
│ Problem: "Peng Yuyan" appears 88 times → computed 88×    │
│                                                          │
│ Every inference runs the full set of parameters          │
└──────────────────────────────────────────────────────────┘

The Engram Solution

Engram Dual-Path Architecture
┌──────────────────────────────────────────────────────────────┐
│ Path 1: static knowledge "dictionary" (Engram memory table)  │
│   → 25% of parameters stored in DRAM                         │
│   → O(1) hashed N-gram retrieval                             │
├──────────────────────────────────────────────────────────────┤
│ Path 2: dynamic reasoning "brain" (MoE expert network)       │
│   → 75% of parameters focus on complex reasoning             │
└──────────────────────────────────────────────────────────────┘
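The "O(1) hashed N-gram retrieval" on path 1 can be pictured as a plain table lookup keyed by a hash of the most recent tokens. The sketch below is a toy version under stated assumptions: the table size, embedding width, N-gram length, and hash function are all hypothetical, and a NumPy array stands in for the DRAM-resident memory table. It is not the Engram paper's actual design.

```python
import numpy as np

TABLE_SIZE = 4096  # hypothetical number of slots
EMBED_DIM = 8      # hypothetical embedding width
NGRAM = 2          # hypothetical N-gram length

rng = np.random.default_rng(0)
# Stands in for the static memory table that would live in host DRAM.
memory_table = rng.standard_normal((TABLE_SIZE, EMBED_DIM))


def ngram_slot(token_ids: tuple) -> int:
    """Deterministic polynomial hash of an N-gram into a table slot."""
    h = 0
    for t in token_ids:
        h = (h * 1_000_003 + t) % TABLE_SIZE
    return h


def lookup(token_ids: list) -> np.ndarray:
    """One table row per position, keyed by the N-gram ending at that position."""
    rows = []
    for i in range(len(token_ids)):
        ngram = tuple(token_ids[max(0, i - NGRAM + 1): i + 1])
        rows.append(memory_table[ngram_slot(ngram)])  # constant-time retrieval
    return np.stack(rows)


tokens = [5, 7, 5, 7]  # the bigram (5, 7) occurs twice
memory = lookup(tokens)
print(memory.shape)                          # (4, 8)
print(np.array_equal(memory[1], memory[3]))  # True: same N-gram, same row, no recomputation
```

This mirrors the "appears 88 times → computed 88×" example: a repeated N-gram costs one hash and one memory read each time, instead of another pass through the network's compute path.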

Result:

Metric	Result	Gain
Needle-in-a-Haystack Accuracy	84.2% → 97%	+12.8 percentage points
Throughput Loss	<3%
GPU Memory Reduction	90%

IV. Post-Training: OPD Multi-Teacher Distillation

V4 Two-Stage Approach

Stage 1: Separate Expert Training
┌─────────────┐ ┌─────────────┐ ┌────────────────┐
│ Code Expert │ │ Math Expert │ │ Writing Expert │
└──────┬──────┘ └──────┬──────┘ └───────┬────────┘
       │               │                │
       └───────────────┼────────────────┘
                       ↓
Stage 2: Distillation and Merging
┌──────────────────────────────────────────────┐
│ OPD (On-Policy Distillation)                 │
└──────────────────────┬───────────────────────┘
                       ↓
           ┌───────────────────────┐
           │ Unified Student Model │
           └───────────────────────┘

Problem with the traditional approach: SFT+RLHF mixed training, where domains pull against each other

V4 Two-Stage Approach:

Stage 1: ten-plus experts are each trained to their limit with SFT+GRPO, without interfering with one another
Stage 2: OPD distills them into a single student, merging the capabilities of these experts
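One common way to formulate on-policy distillation is to let the student generate a sequence, have a teacher score the same positions, and minimize the per-token reverse KL divergence KL(student ‖ teacher). The sketch below shows only that loss on random toy logits. Routing each sample to a teacher by domain, the array shapes, and every name here are assumptions for illustration; the article does not specify V4's exact objective.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def reverse_kl_per_token(student_logits: np.ndarray, teacher_logits: np.ndarray) -> np.ndarray:
    """KL(student || teacher) at each position of a student-generated sequence."""
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    return np.sum(np.exp(log_p_s) * (log_p_s - log_p_t), axis=-1)


rng = np.random.default_rng(0)
seq_len, vocab = 6, 10

# Toy logits at the positions of a sequence sampled from the student (the on-policy part).
student_logits = rng.standard_normal((seq_len, vocab))

# Hypothetical per-domain teachers scoring the same positions.
teacher_logits = {
    "code": rng.standard_normal((seq_len, vocab)),
    "math": rng.standard_normal((seq_len, vocab)),
}

domain = "code"  # assumption: each prompt is routed to the matching expert teacher
loss = reverse_kl_per_token(student_logits, teacher_logits[domain]).mean()
print(loss)  # positive; it reaches 0 only when the student matches the teacher exactly
```

Because the loss is evaluated on the student's own samples, the student is corrected on the states it actually visits, which is the usual argument for on-policy over off-policy distillation.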

V. Performance Evaluation

Benchmark	V4-Pro	Claude Opus 4.6	GPT-5.4
SimpleQA	57.9%	46.2%	45.3%
Codeforces	3206	-	-
Agentic Coding	Best among open-source models

Internal test (30 real-world bugs):

V4-Pro: 67%
Sonnet 4.5: 47%
Close to Opus 4.6 in non-thinking mode

VI. Deep Optimization of Agent Capabilities

Agent products with dedicated adaptation:

Claude Code
OpenClaw
OpenCode
CodeBuddy

Capability improvements:

Both coding tasks and document-generation tasks improve
Multi-turn tool calling is more stable
Coherence on long-chain tasks is stronger

VII. Deep Adaptation for Huawei Ascend

🚀 An unprecedented partnership: for the first time, DeepSeek has written Huawei Ascend into its technical report

Supported across the full Huawei Ascend supernode lineup
Ascend 950 ships in volume in the second half of the year
Day 0 adaptation (available on release day)

Performance	V4-Pro	V4-Flash
Low Latency	20ms	10ms
Est. Utilization	85%+

VIII. Industry Impact

1. A Milestone for Open-Source Models

First time an open-source model leads closed-source models on knowledge QA
Agent capabilities are the best among open-source models
Reasoning performance is on par with top closed-source models

2. A Cost Revolution

V4-Flash output costs 2 yuan per million tokens
Roughly 1/50 of the GPT-5.5 price
Long-context cost falls to one tenth

3. A Paradigm of Architectural Innovation

A shift from "brute-force parameter stacking" to "refined engineering"
Triple innovation: mHC+Engram+MoE
Offers a new path for the post-Scaling Law era

💭 Reflections and Practice

Takeaways for 看宝AI

1. Knowledge Base Retrieval Optimization

Borrow Engram's storage-compute separation idea
Consider separating hot and cold data
Keep hot knowledge in memory and cold knowledge on disk
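The hot/cold split suggested above can be prototyped with a small least-recently-used cache in front of a slower store. The sketch below is a minimal illustration only: the class name and capacity are hypothetical, and a plain dict stands in for the on-disk tier so the example stays self-contained.

```python
from collections import OrderedDict


class TieredStore:
    """Small in-memory LRU "hot" tier in front of a larger "cold" tier."""

    def __init__(self, hot_capacity: int, cold: dict):
        self.hot = OrderedDict()   # hot tier, ordered from least to most recently used
        self.cold = cold           # stands in for disk or a remote store
        self.hot_capacity = hot_capacity
        self.cold_reads = 0        # how often the slow path was taken

    def get(self, key: str) -> str:
        if key in self.hot:
            self.hot.move_to_end(key)  # mark as most recently used
            return self.hot[key]
        value = self.cold[key]         # slow path: fetch from the cold tier
        self.cold_reads += 1
        self.hot[key] = value          # promote to the hot tier
        if len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)  # evict the least recently used entry
        return value


store = TieredStore(hot_capacity=2, cold={"a": "doc A", "b": "doc B", "c": "doc C"})
for key in ["a", "b", "a", "c", "a"]:
    store.get(key)

print(list(store.hot))   # ['c', 'a']: "b" was evicted as least recently used
print(store.cold_reads)  # 3: one cold read each for a, b, c; repeated "a" hits the hot tier
```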

2. Long-Context Processing

A 1M context makes retrieval over the entire knowledge base feasible
An Agent can be designed around "knowledge base as context"

3. Cost Awareness

DeepSeek cuts cost through architectural innovation
A knowledge base can likewise be optimized with techniques such as caching and compression

Next Steps

[ ] Try the V4-Flash API
[ ] Test the 1M context capability
[ ] Study open-source implementations of Engram
[ ] Evaluate the room for optimizing the knowledge base architecture

🔗 Related Links

📚 Glossary

Term	Full Name
CSA	Compressed Sparse Attention
HCA	Heavily Compressed Attention
mHC	Manifold-Constrained Hyper-Connections
OPD	On-Policy Distillation
MoE	Mixture of Experts
KV Cache	Key-Value Cache