An In-Depth Study of Huawei's Pangu Ultra MoE and the Domestic Compute Breakthrough

🎯 Key Takeaways

#	Key point	Why it matters
1	Pangu Ultra MoE: 718B total / 39B activated parameters, a ratio of about 18:1	Shows how MoE achieves "large model, small computation"
2	DSSN + TinyInit: gradient spike rate reduced by 51% (relative)	Addresses training stability for ultra-large models
3	EP-Group load balancing: balances efficiency with expert specialization	A key optimization for distributed training
4	MTP multi-head extension: speculative-decoding acceptance length up by about 38%	An engineering practice for faster inference
5	End-to-end training on Ascend NPUs: a domestic compute breakthrough	Validates the feasibility of domestic AI infrastructure

1. MoE Architecture Basics and Pangu Ultra MoE Overview

Mixture-of-Experts (MoE) is a sparse architecture. Where a traditional dense model uses all of its parameters for every token, an MoE model dynamically activates only a subset of experts chosen according to the input. This conditional computation lets the total parameter count grow very large while the computation actually performed stays comparatively low.
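The conditional computation described above can be sketched as a top-k router. This is a minimal illustration in plain Python and not Pangu Ultra MoE's actual router: the names `softmax` and `route_top_k`, the toy logits, the toy sizes, and the choice to renormalize the selected gate weights are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, gate_weight) pairs. Gate weights are
    renormalized over the selected experts (an assumption of this sketch).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Toy example: 8 experts, 2 active per token (hypothetical sizes).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
selected = route_top_k(logits, k=2)
```

Only the experts in `selected` would run their feed-forward computation for this token; the remaining experts contribute parameters to the model but no compute to this step.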

Pangu Ultra MoE, released by Huawei in 2025, is a near-trillion-parameter MoE model trained end to end on Ascend NPUs. Its core parameters are as follows:

Parameter	Value	Description
Total parameters	718B	Near-trillion scale
Activated parameters	39B	Parameters actually used per token
Routed experts	256	Number of expert networks
Experts activated per token	8	Sparse activation strategy
Total-to-activated ratio	About 18:1	718B total / 39B activated

This ratio means that the total parameter count is about 18 times the activated parameter count, while the compute cost of inference is roughly that of a 39B-parameter model. This is the core value of the MoE architecture: a large increase in model capacity without a proportional increase in compute.
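As a quick check of the arithmetic, the sketch below computes the ratios directly from the figures quoted in the table. The variable names are illustrative. The parameter ratio and the expert ratio differ because, in a typical MoE transformer, the activated parameters also include components every token uses, such as attention.

```python
total_params_b = 718       # total parameters, in billions
active_params_b = 39       # activated parameters per token, in billions
routed_experts = 256
experts_per_token = 8

# Ratio of total to activated parameters (about 18.4).
param_ratio = total_params_b / active_params_b

# Ratio of routed experts to experts used per token (32.0).
expert_ratio = routed_experts / experts_per_token

# Fraction of routed experts active for any one token (0.03125).
active_expert_fraction = experts_per_token / routed_experts
```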

2. Model Architecture Design

2.1 Ultra-Scale and Ultra-High Sparsity

Pangu Ultra MoE uses 256 routed experts and activates 8 of them per token. This keeps model capacity high while sparse activation bounds the compute cost. By comparison, GPT-4 is rumored to use 16 experts with 2 active per token. Pangu Ultra MoE is sparser (8 of 256 experts, about 3%, against a rumored 2 of 16, or 12.5%) and relies on careful load-balancing design to avoid losing efficiency.

2.2 Multi-head Latent Attention (MLA)

MLA, an attention mechanism first introduced in DeepSeek-V2 and adopted by Pangu Ultra MoE, compresses the KV cache through low-rank decomposition, which eases the memory-bandwidth bottleneck during inference. Compared with conventional Grouped Query Attention (GQA), MLA strikes a better balance between compression ratio and model quality. This matters a great deal for inference at the 718B scale.
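To make the cache saving concrete, the sketch below counts how many values must be cached per token per layer under three schemes: full multi-head attention, GQA, and a latent-compressed cache in the style of MLA, where keys and values are rebuilt from a small latent vector by up-projection. All dimensions are hypothetical and chosen only for illustration; they are not Pangu Ultra MoE's configuration, and the optional `rope_dim` term assumes a separately cached positional key component.

```python
def kv_cache_values_per_token(num_kv_heads, head_dim):
    """Values cached per token per layer when keys and values are stored per head."""
    return 2 * num_kv_heads * head_dim  # one key and one value vector per KV head

def latent_cache_values_per_token(latent_dim, rope_dim=0):
    """Values cached per token per layer when only a compressed latent is stored."""
    return latent_dim + rope_dim

# Hypothetical dimensions, for illustration only.
mha = kv_cache_values_per_token(num_kv_heads=64, head_dim=128)     # 16384
gqa = kv_cache_values_per_token(num_kv_heads=8, head_dim=128)      # 2048
mla = latent_cache_values_per_token(latent_dim=512, rope_dim=64)   # 576
```

Under these assumed sizes the latent cache is smaller than the GQA cache, at the price of extra projection work when keys and values are reconstructed.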

2.3 Multi-Token Prediction (MTP)

MTP is the key technique behind multi-token speculative decoding. Pangu Ultra MoE's contribution is a "single-head training, multi-head extension" strategy: early training uses a single MTP head to control cost, and late in training the head is extended to a multi-head structure, which provides multi-token speculative decoding.

2.4 Ascend-Affinity Design

Design element	Choice	Rationale
Hidden dimension	7680	Matches the 16×16 MatMul unit of the DaVinci chip (7680 = 480 × 16)
Number of layers	61	Suits pipeline-parallel scheduling
Number of routed experts	256 (2^8)	Improves All-to-All communication efficiency

3. Training Stability Breakthrough: DSSN and TinyInit

3.1 Problem Background

Training ultra-large MoE models is extremely challenging. Frequent spikes in the gradient norm have become a major bottleneck for convergence efficiency and model quality. Spikes destabilize training and, in severe cases, prevent the model from converging at all.

3.2 DSSN (Depth-Scaled Sandwich-Norm)

DSSN adds an extra layer normalization after the output of each sub-layer and introduces a depth-scaled initialization. Together these stabilize the output scale of every layer, which suppresses abnormal gradients and reduces fluctuation of the gradient norm. Intuitively, DSSN acts as a stabilizer between layers that keeps gradients from exploding or vanishing as they propagate.
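The structure described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the normalization has no learned bias, and the post-norm gain is assumed to be initialized to c / sqrt(L) for a model with L layers.

```python
import math

def layer_norm(x, gain=1.0, eps=1e-6):
    """Normalize a vector to zero mean and unit variance, then scale by `gain`."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gain * (v - mean) / math.sqrt(var + eps) for v in x]

def depth_scaled_gain(num_layers, c=1.0):
    """Initial post-norm gain that shrinks with depth (assumed form: c / sqrt(L))."""
    return c / math.sqrt(num_layers)

def sandwich_sublayer(x, sublayer, post_gain):
    """Residual block with normalization before and after the sub-layer."""
    h = layer_norm(x)                     # usual pre-norm
    h = sublayer(h)                       # attention or feed-forward, any callable
    h = layer_norm(h, gain=post_gain)     # extra norm on the sub-layer output
    return [a + b for a, b in zip(x, h)]  # residual connection

x = [1.0, -2.0, 3.0, 0.5]
gain = depth_scaled_gain(num_layers=61)
out = sandwich_sublayer(x, lambda h: [10.0 * v for v in h], gain)
branch = [o - v for o, v in zip(out, x)]
branch_rms = math.sqrt(sum(b * b for b in branch) / len(branch))
```

In this sketch the residual branch has a scale set by `gain` no matter how large the sub-layer's raw output is, which is the stabilizing effect the extra normalization is meant to provide.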

3.3 TinyInit

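TinyInit proposes an initialization scheme with standard deviation $\sqrt{1/(2dL)}$, where $d$ is the hidden dimension and $L$ is the number of layers, so the scale accounts for both the width and the depth of the model. Training a neural network is at bottom an optimization process, and the quality of the initialization directly affects how hard that optimization is.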
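For a sense of scale, the sketch below evaluates the formula with the hidden dimension (7680) and layer count (61) quoted earlier in this document, then draws a small weight matrix from a zero-mean normal distribution with that standard deviation. The function names and the 4×4 matrix size are illustrative assumptions.

```python
import math
import random

def tiny_init_std(hidden_dim, num_layers):
    """Standard deviation sqrt(1 / (2 * d * L))."""
    return math.sqrt(1.0 / (2.0 * hidden_dim * num_layers))

def init_weights(rows, cols, std, seed=0):
    """Draw a rows x cols matrix from N(0, std^2) with a fixed seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

std = tiny_init_std(hidden_dim=7680, num_layers=61)   # roughly 1.03e-3
weights = init_weights(4, 4, std)
```

Because both d and L sit in the denominator, a wider or deeper model starts from smaller weights.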

Effect verification:
DSSN + TinyInit reduced the gradient spike rate from 1.54% to 0.76%, a relative decrease of about 51%, and supported long, stable training on more than 10T tokens.
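The source does not define how a spike is counted, so the sketch below uses a hypothetical criterion: a step is a spike when its gradient norm exceeds a multiple of the mean of the preceding few norms. It also reproduces the relative decrease from the two reported rates. The window size, threshold, and sample norms are assumptions.

```python
def spike_rate(grad_norms, window=4, threshold=2.0):
    """Fraction of steps whose gradient norm exceeds `threshold` times
    the mean of the preceding `window` norms (a hypothetical criterion)."""
    spikes = 0
    counted = 0
    for step in range(window, len(grad_norms)):
        baseline = sum(grad_norms[step - window:step]) / window
        counted += 1
        if grad_norms[step] > threshold * baseline:
            spikes += 1
    return spikes / counted if counted else 0.0

def relative_decrease(before, after):
    """Relative reduction from `before` to `after`."""
    return (before - after) / before

# Made-up gradient norms with one obvious spike.
norms = [1.0, 1.1, 0.9, 1.0, 1.0, 5.0, 1.0, 1.1, 0.9, 1.0]
rate = spike_rate(norms)

# Reported spike rates, in percent: (1.54 - 0.76) / 1.54 is about 0.51.
reported = relative_decrease(1.54, 0.76)
```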

4. Expert Load Balancing: EP-Group Auxiliary Loss

4.1 Problem Background

Expert load imbalance arises easily when training MoE models. Under expert parallelism (EP), imbalance hurts computational efficiency: experts that receive too many tokens become the compute bottleneck while the others sit underutilized.

4.2 EP-Group Innovation

The EP-Group balancing loss designed by the Pangu team constrains how evenly tokens are distributed once all micro batches within an EP group have been routed to that group's experts. In effect, the load-balancing loss is computed jointly over all micro batches in the EP group: an individual micro batch may be imbalanced, as long as the tokens from the micro batches taken together are spread evenly across the experts.
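The difference between balancing per micro batch and balancing per group can be shown with a toy example. The source does not give the loss formula, so this sketch assumes a common auxiliary-loss form (number of experts times the sum over experts of token fraction times mean router probability) with top-1 routing; the function name and the data are hypothetical.

```python
def balance_loss(assignments, probs, num_experts):
    """Auxiliary balance loss: num_experts * sum_e f_e * p_e.

    assignments: expert index chosen for each token (top-1 for simplicity)
    probs: router probability vector for each token
    """
    n = len(assignments)
    f = [assignments.count(e) / n for e in range(num_experts)]          # token share
    p = [sum(row[e] for row in probs) / n for e in range(num_experts)]  # mean prob
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Two micro batches, each routed entirely to a different expert.
mb_a = ([0, 0, 0, 0], [[0.9, 0.1]] * 4)
mb_b = ([1, 1, 1, 1], [[0.1, 0.9]] * 4)

# Micro-batch-level loss penalizes each batch separately (about 1.8 each).
micro_losses = [balance_loss(a, p, 2) for a, p in (mb_a, mb_b)]

# Group-level loss pools the batches first and sees a balanced load (about 1.0).
group_loss = balance_loss(mb_a[0] + mb_b[0], mb_a[1] + mb_b[1], 2)
```

Each micro batch looks badly imbalanced on its own, yet the pooled routing is perfectly even, so the group-level loss sits at its minimum and leaves the router free to specialize experts by data type.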

Experiments show that the EP-Group balancing loss clearly outperforms the mainstream micro-batch-level loss on most tasks, with an average gain of 1.5 points. An expert-specialization analysis also shows that data from different domains select markedly different experts.

5. MTP Multi-Head Extension: Speculative Decoding Optimization

5.1 Speculative Decoding Principle

Speculative decoding is an effective way to speed up generation in large models. A lightweight auxiliary module first proposes several candidate tokens, and the main model then checks them with a fast verification step and decides which to accept. Because the candidates can be verified in parallel, the main model can emit several tokens per step.
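The accept-or-reject step can be sketched for the greedy case. This is a simplified illustration, not the verification scheme used by Pangu Ultra MoE: it assumes the main model has already produced, in one parallel pass, its own greedy choice at every draft position plus one extra position, and the function name `verify_draft` is hypothetical.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of drafted tokens.

    draft_tokens: tokens proposed by the lightweight draft module
    target_tokens: the main model's greedy choice at each draft position,
        plus one extra position (so its length is len(draft_tokens) + 1)
    Returns the tokens emitted in this step.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            break                      # first disagreement ends acceptance
        accepted.append(tok)
    # The main model always contributes one token: a correction at the
    # first mismatch, or a bonus token if every draft token was accepted.
    accepted.append(target_tokens[len(accepted)])
    return accepted

partial = verify_draft([5, 7, 9], [5, 7, 2, 4])   # draft wrong at position 2
full = verify_draft([5, 7, 9], [5, 7, 9, 4])      # every draft token accepted
```

The number of tokens emitted per step is what the acceptance length measures: the more often the draft agrees with the main model, the fewer main-model steps a sequence needs.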

5.2 Late-Stage Expansion Strategy

The Pangu team found that multi-token speculative decoding does not require multiple MTP heads from the start of training. Extending a single MTP head late in training achieves a similar effect.

Experimental results:
The acceptance length and latency of the model extended to two heads are essentially the same as those of a model trained with two heads from scratch
The two-head acceptance length is about 2.30; the single-head acceptance length is about 1.67
Two heads improve acceptance length by about 38% over one head

6. Domestic Computing Power Breakthrough

6.1 Full-Stack Ascend NPU Training

One of Pangu Ultra MoE's most notable results is that it was trained end to end on Ascend NPUs. The whole pipeline, from pre-training through post-training, ran on Huawei's own Ascend AI processors without relying on any foreign chips.

Metric	Value
Pre-training NPU scale	6k-10k NPUs
Training data	10+T tokens
Long-sequence capability	128k

6.2 V1.1 Major Upgrade

Capability	V1.0	V1.1	Change
Hallucination rate (fast thinking)	10.11%	3.85%	-6.26 percentage points (significantly reduced)
Tool calling (slow thinking)	55.8	68.0	+12.2
Math, fast thinking (AIME24)	56.25	66.04	+9.79
Code (slow thinking)	61.1	65.7	+4.6

7. Implications for Knowledge Bases

7.1 Knowledge Distillation and Model Compression

Pangu Ultra MoE's Int8 quantization practice (memory footprint halved, throughput up 20%) offers a reference for deploying models in knowledge-base systems. In retrieval scenarios, quantization can lower deployment cost as long as the accuracy loss stays within acceptable bounds.
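As a rough picture of what Int8 quantization does, the sketch below applies symmetric per-tensor quantization to a handful of weights and measures the round-trip error. It illustrates the general technique only and is not Pangu Ultra MoE's quantization scheme; the function names and sample values are assumptions. Storing each weight in 8 bits rather than 16 is what would halve weight memory.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map int8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 0.33, 0.91]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Worst-case round-trip error; it stays within half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Checking `max_error` against the accuracy a retrieval task can tolerate is one simple way to decide whether quantization is acceptable for a given deployment.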

7.2 Engineering Value of Fast/Slow Thinking

The dual fast/slow thinking modes supported by Pangu Ultra MoE suggest a pattern for knowledge-base interaction design:

Fast thinking: suited to simple queries, answers directly
Slow thinking: suited to complex questions, reasons over multiple steps

7.3 Domestic Deployment Reference

The open-source release of Pangu Ultra MoE on the GitCode platform, together with its adaptation to Ascend NPUs, offers a reference for deploying knowledge bases on domestic infrastructure. Given data security and related concerns, future knowledge bases could be deployed on domestic compute platforms.
