# Training DeepSeek 671b

Last updated: August 20, 2025.

verl integrates Megatron to support large MoE models such as `Qwen3-235B-A22B` and `deepseek-ai/DeepSeek-V3`. This is an ongoing community effort.

Along the way, the community has added the following features and optimizations that let verl scale to larger models:
- Per-tensor weight resharding between rollout and training
- Context parallelism and expert parallelism, enabled through Megatron
- Dynamic batch size (sequence balance) for Megatron
- Reduced Ray-related serialization overhead
- Optimizer offloading, recomputation, and efficient kernels
- Various debugging metrics and utilities
- Hybrid optimizer
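
To make the idea of a dynamic batch size (sequence balance) concrete, here is a minimal sketch of one way to group variable-length sequences into micro-batches under a token budget. It is an illustration only, not verl's or Megatron's implementation: the function name `pack_by_token_budget`, the `max_tokens` parameter, and the greedy first-fit strategy are all assumptions made for this example.

```python
from typing import List


def pack_by_token_budget(seq_lens: List[int], max_tokens: int) -> List[List[int]]:
    """Greedily group sequence indices into micro-batches of at most max_tokens tokens.

    Hypothetical helper for illustration; not part of verl or Megatron.
    """
    # Visit the longest sequences first so shorter ones can fill leftover room.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    batches: List[List[int]] = []
    loads: List[int] = []  # total tokens currently assigned to each micro-batch
    for i in order:
        n = seq_lens[i]
        if n > max_tokens:
            raise ValueError(f"sequence {i} has {n} tokens, above the budget {max_tokens}")
        # First fit: put the sequence into the first micro-batch that still has room.
        for b, load in enumerate(loads):
            if load + n <= max_tokens:
                batches[b].append(i)
                loads[b] += n
                break
        else:
            # No existing micro-batch has room, so open a new one.
            batches.append([i])
            loads.append(n)
    return batches


seq_lens = [1960, 300, 1200, 800, 150, 2000]
batches = pack_by_token_budget(seq_lens, max_tokens=2048)
for batch in batches:
    print(batch, sum(seq_lens[i] for i in batch))
```

With a fixed token budget instead of a fixed number of sequences, the number of sequences per micro-batch varies while the token count per micro-batch stays bounded.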

The Megatron backend now supports a wider list of models:
- DeepSeek-V3
- Moonlight
- Qwen3
- Qwen2.5-VL (to be merged soon)
- Qwen2
- Mixtral

## Getting Started

### Preparation
The recommended image, which ships with pre-built Megatron dependencies, is `verlai/verl:app-verl0.4-vllm0.8.5-mcore0.13.0-preview`. It is built from the Dockerfile at [docker/verl0.4-cu124-torch2.6-fa2.7.4/Dockerfile.app.vllm.mcore0.13.preview](https://github.com/volcengine/verl/blob/main/docker/verl0.4-cu124-torch2.6-fa2.7.4/Dockerfile.app.vllm.mcore0.13.preview).

The image is built on Hopper GPUs with DeepEP and does not support non-Hopper GPUs such as the A100. To use it on A100, you may need to reinstall DeepEP.

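With `OFFLOAD_FRACTION=1`, the minimum system requirements are lower: DeepSeek-V3 can run on as few as 96 H20 (96GB) GPUs, and Qwen3-235B-A22B on as few as 32 H20 (96GB) GPUs. However, this configuration uses 1.6TB of CPU memory per node. If CPU memory is insufficient or you need faster training, add more nodes.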

### DeepSeek 671b

For DeepSeek-V3 671b, please refer to [examples/grpo_trainer/run_deepseek671b_math_megatron_96gb.sh](https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_deepseek671b_math_megatron_96gb.sh).

MTP and quantization are disabled during RL training.

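To train your project, configure the following environment variables according to the number of available GPUs. These are recommended settings and can be adjusted for your specific hardware.

| Number of GPUs | NNODES | TP | PP | EP | OFFLOAD_FRACTION | OFFLOAD_OPTIM | LAST_LAYER |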
| -- | -- | -- | -- | -- | -- | -- | -- |
| 96 | 12 | 8 | 12 | 8 | 1. | False | 6 |
| 128 | 16 | 8 | 16 | 8 | 0.5 | True | 1 |
| 256 | 32 | 8 | 16 | 8 | 0. | True | 1 |
| 512 | 64 | 1 | 16 | 32 | 0 | True | 1 |
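
As a rough sketch of how a row of the table above could be turned into a launch command, the snippet below builds a shell command line that sets the environment variables before invoking the example script. It only prints the command and does not run it. The helper `launch_command` and the dictionary `DEEPSEEK_V3_SETTINGS` are hypothetical names introduced for this example, and the exact value formats the script accepts (for instance `1.0` versus `1.`) are an assumption; check the script itself before relying on it.

```python
import shlex

# Recommended rows copied from the DeepSeek-V3 table above, keyed by GPU count.
DEEPSEEK_V3_SETTINGS = {
    96: {"NNODES": 12, "TP": 8, "PP": 12, "EP": 8,
         "OFFLOAD_FRACTION": 1.0, "OFFLOAD_OPTIM": False, "LAST_LAYER": 6},
    128: {"NNODES": 16, "TP": 8, "PP": 16, "EP": 8,
          "OFFLOAD_FRACTION": 0.5, "OFFLOAD_OPTIM": True, "LAST_LAYER": 1},
    256: {"NNODES": 32, "TP": 8, "PP": 16, "EP": 8,
          "OFFLOAD_FRACTION": 0.0, "OFFLOAD_OPTIM": True, "LAST_LAYER": 1},
    512: {"NNODES": 64, "TP": 1, "PP": 16, "EP": 32,
          "OFFLOAD_FRACTION": 0.0, "OFFLOAD_OPTIM": True, "LAST_LAYER": 1},
}

SCRIPT = "examples/grpo_trainer/run_deepseek671b_math_megatron_96gb.sh"


def launch_command(num_gpus: int) -> str:
    """Return a shell command that sets the table's variables and runs the script.

    Hypothetical helper; assumes the script reads these names from the environment.
    """
    row = DEEPSEEK_V3_SETTINGS[num_gpus]
    assignments = " ".join(f"{name}={shlex.quote(str(value))}" for name, value in row.items())
    return f"{assignments} bash {SCRIPT}"


print(launch_command(96))
```

Keeping the rows in one dictionary makes it easy to switch GPU counts without editing the script by hand.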

### Qwen3 235b

For Qwen3-235b, please refer to [examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh](https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh).

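To train your project, configure the following environment variables according to the number of available GPUs. These are recommended settings and can be adjusted for your specific hardware.

| Number of GPUs | NNODES | TP | PP | EP | OFFLOAD_FRACTION | OFFLOAD_OPTIM | LAST_LAYER |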
| -- | -- | -- | -- | -- | -- | -- | -- |
| 32 | 4 | 4 | 8 | 4 | 1. | False | 6 |
| 64 | 8 | 4 | 8 | 4 | 0.5 | True | 6 |
| 128 | 16 | 4 | 8 | 4 | 0 | True | 6 |
| 256 | 32 | 4 | 8 | 4 | 0 | True | 6 |
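
When adjusting these settings for other hardware, a quick consistency check can help. The sketch below derives the data-parallel size from a row using the usual Megatron relation that the total GPU count equals TP × PP × CP × DP. It assumes 8 GPUs per node (inferred from the table, where 32 GPUs correspond to 4 nodes) and a context-parallel size of 1, since CP is not listed in the table. The function `data_parallel_size` is a hypothetical helper, and the check does not cover how EP is laid out.

```python
GPUS_PER_NODE = 8  # assumption inferred from the table: 32 GPUs / 4 nodes


def data_parallel_size(num_gpus: int, nnodes: int, tp: int, pp: int, cp: int = 1) -> int:
    """Return DP such that num_gpus == tp * pp * cp * DP, or raise if inconsistent."""
    if nnodes * GPUS_PER_NODE != num_gpus:
        raise ValueError(f"{nnodes} nodes x {GPUS_PER_NODE} GPUs != {num_gpus} GPUs")
    model_parallel = tp * pp * cp
    if num_gpus % model_parallel != 0:
        raise ValueError(f"TP*PP*CP = {model_parallel} does not divide {num_gpus} GPUs")
    return num_gpus // model_parallel


# (GPUs, NNODES, TP, PP) copied from the Qwen3-235B table above.
QWEN3_235B_ROWS = [(32, 4, 4, 8), (64, 8, 4, 8), (128, 16, 4, 8), (256, 32, 4, 8)]

dp_sizes = {gpus: data_parallel_size(gpus, nnodes, tp, pp)
            for gpus, nnodes, tp, pp in QWEN3_235B_ROWS}
for gpus, dp in dp_sizes.items():
    print(f"{gpus} GPUs -> DP = {dp}")
```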

### Benchmark
Below are some benchmark results for DeepSeek / Qwen3-235B. All configurations match the recommended settings for the given number of GPUs.

| Model | Number of GPUs | Mean response length | Rollout time (s) | GPU memory (GB) | CPU memory (GB) | MFU | Step time (s) |
| -- | -- | -- | -- | -- | -- | -- | -- |
| DeepSeek 671b | 96 | 1960 | 1050 | 66 | 1500 | 0.19 | 1700 |
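
As a small worked example of reading the row above, the snippet below computes how much of each step is spent in rollout. It assumes the reported step time includes the rollout time, which the table does not state explicitly.

```python
# Values copied from the DeepSeek 671b row of the benchmark table.
rollout_time_s = 1050
step_time_s = 1700

# Assumption: step time is the end-to-end time per step, rollout included.
rollout_share = rollout_time_s / step_time_s
non_rollout_time_s = step_time_s - rollout_time_s

print(f"rollout share of step time: {rollout_share:.1%}")
print(f"time outside rollout: {non_rollout_time_s} s")
```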

### Qwen3-30B-A3B MOE

For Qwen3-30b, please refer to [examples/grpo_trainer/run_qwen3moe-30b_megatron_96gb.sh](https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen3moe-30b_megatron_96gb.sh).

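To train your project, configure the following environment variables according to the number of available GPUs. These are recommended settings and can be adjusted for your specific hardware.

| Number of GPUs | NNODES | TP | PP | EP | OFFLOAD_FRACTION | OFFLOAD_OPTIM | MFU |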
| -- | -- | -- | -- | -- | -- | -- | -- |
| 8 | 1 | 1 | 1 | 8 | 1. | True | 0.4 |
| 16 | 2 | 1 | 1 | 8 | 1. | True | 0.37 |
| 32 | 4 | 1 | 1 | 8 | 1. | True | 0.31 |
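
To put the MFU column above in perspective, this short sketch expresses each configuration's MFU relative to the single-node (8 GPU) run. Choosing the 8-GPU row as the baseline is a decision made for this example, not something the benchmark prescribes.

```python
# MFU values copied from the Qwen3-30B-A3B table above, keyed by GPU count.
mfu_by_gpus = {8: 0.40, 16: 0.37, 32: 0.31}

baseline = mfu_by_gpus[8]  # single-node run used as the reference point
relative_mfu = {gpus: mfu / baseline for gpus, mfu in mfu_by_gpus.items()}

for gpus, ratio in relative_mfu.items():
    print(f"{gpus} GPUs: MFU {mfu_by_gpus[gpus]:.2f} ({ratio:.1%} of the 8-GPU MFU)")
```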

## Upcoming Optimizations

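The community continues to optimize large MoE models further. Ongoing work includes:
- Further reducing memory consumption and providing recommended/tuned configurations for various machine types
- Optimizing long-context RL training performance
- Improving performance with SGLang x Megatron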

We invite the community to try and improve verl together. Get in touch with us on [Slack](https://join.slack.com/t/verlgroup/shared_invite/zt-2w5p9o4c3-yy0x2Q56s_VlGLsJ93A6vA)/[WeChat](https://raw.githubusercontent.com/eric-haibin-lin/verl-community/refs/heads/main/WeChat.JPG)/[GitHub issues](https://github.com/volcengine/verl/issues/708)!

## Acknowledgments
@vermouth1992 @ISEEKYAN @ETOgaosion @yzlnew @ShareLer @BearBiscuit05 @ccclyu @ann-qin-lu @SwordFaith @zzong2006 @zhaochenyang20 @ocss884 @eric-haibin-lin @chenhaiq @techkang