verl Performance Tuning for AMD (ROCm Kernel)

Last updated: 04/25/2025.

Patch vLLM to Enable Sleep Mode for AMD GPUs

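By default, verl requires vLLM to enable sleep mode, which allows vLLM to offload GPU memory to CPU memory after rollout. However, this feature is still under review by the vLLM community.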

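To enable vLLM's sleep mode, you can first build vLLM from source using the community-patched code (from this pull request). Once the patch is merged into the vLLM main branch, you can install the latest version of vLLM directly.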

Clone the vLLM repository and build it with the following commands:

git clone -b sleep_amd https://github.com/HollowMan6/vllm.git
cd vllm
sudo ln -sf /opt/rocm/lib/libamdhip64.so /usr/lib/libamdhip64.so
VLLM_TARGET_DEVICE=rocm ROCM_PATH=/opt/rocm/ VLLM_GPU_LANG=HIP SETUPTOOLS_SCM_PRETEND_VERSION=0.8.4.dev python3 setup.py develop

In addition, make sure the ROCm version in your Docker image is greater than or equal to ROCm 6.3.4. We recommend ROCm 6.4.0 for better performance (see this comment).
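The version requirement above can also be checked in a script. The following is a minimal sketch and not part of verl or vLLM: the helper names `parse_rocm_version` and `meets_minimum` are hypothetical, and it assumes you already have the ROCm release as a string such as `6.3.4` or `6.4.0-47`, obtained in whatever way your image exposes it.

```python
import re

MIN_ROCM = (6, 3, 4)          # minimum version stated in this section
RECOMMENDED_ROCM = (6, 4, 0)  # recommended version stated in this section


def parse_rocm_version(text):
    """Extract (major, minor, patch) from a string such as '6.4.0-47'.

    A missing patch component is treated as 0 (an assumption of this sketch).
    """
    match = re.match(r"\s*(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized ROCm version string: {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def meets_minimum(text, minimum=MIN_ROCM):
    """Return True if the parsed version is at least `minimum`."""
    return parse_rocm_version(text) >= minimum
```

For example, `meets_minimum("6.3.3")` is false, while `meets_minimum("6.4.0", RECOMMENDED_ROCM)` is true. Tuples are compared element by element, which avoids the pitfalls of comparing version strings lexically.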

After upgrading, you can verify that sleep mode is enabled by running the following test code (from this comment).

import torch
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_sleep_mode=True)

def run_inference(prompt):
    # Generate a completion and print it next to its prompt.
    outputs = llm.generate(prompt)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


# Baseline: memory held after the model is loaded, before any inference runs.
print("CUDA Memory Usage (after model load):")
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated()=}")

run_inference("San Francisco is")
llm.sleep()

print("CUDA Memory Usage (after sleep):")
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated()=}")

llm.wake_up()

print("CUDA Memory Usage (after wakeup):")
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated()=}")

run_inference("Paris is")

If sleep mode is enabled, you should see memory usage decrease after sleep.
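To turn that visual check into a pass/fail result, you could compare the two `torch.cuda.memory_allocated()` readings taken before and after `llm.sleep()`. The sketch below is illustrative only: `freed_fraction` and `sleep_mode_effective` are hypothetical helpers, and the 0.5 threshold is an arbitrary assumption rather than a value taken from verl or vLLM.

```python
def freed_fraction(bytes_before_sleep, bytes_after_sleep):
    """Fraction of the pre-sleep allocation released by sleeping."""
    if bytes_before_sleep <= 0:
        raise ValueError("bytes_before_sleep must be positive")
    return (bytes_before_sleep - bytes_after_sleep) / bytes_before_sleep


def sleep_mode_effective(bytes_before_sleep, bytes_after_sleep, min_fraction=0.5):
    """Return True if at least `min_fraction` of the memory was released.

    `min_fraction` is a hypothetical threshold; tune it for your setup.
    """
    return freed_fraction(bytes_before_sleep, bytes_after_sleep) >= min_fraction
```

In the test script above, you would record one reading before `llm.sleep()` and one after it, then pass both to `sleep_mode_effective`.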

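After applying the vLLM patch and completing the installation, you can enable sleep mode in verl to reduce memory overhead. This allows verl to offload unused GPU memory during rollout, which significantly lowers the memory footprint during long-context training or multi-node reinforcement learning.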

Enable CUDA Graph and Bypass ROCm-related Issues

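Due to potential CUDA graph capture issues in ROCm, we have found that vLLM's CUDA graph feature cannot be enabled in multi-node vLLM V1 mode when running verl on AMD platforms. This leads to significantly slower rollout performance.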

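Our investigation shows that ROCm may trigger an unexpected crash when attempting to capture large batches with CUDA graph. One workaround is to modify the LLM configuration (from this commit).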

self.inference_engine = LLM(
    model=model_path,
    enable_sleep_mode=True,
    tensor_parallel_size=tensor_parallel_size,
    distributed_executor_backend="external_launcher",
    dtype=config.dtype,
    enforce_eager=config.enforce_eager,
    gpu_memory_utilization=config.gpu_memory_utilization,
    disable_custom_all_reduce=True,
    disable_mm_preprocessor_cache=True,
    limit_mm_per_prompt=limit_mm_per_prompt,
    skip_tokenizer_init=False,
    max_model_len=max_model_len,
    load_format=load_format,
    disable_log_stats=config.disable_log_stats,
    max_num_batched_tokens=max_num_batched_tokens,
    enable_chunked_prefill=config.enable_chunked_prefill,
    enable_prefix_caching=True,
    trust_remote_code=trust_remote_code,
    # Restrict the CUDA graph capture sizes to bypass OOM on ROCm.
    # Adjust the list to match your GPU memory size.
    compilation_config={"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64]},
    seed=config.get('seed', 0),
)
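The capture sizes in the configuration above are the powers of two up to 64. If you need a different upper bound for your GPU memory size, a list of the same shape can be generated rather than typed by hand. This is a minimal sketch: `power_of_two_capture_sizes` is a hypothetical helper, not a vLLM or verl function, and whether power-of-two spacing suits your workload is an assumption.

```python
def power_of_two_capture_sizes(max_batch_size):
    """Return [1, 2, 4, ...] up to the largest power of two <= max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    sizes = []
    size = 1
    while size <= max_batch_size:
        sizes.append(size)
        size *= 2
    return sizes


# Reproduces the list used in the configuration above.
compilation_config = {"cudagraph_capture_sizes": power_of_two_capture_sizes(64)}
```

A smaller bound, for instance `power_of_two_capture_sizes(16)`, captures fewer and smaller graphs, which may help if capture still fails at the sizes shown above.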

Then, you can choose to enable CUDA graph by setting the following configuration option in your launch command (see this page):

actor_rollout_ref.rollout.enforce_eager=False \