Frequently Asked Questions
====================================

Last updated: September 24, 2025.

Ray related
------------

How to add breakpoints for debugging with distributed Ray?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Please check out Ray's official `debugging guide <https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html>`_.


"Unable to register worker with raylet"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

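This issue is caused by certain system settings. For example, SLURM may restrict how CPUs are shared on a node.
When ``ray.init()`` tries to launch as many worker processes as the machine has CPU cores,
these SLURM restrictions can prevent the ``core-workers`` from seeing the ``raylet`` process, which triggers the error.

To work around this issue, set the config option ``ray_init.num_cpus`` to a number of CPUs that your system allows.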
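
One way to pick a value for this option is to ask the operating system how many CPUs the current job may actually use. The following is a minimal sketch, not part of verl: the helper name ``allowed_cpu_count`` is hypothetical, and it assumes that Slurm exports ``SLURM_CPUS_PER_TASK`` when ``--cpus-per-task`` was requested.

.. code:: python

    import os


    def allowed_cpu_count(env=None):
        """Return a CPU count that the current job is likely allowed to use."""
        env = os.environ if env is None else env
        # Assumption: Slurm sets SLURM_CPUS_PER_TASK to the per-task CPU allocation.
        slurm_cpus = env.get("SLURM_CPUS_PER_TASK")
        if slurm_cpus and slurm_cpus.isdigit() and int(slurm_cpus) > 0:
            return int(slurm_cpus)
        # On Linux, the affinity mask reflects CPU pinning restrictions, whereas
        # os.cpu_count() reports every core on the machine.
        if hasattr(os, "sched_getaffinity"):
            return len(os.sched_getaffinity(0))
        return os.cpu_count() or 1


    if __name__ == "__main__":
        # Print an override that can be appended to the training command.
        print(f"ray_init.num_cpus={allowed_cpu_count()}")

If the printed number is still rejected, a smaller value may be needed, since site-specific limits are not always visible through these interfaces.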

Distributed training
------------------------

How to run multi-node post-training with Ray?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

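You can start a Ray cluster and submit a Ray job by following Ray's official guide on `starting a Ray cluster <https://docs.ray.io/en/latest/ray-core/starting-ray.html>`_.

Then, in the configuration, set ``trainer.nnodes`` to the number of machines required by your job.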
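
Before setting the node count, it can help to check how many nodes have actually joined the cluster. The sketch below is an illustration rather than verl code: the helper names are hypothetical, and it assumes that each entry returned by ``ray.nodes()`` is a dict with an ``Alive`` field.

.. code:: python

    def count_alive_nodes(nodes):
        """Count entries marked alive in a list shaped like the result of ray.nodes()."""
        return sum(1 for node in nodes if node.get("Alive"))


    def detect_nnodes():
        """Connect to an already running Ray cluster and count its alive nodes."""
        import ray  # imported lazily so count_alive_nodes works without Ray installed

        ray.init(address="auto")
        return count_alive_nodes(ray.nodes())

Calling ``detect_nnodes()`` on the head node after all workers have joined should return the value to use for the node-count option above.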

How to use verl on a Slurm-managed cluster?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

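Ray provides `an official tutorial <https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html>`_ for starting a Ray cluster on Slurm. We have verified the :doc:`GSM8K example<../examples/gsm8k_example>` on a Slurm cluster in a multi-node setup with the following steps.

1. [Optional] If your cluster supports `Apptainer or Singularity <https://apptainer.org/docs/user/main/>`_ and you wish to use it, convert verl's Docker image to an Apptainer image. Alternatively, set up the environment with the package manager available on your cluster, or use another container runtime (e.g. through `Slurm's OCI support <https://slurm.schedmd.com/containers.html>`_).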

.. code:: bash

    apptainer pull /your/dest/dir/vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3.sif docker://verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3

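2. Follow the :doc:`GSM8K example<../examples/gsm8k_example>` to prepare the dataset and model checkpoints.

3. Modify `examples/slurm/ray_on_slurm.slurm <https://github.com/volcengine/verl/blob/main/examples/slurm/ray_on_slurm.slurm>`_ with your cluster's own information.

4. Submit the job script to the Slurm cluster with ``sbatch``.

Please note that Slurm cluster setups may vary. If you encounter any issues, refer to Ray's `Slurm user guide <https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html>`_ for common caveats.

If you change the Slurm resource specifications, make sure to update the environment variables in the job script where necessary.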


Installation related
------------------------

``NotImplementedError: TensorDict does not support membership checks with the `in` keyword.``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Detailed error message:

.. code:: bash

    NotImplementedError: TensorDict does not support membership checks with the `in` keyword. If you want to check if a particular key is in your TensorDict, please use `key in tensordict.keys()` instead.

Cause: there is no suitable version of the tensordict package for the linux-arm64 platform. You can confirm this as follows:

.. code:: bash

    pip install tensordict==0.6.2

Example output:

.. code:: bash

    ERROR: Could not find a version that satisfies the requirement tensordict==0.6.2 (from versions: 0.0.1a0, 0.0.1b0, 0.0.1rc0, 0.0.2a0, 0.0.2b0, 0.0.3, 0.1.0, 0.1.1, 0.1.2, 0.8.0, 0.8.1, 0.8.2, 0.8.3)
    ERROR: No matching distribution found for tensordict==0.6.2

Solution 1:
  Install tensordict from source:

.. code:: bash

    pip uninstall tensordict
    git clone https://github.com/pytorch/tensordict.git
    cd tensordict/
    git checkout v0.6.2
    python setup.py develop
    pip install -v -e .

Solution 2:
  Temporarily patch the code that raises the error: in the membership check, change ``tensordict_var`` to ``tensordict_var.keys()``.
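
The pattern behind the second solution can be sketched as a small helper that always tests membership through ``.keys()``, as the error message recommends. The name ``has_key`` is hypothetical, and a plain ``dict`` stands in for a TensorDict so that the snippet runs without tensordict installed.

.. code:: python

    def has_key(container, key):
        """Membership test that goes through .keys() instead of `key in container`."""
        return key in container.keys()


    if __name__ == "__main__":
        batch = {"input_ids": [1, 2, 3]}  # a plain dict stands in for a TensorDict here
        print(has_key(batch, "input_ids"))       # True
        print(has_key(batch, "attention_mask"))  # False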


Illegal memory access
---------------------------------

If you encounter an error message like ``CUDA error: an illegal memory access was encountered`` during rollout, please check the vLLM documentation for troubleshooting steps specific to your vLLM version.

Checkpoints
------------------------

If you want to convert a model checkpoint into the Hugging Face safetensors format, please refer to ``verl/model_merger``.

Triton ``compile_module_from_src`` error
------------------------------------------------

If you encounter a Triton compilation error similar to the stack trace below, set the ``use_torch_compile`` flag as described in the `config documentation <https://verl.readthedocs.io/en/latest/examples/config.html>`_ to disable just-in-time compilation for fused kernels.

.. code:: bash

  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 338, in run
    return self.fn.run(*args, **kwargs)
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/jit.py", line 607, in run
    device = driver.active.get_current_device()
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 23, in __getattr__
    self._initialize_obj()
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
    self._obj = self._init_fn()
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/driver.py", line 9, in _create_driver
    return actives[0]()
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
    self.utils = CudaUtils()  # TODO: make static
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
    mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
    so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
  File "/data/lbh/conda_envs/verl/lib/python3.10/site-packages/triton/runtime/build.py", line 48, in _build
    ret = subprocess.check_call(cc_cmd)
  File "/data/lbh/conda_envs/verl/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)

What is the meaning of train batch size, mini batch size, and micro batch size?
------------------------------------------------------------------------------------------

The following figure illustrates the relationship between the different batch size configurations.

https://excalidraw.com/#json=pfhkRmiLm1jnnRli9VFhb,Ut4E8peALlgAUpr7E5pPCA

.. image:: https://github.com/user-attachments/assets/16aebad1-0da6-4eb3-806d-54a74e712c2d
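
As a worked example of how the three sizes can relate, the sketch below derives update and gradient accumulation counts. It is an illustration under stated assumptions, not verl's implementation: the function and parameter names are hypothetical, the train and mini batch sizes are assumed to be global (summed over all GPUs), the micro batch size is assumed to be per GPU, and multiple responses per prompt are ignored.

.. code:: python

    def batch_plan(train_batch_size, mini_batch_size, micro_batch_size_per_gpu, num_gpus):
        """Derive update and accumulation counts from the three batch sizes."""
        if train_batch_size % mini_batch_size:
            raise ValueError("train_batch_size must be divisible by mini_batch_size")
        if mini_batch_size % num_gpus:
            raise ValueError("mini_batch_size must be divisible by num_gpus")
        per_gpu_mini = mini_batch_size // num_gpus
        if per_gpu_mini % micro_batch_size_per_gpu:
            raise ValueError("per-GPU mini batch must be divisible by the micro batch size")
        return {
            # One optimizer update is performed per mini batch.
            "optimizer_updates_per_step": train_batch_size // mini_batch_size,
            # Each GPU processes its share of the mini batch ...
            "mini_batch_per_gpu": per_gpu_mini,
            # ... in micro batches whose gradients are accumulated.
            "grad_accumulation_steps": per_gpu_mini // micro_batch_size_per_gpu,
        }

Under these assumptions, ``batch_plan(1024, 256, 8, 8)`` gives 4 optimizer updates per training step, 32 samples per GPU in each mini batch, and 4 gradient accumulation steps.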

How to generate a Ray timeline to analyze the performance of a training job?
------------------------------------------------------------------------------------------

To generate a Ray timeline file, set the config option ``ray_init.timeline_file`` to a JSON file path.
For example:

.. code:: bash

    ray_init.timeline_file=/tmp/ray_timeline.json
  
The file is written to the specified path at the end of the training job.
You can view the Ray timeline file with tools such as chrome://tracing or the Perfetto UI.
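
Besides the graphical viewers, the timeline can also be summarized in a script. The sketch below assumes the file holds a JSON array of events in the Chrome trace event format, where complete events have ``"ph": "X"`` and a ``dur`` field in microseconds; the helper names are hypothetical.

.. code:: python

    import json
    from collections import defaultdict


    def summarize_timeline(events):
        """Sum the duration of complete events, grouped by event name."""
        totals = defaultdict(float)
        for event in events:
            # Only complete events ("X") carry a duration.
            if event.get("ph") == "X" and "dur" in event:
                totals[event.get("name", "<unnamed>")] += event["dur"]
        # Longest total duration first.
        return sorted(totals.items(), key=lambda item: item[1], reverse=True)


    def load_timeline(path):
        """Load a timeline file that stores a JSON array of trace events."""
        with open(path) as f:
            return json.load(f)

For instance, ``summarize_timeline(load_timeline("/tmp/ray_timeline.json"))[:10]`` lists the ten event names with the largest accumulated duration.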

The following figure shows the Ray timeline file generated by a training job on 1 node with 4 GPUs.

.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray_timeline.png?raw=true

How to set a proxy only for wandb?
------------------------------------------------------------------------------------------

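If you need a proxy to access wandb, you can add the following config to your training job script.
Compared with setting the global ``https_proxy`` environment variable, this approach does not interfere with other HTTP requests, such as those made by ChatCompletionScheduler.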

.. code:: bash

  +trainer.wandb_proxy=http://<your proxy and port>

Mismatch between inference and training sequence (high actor/grad_norm)
------------------------------------------------------------------------------------------

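If the ``actor/grad_norm`` metric keeps increasing during training, the cause may be a significant precision mismatch between the inference engine and training. You can confirm this with the following parameter: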

.. code:: bash

    actor_rollout_ref.rollout.calculate_log_probs=True

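This parameter adds metrics such as ``training/rollout_probs_diff_mean``, which can be used to verify whether there is a precision difference between inference and training.

Under normal circumstances, the value of ``training/rollout_probs_diff_mean`` should be below 0.005. If you observe a value above 0.01, it indicates a precision problem in the inference engine.
Precision problems are known to occur under the following conditions: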

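1. Using GPUs with a non-Hopper architecture, such as A100, L20, or B200.

2. Using vLLM as the inference engine, in a version affected by `issue 22103 <https://github.com/vllm-project/vllm/issues/22103>`_.

3. The input and output texts are long, for example in multi-turn scenarios where a reasoning model such as Qwen3 is used for RL training.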

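If all three conditions above are met and you observe that ``rollout_probs_diff_mean`` is too high, it is recommended to add the following parameter to resolve the precision issue: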

.. code:: bash

    +actor_rollout_ref.rollout.engine_kwargs.vllm.disable_cascade_attn=True

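The root cause of this issue is a bug in the flash attention implementation used by vLLM. Although the bug has been fixed, the fix has not yet been released in the latest version of vLLM (v0.10.2).
For a more detailed explanation, see `Fix LSE output error in FA2 kv-split <https://github.com/vllm-project/flash-attention/pull/87>`_.

Until vLLM releases a new version that includes this fix, it is recommended to disable cascade attention with the configuration above as a workaround.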
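
The thresholds given in this section for ``training/rollout_probs_diff_mean`` can be applied to logged values with a small helper. This is only a sketch: the function name and the "borderline" label for values between the two thresholds are assumptions made for illustration.

.. code:: python

    def classify_probs_diff(value, normal_below=0.005, problem_above=0.01):
        """Classify a training/rollout_probs_diff_mean value using the two thresholds."""
        if value < normal_below:
            return "normal"
        if value > problem_above:
            return "precision issue likely"
        # The text does not classify values between the thresholds.
        return "borderline"


    if __name__ == "__main__":
        for logged in (0.002, 0.007, 0.03):
            print(logged, classify_probs_diff(logged))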