Multi-turn Rollout Support
==========================

Last updated: June 27, 2025.

Basic Configuration
~~~~~~~~~~~~~~~~~~~

To enable multi-turn rollout, set the following fields in your rollout configuration:

.. code-block:: yaml

    actor_rollout_ref:
        rollout:
            multi_turn: True
            name: "sglang"

This configuration activates the sglang engine for multi-turn interaction during rollout.

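Custom Tool Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~

For custom environment interaction tools, you can implement your own tool based on ``verl.tools.base_tool.BaseTool``. Then specify your tool configuration in a YAML file: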

.. code-block:: yaml

    tools:
      - class_name: ""
        config:
            type: native
        tool_schema:

For a complete example of a tool configuration, see `GSM8KTool_example_configuration <https://github.com/volcengine/verl/blob/main/examples/sglang_multiturn/config/tool_config/gsm8k_tool_config.yaml>`_. Its implementation is in `gsm8k_tool.py <https://github.com/volcengine/verl/blob/main/verl/tools/gsm8k_tool.py>`_.

Finally, set ``tools_config_file`` in your rollout configuration:

.. code-block:: yaml

    actor_rollout_ref:
        rollout:
            tool_kwargs:
                tools_config_file: <path_to_tool_yaml_file>

This allows custom tool behavior to be integrated into the actor rollout step.

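If you want rollout with simulated interaction, set ``interaction_config_file`` in your rollout configuration. The first block below shows the interaction YAML file, and the second shows the rollout setting that points to it: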

.. code-block:: yaml

    interaction:
      - class_name: ""
        config: {}

.. code-block:: yaml

    actor_rollout_ref:
        rollout:
            interaction_config_file: <path_to_interaction_yaml_file>

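If your tool creates multi-modal inputs, return them as a list of multi-modal inputs from your ``tool.execute()`` implementation.

Images and videos must be processed before they are returned. For example, if you are using Qwen2.5-VL, you can use the following code to obtain the representations: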

.. code-block:: python

    async def create(self, ...) -> tuple[str, ToolResponse]:
        ...
        from verl.utils.dataset.vision_utils import process_image, process_video

        img1 = process_image(img1)
        video1 = process_video(video1)

        # vLLM uses the keys ("image" | "video"), not ("images" | "videos"),
        # so the image/video lists must be passed under ("image" | "video").
        # Link: https://github.com/vllm-project/vllm/blob/3c545c0c3b98ee642373a308197d750d0e449403/vllm/multimodal/parse.py#L205
        return instance_id, ToolResponse(image=[img1, ...], video=[video1, ...], text="...")

    async def execute(self, ...) -> tuple[ToolResponse, float, dict]:
        ...
        from verl.utils.dataset.vision_utils import process_image, process_video

        img1 = process_image(img1)
        video1 = process_video(video1)

        # vLLM uses the keys ("image" | "video"), not ("images" | "videos"),
        # so the image/video lists must be passed under ("image" | "video").
        # Link: https://github.com/vllm-project/vllm/blob/3c545c0c3b98ee642373a308197d750d0e449403/vllm/multimodal/parse.py#L205
        return ToolResponse(image=[img1, ...], video=[video1, ...], text="..."), 0, {}

Remember to set ``return_multi_modal_inputs: False`` in your dataset configuration so that multi-modal inputs produced during rollout are processed correctly.
See the `Handling Multi-Modal Inputs in Datasets`_ section for details.

MCP Tool Configuration
~~~~~~~~~~~~~~~~~~~~~~

MCP interaction tools can be configured flexibly through a YAML file. A typical setup looks like this:

.. code-block:: yaml

    tools:
      - class_name: ""
        config:
            type: mcp
        mcp:
            mcp_servers_config_path: ./mcp_server.json
            tool_selected_list: {}

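The ``tool_selected_list`` field is optional and specifies which tools to use from the servers. To enable all available tools, omit this attribute. ``mcp_servers_config_path`` points to a JSON file that contains the MCP server configuration. For example: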

.. code-block:: json

      {
          "mcpServers": {
              "SSE Server": {
                  "url": "your_server_url",
                  "auth_token": "your_server_api_token"
              },
              "STDIO Server": {
                  "command": "npx",
                  "args": ["-y", "server-mcp@0.2.1"],
                  "env": {
                    "SERVER_API_KEY": "your_server_api_token"
                  }
              }
          }
      }
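
Before launching training, it can help to confirm that the JSON file parses and that every server entry has one of the two shapes shown above. The following is a minimal sketch and not part of verl: ``classify_mcp_servers`` is a hypothetical helper, and it assumes each entry carries either a ``url`` key or a ``command`` key, as in the example.

.. code-block:: python

    import json

    # Trimmed copy of the example configuration above, embedded so the sketch runs as is.
    EXAMPLE_CONFIG = """
    {
        "mcpServers": {
            "SSE Server": {"url": "your_server_url", "auth_token": "your_server_api_token"},
            "STDIO Server": {"command": "npx", "args": ["-y", "server-mcp@0.2.1"]}
        }
    }
    """


    def classify_mcp_servers(config_text: str) -> dict[str, str]:
        """Map each server name to the key that defines how it is reached.

        Hypothetical helper: returns "url" or "command" per server and raises
        ValueError for an entry that has neither key.
        """
        servers = json.loads(config_text)["mcpServers"]
        result = {}
        for name, spec in servers.items():
            if "url" in spec:
                result[name] = "url"
            elif "command" in spec:
                result[name] = "command"
            else:
                raise ValueError(f"server {name!r} has neither 'url' nor 'command'")
        return result


    summary = classify_mcp_servers(EXAMPLE_CONFIG)
    print(summary)

A check like this only catches malformed files early. It does not verify that the servers are reachable.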

Because the content returned by MCP servers can vary in format, you can inherit from ``MCPBaseTool`` and override the ``_parse_tool_result`` method to implement custom parsing logic.

.. code-block:: python

   class MCPYourTool(MCPBaseTool):
       def __init__(self, config: dict, tool_schema: OpenAIFunctionToolSchema):
           super().__init__(config, tool_schema)

       def _parse_tool_result(self, content: list) -> Tuple[str, dict]:
           ...

Overall, you can refer to `mcp_search_tool.py <https://github.com/volcengine/verl/blob/main/verl/tools/mcp_search_tool.py>`_ and `mcp_tool_config.yaml <https://github.com/volcengine/verl/blob/main/examples/sglang_multiturn/config/tool_config/mcp_tool_config.yaml>`_ for custom implementation and configuration.

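Multi-turn Tokenization
~~~~~~~~~~~~~~~~~~~~~~~

Tokenizing multi-turn rollouts is challenging: after applying the chat template and tokenizing the full message list, it is hard to tell which tokens belong to assistant messages. Because the token list is flat, it has no direct correspondence to message roles.

To address this, we use a **delta-based tokenization** strategy. Each time the LLM generates a new message, we:

1. Apply the chat template to all previous messages (``messages[:i]``).
2. Apply the chat template again, this time including the latest message (``messages[:i+1]``).
3. Tokenize only the *delta* between these two serialized message strings.

This ensures that only tokens generated by the assistant are included in the loss mask.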

.. code-block:: python

   # When using a tokenizer
   # add_generation_prompt=True puts the assistant prompt (e.g., "<|im_start|>assistant") into prev, so it is excluded from the delta
   prev = tokenizer.apply_chat_template(messages[:i], add_generation_prompt=True, tokenize=False)
   curr = tokenizer.apply_chat_template(messages[:i+1], add_generation_prompt=False, tokenize=False)
   delta_ids = tokenizer.encode(curr[len(prev):], add_special_tokens=False)
   token_ids += delta_ids
   loss_mask += [1] * len(delta_ids)  # mark only the new assistant tokens

.. code-block:: python

   # When using a processor
   # add_generation_prompt=True puts the assistant prompt (e.g., "<|im_start|>assistant") into prev, so it is excluded from the delta
   prev = processor.apply_chat_template(messages[:i], add_generation_prompt=True, tokenize=False)
   prev_ids = processor(text=[prev], images=images, videos=videos, return_tensors="pt")["input_ids"][0].tolist()
   curr = processor.apply_chat_template(messages[:i+1], add_generation_prompt=False, tokenize=False)
   curr_ids = processor(text=[curr], images=images, videos=videos, return_tensors="pt")["input_ids"][0].tolist()
   delta_ids = curr_ids[len(prev_ids):]
   token_ids += delta_ids
   loss_mask += [1] * len(delta_ids)  # mark only the new assistant tokens
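
The full loop can be illustrated without a real tokenizer. The sketch below is a toy under stated assumptions: ``render`` is a hypothetical stand-in for ``apply_chat_template`` using a ChatML-style layout, and ``encode`` is a hypothetical one-token-per-character tokenizer. It shows how the generation prompt and non-assistant messages receive mask 0 while only the assistant delta receives mask 1.

.. code-block:: python

    def render(messages, add_generation_prompt=False):
        """Toy ChatML-style template (stand-in for apply_chat_template)."""
        text = "".join(
            f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
        )
        if add_generation_prompt:
            text += "<|im_start|>assistant\n"
        return text


    def encode(text):
        """Toy tokenizer: one token id per character."""
        return [ord(ch) for ch in text]


    messages = [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4."},
        {"role": "user", "content": "Thanks."},
        {"role": "assistant", "content": "Welcome."},
    ]

    token_ids, loss_mask = [], []
    for i, message in enumerate(messages):
        base = render(messages[:i])
        curr = render(messages[: i + 1])
        if message["role"] == "assistant":
            # The generation prompt is part of the sequence but is not trained on.
            prompt = render(messages[:i], add_generation_prompt=True)
            prompt_ids = encode(prompt[len(base):])
            token_ids += prompt_ids
            loss_mask += [0] * len(prompt_ids)
            # Only the delta after the generation prompt is assistant output.
            delta_ids = encode(curr[len(prompt):])
            token_ids += delta_ids
            loss_mask += [1] * len(delta_ids)
        else:
            delta_ids = encode(curr[len(base):])
            token_ids += delta_ids
            loss_mask += [0] * len(delta_ids)

    trained_text = "".join(chr(t) for t, m in zip(token_ids, loss_mask) if m)
    print(repr(trained_text))

With this toy template, the concatenated deltas reproduce the full tokenization exactly. Real templates and subword tokenizers do not guarantee this.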

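Although we have verified that this approach produces results consistent with full message tokenization, chat templates of future models may break compatibility. To guard against silent inconsistencies, by default we compare the delta-based tokenization with the full tokenization at the end of each rollout.

If you see the following warning, check the mismatched substring in the log: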

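.. code-block:: text

    Inconsistent training and inference tokenization detected. This may lead to unexpected behavior during training. Please review your chat template to determine if this is intentional. For more information, refer to the multiturn README.md.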

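The tokenization sanity check mode is configured through the ``actor_rollout_ref.rollout.multi_turn.tokenization_sanity_check_mode`` parameter, which accepts the following values:

- ``strict`` (default): Performs a strict comparison between the delta-based and full tokenization results and warns on any difference.

- ``ignore_strippable``: Ignores differences in whitespace characters (``\n``, ``\t``, ``\r``, spaces) while still checking for meaningful text mismatches. This is useful when debugging chat template issues where whitespace variations are expected and acceptable.

- ``disable``: Turns off the tokenization sanity check entirely. Use this only if you have thoroughly verified that the tokenization differences are expected and do not affect training.

Configuration example: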

.. code-block:: yaml

    actor_rollout_ref:
        rollout:
            multi_turn:
                tokenization_sanity_check_mode: "ignore_strippable"  # one of: "disable", "ignore_strippable", "strict"
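
To make the difference between the three modes concrete, here is a simplified sketch. It is not verl's implementation: ``tokenizations_agree`` is a hypothetical function that compares decoded text rather than token ids, and it approximates ``ignore_strippable`` by dropping the four whitespace characters listed above before comparing.

.. code-block:: python

    STRIPPABLE = "\n\t\r "


    def drop_strippable(text: str) -> str:
        """Remove the whitespace characters that ignore_strippable tolerates."""
        return "".join(ch for ch in text if ch not in STRIPPABLE)


    def tokenizations_agree(full_text: str, delta_text: str, mode: str = "strict") -> bool:
        """Return True when no warning would be raised under the given mode."""
        if mode == "disable":
            return True  # the check is skipped entirely
        if mode == "ignore_strippable":
            return drop_strippable(full_text) == drop_strippable(delta_text)
        if mode == "strict":
            return full_text == delta_text
        raise ValueError(f"unknown mode: {mode!r}")


    full = "<|im_start|>assistant\n\n4.<|im_end|>"
    delta = "<|im_start|>assistant\n4.<|im_end|>"
    for mode in ("strict", "ignore_strippable", "disable"):
        print(mode, tokenizations_agree(full, delta, mode))

In this example the two strings differ only by one newline, so ``strict`` reports a mismatch while ``ignore_strippable`` and ``disable`` do not.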

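Handling Multi-Modal Inputs in Datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your dataset contains multi-modal inputs (such as images or videos), you can control whether they are preprocessed and included in each sample by setting the ``return_multi_modal_inputs`` flag in your dataset configuration (used by RLHFDataset).

- ``return_multi_modal_inputs: True`` (default): The dataset preprocesses each sample and includes a ``multi_modal_inputs`` dictionary. This dictionary contains the model-ready representations (e.g., image tensors, video tensors) produced by your processor. This is useful for single-turn or SFT-style training, where the model expects all modalities to be present in the batch.

- ``return_multi_modal_inputs: False``: The dataset does not include the ``multi_modal_inputs`` field. This is recommended for multi-turn RL or tool-augmented rollouts, where the model may generate new multi-modal inputs dynamically during rollout and you want to avoid conflicts or redundant data in the batch.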


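Special Cases
^^^^^^^^^^^^^

Some models (e.g., Qwen/QwQ-32B and the Qwen3 series) remove internal reasoning content when rendering the chat template. As a result, the rendered content of a message can differ from turn to turn, which makes delta-based tokenization inaccurate.

For example, consider the following conversation: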

.. code-block:: python

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "<think>user asked about a simple math question.</think> 2 + 2 = 4."},
        {"role": "user", "content": "Explain why."},
        {"role": "assistant", "content": "<think>user wants to know the reasoning behind the answer. Search for a good explanation</think>",
         "tool_calls": [{"id": "tool1", "type": "search", "arguments": {"query": "Why is 2 + 2 = 4?"}}]},
        {"role": "tool", "content": "The sum of two and two is four because it is a basic arithmetic operation."},
        {"role": "assistant", "content": "<think>The tool provided a good explanation.</think>The sum of two and two is four because it is a basic arithmetic operation."}
    ]

1. After applying the chat template, Qwen/QwQ-32B removes all reasoning content except that of the last assistant message.

.. code-block:: text

    <|im_start|>system
    You are a helpful assistant.<|im_end|>
    <|im_start|>user
    What is 2 + 2?<|im_end|>
    <|im_start|>assistant
     2 + 2 = 4.<|im_end|>
    <|im_start|>user
    Explain why.<|im_end|>
    <|im_start|>assistant
    <tool_call>
    {"name": "", "arguments": {"query": "Why is 2 + 2 = 4?"}}
    </tool_call><|im_end|>
    <|im_start|>user
    <tool_response>
    The sum of two and two is four because it is a basic arithmetic operation.
    </tool_response><|im_end|>
    <|im_start|>assistant
    <think>The tool provided a good explanation.</think> The sum of two and two is four because it is a basic arithmetic operation.<|im_end|>

2. The Qwen3 series removes all reasoning content that precedes the last user message.

.. code-block:: text

    <|im_start|>system
    You are a helpful assistant.<|im_end|>
    <|im_start|>user
    What is 2 + 2?<|im_end|>
    <|im_start|>assistant
     2 + 2 = 4.<|im_end|>
    <|im_start|>user
    Explain why.<|im_end|>
    <|im_start|>assistant
    <think>
    user wants to know the reasoning behind the answer. Search for a good explanation
    </think>

    <tool_call>
    {"name": "", "arguments": {"query": "Why is 2 + 2 = 4?"}}
    </tool_call><|im_end|>
    <|im_start|>user
    <tool_response>
    The sum of two and two is four because it is a basic arithmetic operation.
    </tool_response><|im_end|>
    <|im_start|>assistant
    <think>
    The tool provided a good explanation.
    </think>

    The sum of two and two is four because it is a basic arithmetic operation.<|im_end|>

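To handle this, we fall back to a **fixed base conversation** that contains only one system message and one user message. Because this base contains no assistant messages or reasoning content, it stays consistent across all turns.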

.. code-block:: python

    BASE_CHAT_HISTORY = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "I am a user."}
    ]
    prev = tokenizer.apply_chat_template(BASE_CHAT_HISTORY, add_generation_prompt=True, tokenize=False)
    curr = tokenizer.apply_chat_template([*BASE_CHAT_HISTORY, messages[i]], add_generation_prompt=False, tokenize=False)
    delta_ids = tokenizer.encode(curr[len(prev):], add_special_tokens=False)
    token_ids += delta_ids
    loss_mask += [1] * len(delta_ids)  # mark only the new assistant tokens
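
The effect can be reproduced with a toy template. The sketch below is an illustration under stated assumptions, not any model's real template: the hypothetical ``render`` function drops ``<think>`` blocks from every assistant message except when that message is the last one in the list. It shows that slicing against the real history breaks, while slicing against the fixed base conversation does not.

.. code-block:: python

    import re

    THINK = re.compile(r"<think>.*?</think>", re.DOTALL)


    def render(messages, add_generation_prompt=False):
        """Toy template: strips reasoning from assistant messages that are not last."""
        last = len(messages) - 1
        parts = []
        for idx, m in enumerate(messages):
            content = m["content"]
            if m["role"] == "assistant" and idx != last:
                content = THINK.sub("", content)
            parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>\n")
        if add_generation_prompt:
            parts.append("<|im_start|>assistant\n")
        return "".join(parts)


    messages = [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "<think>Simple math.</think>4."},
        {"role": "user", "content": "Explain why."},
    ]

    # Slicing against the real history: the earlier assistant message is
    # rendered differently once it is no longer last, so prev is not a prefix.
    prev = render(messages[:2])
    curr = render(messages[:3])
    naive_delta = curr[len(prev):]

    BASE_CHAT_HISTORY = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "I am a user."},
    ]


    def delta_text(message):
        """Render one message on top of the fixed base and return only its part."""
        is_assistant = message["role"] == "assistant"
        base = render(BASE_CHAT_HISTORY, add_generation_prompt=is_assistant)
        full = render([*BASE_CHAT_HISTORY, message])
        return full[len(base):]


    print(repr(naive_delta))
    print(repr(delta_text(messages[2])))

Because the base never contains an assistant message, its rendering cannot change between turns, which is what makes the slice offset reliable.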

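This approach works well for the Qwen3 series. However, Qwen/QwQ-32B currently has a bug in its chat template. A fix_ has been proposed but not yet adopted. Until then, use the following commands to download the fixed model revision: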

.. code-block:: bash

    pip install huggingface_hub
    huggingface-cli download Qwen/QwQ-32B --revision refs/pr/81

.. _fix: https://huggingface.co/Qwen/QwQ-32B/discussions/81

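Discrepancy Between Training and Inference Templates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although the approach above fixes the delta mismatch, the removal of reasoning content by the inference-time chat template introduces a new discrepancy: training uses the full reasoning content, while inference does not.

This mismatch can affect model performance in unpredictable ways. To avoid it, we default to using the full response (including reasoning) for both training and rollout.

However, this approach has trade-offs:

1. Long reasoning content can easily exceed the model's context window, especially in multi-turn rollout.
2. There is now a mismatch between rollout and production: if you use the default chat template in production, the model will not see reasoning content from past turns.

We are still evaluating the impact of these issues. If you run into context length problems, or prefer rollout to match production (that is, to exclude reasoning), you can enable:

``actor_rollout_ref.rollout.multi_turn.use_inference_chat_template = True``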

GSM8K Multi-turn Training Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See the training performance of multi-turn rollout on the GSM8K task HERE_.

.. _HERE: https://wandb.ai/zhaochenyang20/gsm8k_async_rl/runs/1ro1r7om?nw=nwuserzhaochenyang20

Interaction System
~~~~~~~~~~~~~~~~~~

For dynamic conversational feedback during RL training, see:

.. toctree::
   :maxdepth: 1

   interaction_system

Search Tool Integration
~~~~~~~~~~~~~~~~~~~~~~~

.. toctree::
   :maxdepth: 1

   search_tool_example

Code Walkthrough
~~~~~~~~~~~~~~~~

For a deeper look at the code execution flow, read https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/rlhf/verl/multi-turn/code-walk-through