Multi-turn Rollout Support

Last updated: June 27, 2025.

Basic Configuration

To enable multi-turn rollout, set the following fields in your rollout configuration:

actor_rollout_ref:
    rollout:
        multi_turn: True
        name: "sglang"

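This configuration activates the sglang engine for multi-turn interaction during rollout.

Custom Tool Configuration

For custom environment interaction tools, you can implement your own tools based on verl.tools.base_tool.BaseTool. Then specify your tool configuration in a YAML file: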

tools:
  - class_name: ""
    config:
        type: native
    tool_schema:

You may refer to GSM8KTool_example_configuration_, which is one example of a tool configuration. Its implementation can be found in gsm8k_tool.py_.

Finally, set tools_config_file in your rollout configuration:

actor_rollout_ref:
    rollout:
        tool_kwargs:
            tools_config_file: <path_to_tool_yaml_file>

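This allows customized tool behavior to be integrated into the actor rollout step.

If you want rollouts with simulated interaction, define the interaction in a YAML file and point interaction_config_file in your rollout configuration to it: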

interaction:
  - class_name: ""
    config: {}

actor_rollout_ref:
    rollout:
        interaction_config_file: <path_to_interaction_yaml_file>

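If your tool creates multi-modal inputs, return a list of multi-modal inputs from your tool.execute() implementation.

Images and videos should be processed before they are returned. For example, if you are using Qwen2.5-VL, you can use the following code to obtain the representations: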

async def create(self, ...) -> tuple[str, ToolResponse]:
    ...
    from verl.utils.dataset.vision_utils import process_image, process_video

    img1 = process_image(img1)
    video1 = process_video(video1)

    # In vllm the (image | video) keys are ("image" | "video"), not ("images" | "videos"),
    # so we use ("image" | "video") to pass the list of images/videos
    # Link: https://github.com/vllm-project/vllm/blob/3c545c0c3b98ee642373a308197d750d0e449403/vllm/multimodal/parse.py#L205
    return instance_id, ToolResponse(image=[img1, ...], video=[video1, ...], text="...")

async def execute(self, ...) -> tuple[ToolResponse, float, dict]:
    ...
    from verl.utils.dataset.vision_utils import process_image, process_video

    img1 = process_image(img1)
    video1 = process_video(video1)

    # In vllm the (image | video) keys are ("image" | "video"), not ("images" | "videos"),
    # so we use ("image" | "video") to pass the list of images/videos
    # Link: https://github.com/vllm-project/vllm/blob/3c545c0c3b98ee642373a308197d750d0e449403/vllm/multimodal/parse.py#L205
    return ToolResponse(image=[img1, ...], video=[video1, ...], text="..."), 0, {}

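Remember to set return_multi_modal_inputs: False in your dataset configuration so that multi-modal inputs are handled correctly during rollout. See the `Handling Multi-Modal Inputs in Datasets`_ section for details.

MCP Tool Configuration

MCP interaction tools can be configured flexibly through a YAML file. A typical setup looks like this: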

tools:
  - class_name: ""
    config:
        type: mcp
    mcp:
        mcp_servers_config_path: ./mcp_server.json
        tool_selected_list: {}

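The tool_selected_list field is optional and specifies which tools to use from the servers. To enable all available tools, simply omit this attribute. In addition, mcp_servers_config_path points to a JSON file containing the MCP server configurations. For example: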

{
    "mcpServers": {
        "SSE Server": {
            "url": "your_server_url",
            "auth_token": "your_server_api_token"
        },
        "STDIO Server": {
            "command": "npx",
            "args": ["-y", "server-mcp@0.2.1"],
            "env": {
              "SERVER_API_KEY": "your_server_api_token"
            }
        }
    }
}
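
As a rough illustration of the JSON layout above, the sketch below loads such a file's contents and sorts the servers by how they are reached. The helper name classify_servers and the rule "an entry with url is an SSE server, an entry with command is a STDIO server" are assumptions made for this example, inferred from the server names in the sample; they are not part of verl.

```python
import json

# Same shape as the example above; all values are placeholders.
CONFIG_TEXT = """
{
    "mcpServers": {
        "SSE Server": {"url": "your_server_url", "auth_token": "your_server_api_token"},
        "STDIO Server": {
            "command": "npx",
            "args": ["-y", "server-mcp@0.2.1"],
            "env": {"SERVER_API_KEY": "your_server_api_token"}
        }
    }
}
"""


def classify_servers(config: dict) -> dict[str, str]:
    """Hypothetical helper: map each server name to 'sse' or 'stdio'."""
    kinds = {}
    for name, entry in config.get("mcpServers", {}).items():
        if "url" in entry:
            kinds[name] = "sse"      # reached over the network
        elif "command" in entry:
            kinds[name] = "stdio"    # launched as a local subprocess
        else:
            raise ValueError(f"server {name!r} needs either 'url' or 'command'")
    return kinds


servers = classify_servers(json.loads(CONFIG_TEXT))
print(servers)
```

A check like this can catch a malformed entry before training starts, rather than at the first tool call.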

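Because the content returned by MCP servers can differ in format, users can inherit from MCPBaseTool and override the _parse_tool_result method to implement custom parsing logic.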

class MCPYourTool(MCPBaseTool):
    def __init__(self, config: dict, tool_schema: OpenAIFunctionToolSchema):
        super().__init__(config, tool_schema)

    def _parse_tool_result(self, content: list) -> Tuple[str, dict]:
        ...

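Overall, you may refer to mcp_search_tool.py_ and mcp_tool_config.yaml_ for custom implementation and configuration.

Multi-turn Tokenization

Tokenizing multi-turn rollouts poses a challenge: after applying the chat template and tokenizing the full message list, it is hard to tell which tokens belong to assistant messages. Because the token list is flat, it has no direct correspondence to message roles.

To address this, we adopt a delta-based tokenization strategy. Each time the LLM generates a new message, we:

Apply the chat template to all prior messages (messages[:i]).
Apply the chat template again, this time including the latest message (messages[:i+1]).
Tokenize only the delta between these two serialized message strings.

This ensures that only tokens generated by the assistant are included in the loss mask.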

# When using a tokenizer
# Setting add_generation_prompt=True on the prefix keeps the assistant prompt (e.g. "<|im_start|>assistant") out of the delta
prev = tokenizer.apply_chat_template(messages[:i], add_generation_prompt=True, tokenize=False)
curr = tokenizer.apply_chat_template(messages[:i+1], add_generation_prompt=False, tokenize=False)
new_token_ids = tokenizer.encode(curr[len(prev):], add_special_tokens=False)
token_ids += new_token_ids
loss_mask += [1] * len(new_token_ids)  # only the new assistant tokens contribute to the loss

# When using a processor
# Setting add_generation_prompt=True on the prefix keeps the assistant prompt (e.g. "<|im_start|>assistant") out of the delta
prev = processor.apply_chat_template(messages[:i], add_generation_prompt=True, tokenize=False)
prev_model_inputs = processor(text=prev, images=images, videos=videos, return_tensors="pt")
curr = processor.apply_chat_template(messages[:i+1], add_generation_prompt=False, tokenize=False)
curr_model_inputs = processor(text=curr, images=images, videos=videos, return_tensors="pt")
# input_ids has shape (1, seq_len); take the single row as a Python list
prev_ids = prev_model_inputs["input_ids"][0].tolist()
curr_ids = curr_model_inputs["input_ids"][0].tolist()
new_token_ids = curr_ids[len(prev_ids):]
token_ids += new_token_ids
loss_mask += [1] * len(new_token_ids)  # only the new assistant tokens contribute to the loss
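
To make the bookkeeping concrete, here is a self-contained toy version of the loop over a whole conversation. The ChatML-style apply_chat_template and the one-token-per-character encode are stand-ins written for this example, not a real tokenizer, and the loop assumes strictly alternating user and assistant turns. With a character-level tokenizer the delta result always matches full tokenization; with a real subword tokenizer that is not guaranteed, which is why a consistency check is useful.

```python
def apply_chat_template(messages, add_generation_prompt=False):
    """Toy ChatML-style template; a stand-in for tokenizer.apply_chat_template."""
    text = "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages)
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text


def encode(text):
    """Toy tokenizer: one token per character."""
    return [ord(ch) for ch in text]


messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "Explain why."},
    {"role": "assistant", "content": "Basic arithmetic."},
]

token_ids, loss_mask = [], []
for i, message in enumerate(messages):
    is_assistant = message["role"] == "assistant"
    # Assistant turn: the generation prompt is part of the prefix, so it is not in the delta.
    # User turn: the generation prompt is appended to the delta and stays unmasked.
    prev = apply_chat_template(messages[:i], add_generation_prompt=is_assistant)
    curr = apply_chat_template(messages[: i + 1], add_generation_prompt=not is_assistant)
    new_ids = encode(curr[len(prev):])
    token_ids += new_ids
    loss_mask += [1 if is_assistant else 0] * len(new_ids)

# Text covered by the loss mask, and a check against full tokenization.
masked_text = "".join(chr(t) for t, m in zip(token_ids, loss_mask) if m)
matches_full = token_ids == encode(apply_chat_template(messages))
print(repr(masked_text), matches_full)
```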

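While we have validated that this approach produces results consistent with full message tokenization, the chat templates of future models could break compatibility. To guard against silent inconsistencies, we by default compare the delta-based tokenization with the full tokenization at the end of each rollout.

If you see the following warning, you can check the mismatched substring in the log:

Inconsistent training and inference tokenization detected. This may lead to unexpected behavior during training. Please review your chat template to determine if this is intentional. For more information, refer to the multiturn README.md.

The tokenization sanity check mode can be configured with the actor_rollout_ref.rollout.multi_turn.tokenization_sanity_check_mode parameter, which accepts the following values:

strict (default): Performs a strict comparison between the delta-based and full tokenization results and raises a warning for any difference.
ignore_strippable: Ignores differences in whitespace characters (\n, \t, \r, spaces) while still checking for meaningful text mismatches. This is useful when debugging chat template issues where whitespace variations are expected and acceptable.
disable: Completely disables the tokenization sanity check. Use this only if you have thoroughly validated that the tokenization discrepancies are expected and will not affect training.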
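
The difference between the three modes can be sketched as follows. This is an illustration of the idea only, comparing two rendered strings with difflib from the Python standard library; the function name tokenization_mismatches is hypothetical and the actual check in verl may work differently (for example on token IDs).

```python
import difflib

STRIPPABLE = "\n\t\r "


def tokenization_mismatches(full_text: str, delta_text: str, mode: str = "strict") -> list[tuple[str, str]]:
    """Return (full, delta) substring pairs that count as mismatches under `mode`."""
    if mode == "disable":
        return []
    if mode not in ("strict", "ignore_strippable"):
        raise ValueError(f"unknown mode: {mode!r}")
    matcher = difflib.SequenceMatcher(None, full_text, delta_text, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        full_part, delta_part = full_text[i1:i2], delta_text[j1:j2]
        whitespace_only = not full_part.strip(STRIPPABLE) and not delta_part.strip(STRIPPABLE)
        if mode == "ignore_strippable" and whitespace_only:
            continue  # tolerated: the two sides differ only in whitespace
        mismatches.append((full_part, delta_part))
    return mismatches


full = "<|im_start|>assistant\n\nhello<|im_end|>"
delta = "<|im_start|>assistant\nhello<|im_end|>"
for mode in ("strict", "ignore_strippable", "disable"):
    print(mode, tokenization_mismatches(full, delta, mode))
```

In this example the two strings differ by a single newline, so only strict reports a mismatch.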

Example configuration:

actor_rollout_ref:
    rollout:
        multi_turn:
            tokenization_sanity_check_mode: "ignore_strippable"  # Choices: "disable", "ignore_strippable", "strict"

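Handling Multi-Modal Inputs in Datasets

If your dataset includes multi-modal inputs (such as images or videos), you can control whether they are pre-processed and included in each sample by setting the return_multi_modal_inputs flag in your dataset configuration (used by RLHFDataset).

return_multi_modal_inputs: True (default): The dataset pre-processes the inputs and includes a multi_modal_inputs dictionary in each sample. This dictionary contains the model-ready representations (e.g., image tensors, video tensors) produced by your processor. This is useful for single-turn or SFT-style training, where the model expects all modalities to be present in the batch.
return_multi_modal_inputs: False: The dataset does not include the multi_modal_inputs field. This is recommended for multi-turn RL or tool-augmented rollouts, where the model may generate new multi-modal inputs dynamically during rollout and you want to avoid conflicts or redundant data in the batch.

Special Cases

Some models (e.g., Qwen/QwQ-32B and the Qwen3 series) remove internal reasoning content when rendering the chat template. As a result, the rendered content of the same message can differ from one turn to the next, which makes delta-based tokenization inaccurate.

For example, for the following conversation: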

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>user asked about a simple math question.</think> 2 + 2 = 4."},
    {"role": "user", "content": "Explain why."},
    {"role": "assistant", "content": "<think>user wants to know the reasoning behind the answer. Search for a good explanation</think>",
     "tool_calls": [{"id": "tool1", "type": "search", "arguments": {"query": "Why is 2 + 2 = 4?"}}]},
    {"role": "tool", "content": "The sum of two and two is four because it is a basic arithmetic operation."},
    {"role": "assistant", "content": "<think>The tool provided a good explanation.</think>The sum of two and two is four because it is a basic arithmetic operation."}
]

After the chat template is applied, Qwen/QwQ-32B removes the reasoning content from every assistant message except the last one.

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2 + 2?<|im_end|>
<|im_start|>assistant
 2 + 2 = 4.<|im_end|>
<|im_start|>user
Explain why.<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "", "arguments": {"query": "Why is 2 + 2 = 4?"}}
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
The sum of two and two is four because it is a basic arithmetic operation.
</tool_response><|im_end|>
<|im_start|>assistant
<think>The tool provided a good explanation.</think> The sum of two and two is four because it is a basic arithmetic operation.<|im_end|>

The Qwen3 series removes all reasoning content that precedes the last user message.

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2 + 2?<|im_end|>
<|im_start|>assistant
 2 + 2 = 4.<|im_end|>
<|im_start|>user
Explain why.<|im_end|>
<|im_start|>assistant
<think>
user wants to know the reasoning behind the answer. Search for a good explanation
</think>

<tool_call>
{"name": "", "arguments": {"query": "Why is 2 + 2 = 4?"}}
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
The sum of two and two is four because it is a basic arithmetic operation.
</tool_response><|im_end|>
<|im_start|>assistant
<think>
The tool provided a good explanation.
</think>

The sum of two and two is four because it is a basic arithmetic operation.<|im_end|>

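To handle this, we fall back to a fixed base conversation that contains only a single system message and a single user message. Because this base contains no assistant messages or reasoning content, it stays the same across turns.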

BASE_CHAT_HISTORY = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "I am a user."}
]
prev = tokenizer.apply_chat_template(BASE_CHAT_HISTORY, add_generation_prompt=True, tokenize=False)
curr = tokenizer.apply_chat_template([*BASE_CHAT_HISTORY, messages[i]], add_generation_prompt=False, tokenize=False)
new_token_ids = tokenizer.encode(curr[len(prev):], add_special_tokens=False)
token_ids += new_token_ids
loss_mask += [1] * len(new_token_ids)  # only the new assistant tokens contribute to the loss
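
The toy example below shows why the prefix-based delta breaks under such a template and why the fixed base avoids the problem. The render function is a stand-in written for this example: it imitates the behavior described above by keeping the reasoning block only when the assistant message is the last message, and it is not the real Qwen template.

```python
import re


def render(messages, add_generation_prompt=False):
    """Toy template: strips <think>...</think> from every assistant message except the last message."""
    parts = []
    for idx, m in enumerate(messages):
        content = m["content"]
        if m["role"] == "assistant" and idx != len(messages) - 1:
            content = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
        parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>simple math</think>4"},
    {"role": "user", "content": "Explain why."},
    {"role": "assistant", "content": "<think>recall arithmetic</think>Basic arithmetic."},
]

# The first assistant turn renders with reasoning when it is generated,
# but without it once a later message exists, so the old text is no longer a prefix.
as_generated = render(messages[:2])
as_prefix = render(messages[:3])
prefix_is_stable = as_prefix.startswith(as_generated)

# Fixed base conversation: every assistant message is rendered as the last message.
BASE_CHAT_HISTORY = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "I am a user."},
]
base = render(BASE_CHAT_HISTORY, add_generation_prompt=True)
deltas = [render([*BASE_CHAT_HISTORY, m])[len(base):] for m in messages if m["role"] == "assistant"]

print(prefix_is_stable)
print(deltas)
```

Each delta keeps its own reasoning block regardless of what follows in the conversation, which is the property the fallback relies on.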

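This method works well for the Qwen3 series. However, Qwen/QwQ-32B currently has a bug in its chat template. A fix_ has been proposed but not yet adopted. Until then, use the following commands to download the fixed model revision: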

pip install huggingface_hub
huggingface-cli download Qwen/QwQ-32B --revision refs/pr/81

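Discrepancy Between Training and Inference Templates

Although the approach above fixes the delta mismatch, the removal of reasoning content by the inference-time chat template introduces a new discrepancy: training uses the full reasoning content, while inference does not.

This mismatch can affect model performance in unpredictable ways. To avoid it, we by default use the full response (including reasoning) for both training and rollout.

However, this approach comes with trade-offs:

Long reasoning content can easily exceed the model's context window, especially in multi-turn rollouts.
There is now a mismatch between rollout and production: if you use the default chat template in production, the model will not see the reasoning content from past turns.

We are still evaluating the impact of these issues. If you run into context length problems, or prefer rollouts that match production (i.e., exclude reasoning), you can enable: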

actor_rollout_ref.rollout.multi_turn.use_inference_chat_template = True

GSM8K Multi-turn Training Performance

See the training performance of multi-turn rollout on the GSM8K task here_.

Interaction System

For dynamic conversational feedback during RL training, see:

Interaction System for Multi-turn RL Training

Search Tool Integration

Search Tool Integration

Code Walkthrough

If you want to understand the code execution flow in more depth, read https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/rlhf/verl/multi-turn/code-walk-through