Trace Function Usage Guide

Last updated: October 7, 2025.

Applicable Scenarios

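Agentic RL involves multi-turn conversations, tool calls, and user interactions during rollout. During model training, you need to track function calls, inputs, and outputs to understand how data flows through the application. The trace feature records each function's inputs, outputs, and timestamps, so that in a complex multi-turn conversation you can inspect every interaction and see how the data is transformed on the way to the final output. This helps you understand how the model processes data and improve training results.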

The trace feature integrates two commonly used agent trace tools, wandb weave and mlflow. You can choose whichever suits your needs and preferences. The sections below describe how to use each one.

Trace Parameter Configuration

actor_rollout_ref.rollout.trace.backend=mlflow|weave # trace backend type
actor_rollout_ref.rollout.trace.token2text=True # show decoded text in the trace view
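
The two options above are given as key=value overrides. As a minimal sketch, the hypothetical helper build_trace_overrides below (not part of verl) formats them as strings and rejects any backend other than the two named in this guide.

```python
# Hypothetical helper, not part of verl: it only formats the two
# documented options as key=value override strings.
SUPPORTED_BACKENDS = ("mlflow", "weave")  # the two backends named in this guide


def build_trace_overrides(backend: str, token2text: bool = True) -> list[str]:
    """Return key=value override strings for the trace options."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported trace backend: {backend!r}")
    return [
        f"actor_rollout_ref.rollout.trace.backend={backend}",
        f"actor_rollout_ref.rollout.trace.token2text={token2text}",
    ]


if __name__ == "__main__":
    for override in build_trace_overrides("weave"):
        print(override)
```

Validating the backend name up front is a design choice of this sketch: a typo fails immediately instead of surfacing later during training.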

Glossary

Rollout Trace Functions

Two functions are used for tracing:

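rollout_trace_op: a decorator that marks the functions to trace. By default only a few methods carry this decorator; add it to more functions to trace more information.
rollout_trace_attr: marks the entry point of a trajectory and supplies some information to the trace. If you add a new type of agent, you may need to add this function to enable tracing.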
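
To illustrate what a tracing decorator of this kind does, here is a simplified stand-in. It is not verl's implementation: the names trace_op, TRACE_LOG, and call_tool are hypothetical, and records are kept in an in-memory list rather than reported to weave or mlflow. It only shows the idea of capturing a function's inputs, output, and timestamps.

```python
import functools
import time

# Simplified stand-in for illustration only. This sketch appends records to
# an in-memory list instead of sending them to a trace backend.
TRACE_LOG: list[dict] = []


def trace_op(func):
    """Record the inputs, output, and timestamps of each call to func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        TRACE_LOG.append(
            {
                "name": func.__qualname__,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "start": start,
                "end": time.time(),
            }
        )
        return result

    return wrapper


@trace_op
def call_tool(name: str, query: str) -> str:
    # Hypothetical tool call standing in for a real agent step.
    return f"{name} result for {query}"


if __name__ == "__main__":
    call_tool("search", query="weather")
    print(TRACE_LOG[0]["name"], TRACE_LOG[0]["output"])
```

Marking an additional function for tracing then amounts to adding one decorator line above its definition.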

wandb weave Usage

1.1 Basic Configuration

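Set the WANDB_API_KEY environment variable.
Configure the following parameters:
1. actor_rollout_ref.rollout.trace.backend=weave
2. trainer.logger=['console', 'wandb']: optional. Trace and logger are independent features. When using Weave, enabling the wandb logger as well is recommended so that both are available in one system.
3. trainer.project_name=$project_name
4. trainer.experiment_name=$experiment_name
5. actor_rollout_ref.rollout.mode=async: because trace is mainly used for agentic RL, the async mode for agent tools must be enabled for vllm or sglang.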
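
The checklist above can be sketched as a small preflight check. The function check_weave_setup and the placeholder project and experiment names are hypothetical and not part of verl; the sketch only verifies that the documented settings are present before a run is launched.

```python
import os

# Hypothetical preflight check, not part of verl. It mirrors the checklist
# above: an API key in the environment, the weave backend, async rollout,
# and non-empty project and experiment names.


def check_weave_setup(overrides: dict[str, str], env: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was found."""
    problems = []
    if not env.get("WANDB_API_KEY"):
        problems.append("WANDB_API_KEY is not set")
    if overrides.get("actor_rollout_ref.rollout.trace.backend") != "weave":
        problems.append("trace backend is not weave")
    if overrides.get("actor_rollout_ref.rollout.mode") != "async":
        problems.append("rollout mode is not async")
    for key in ("trainer.project_name", "trainer.experiment_name"):
        if not overrides.get(key):
            problems.append(f"{key} is empty")
    return problems


if __name__ == "__main__":
    overrides = {
        "actor_rollout_ref.rollout.trace.backend": "weave",
        "actor_rollout_ref.rollout.mode": "async",
        "trainer.project_name": "demo_project",  # placeholder value
        "trainer.experiment_name": "demo_experiment",  # placeholder value
    }
    print(check_weave_setup(overrides, dict(os.environ)) or "ok")
```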

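Note: The free Weave plan includes 1 GB of network traffic per month. Training generates a very large volume of trace data, possibly tens of GB per day, so choose a wandb plan accordingly.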

1.2 Viewing Trace Logs

After training has run, the project page shows a WEAVE sidebar. Click Traces to view the traces.

Each trace entry corresponds to one trajectory. You can filter by step, sample_index, rollout_n, and experiment_name to select the trajectories you want to view.

When token2text is enabled, prompt_text and response_text are automatically added to the output of ToolAgentLoop.run, which makes the input and output content easy to inspect.

https://github.com/eric-haibin-lin/verl-community/blob/main/docs/weave_trace_list.png?raw=true

1.3 Comparing Trace Logs

In Weave, you can select multiple trace entries and compare the differences between them.

https://github.com/eric-haibin-lin/verl-community/blob/main/docs/weave_trace_compare.png?raw=true

mlflow Usage

1. Basic Configuration

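Set the MLFLOW_TRACKING_URI environment variable. It can be either:
1. An http or https URL that points to an online service.
2. A local file or directory, for example sqlite:////tmp/mlruns.db, which stores the data in /tmp/mlruns.db. When using a local file, initialize it first (for example, by starting the UI: mlflow ui --backend-store-uri sqlite:////tmp/mlruns.db) to avoid conflicts when multiple workers try to create the file at the same time.
Configure the following parameters:
1. actor_rollout_ref.rollout.trace.backend=mlflow
2. trainer.logger=['console', 'mlflow']: optional. Trace and logger are independent features. When using mlflow, enabling the mlflow logger as well is recommended so that both are available in one system.
3. trainer.project_name=$project_name
4. trainer.experiment_name=$experiment_name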
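
The kinds of MLFLOW_TRACKING_URI value described above can be told apart with plain string handling. The sketch below is illustrative only: describe_tracking_uri is a hypothetical helper, not part of verl or mlflow. It also shows why the sqlite example has four slashes: the prefix is sqlite:/// and the file path /tmp/mlruns.db follows it.

```python
# Hypothetical helper, not part of verl or mlflow. It only distinguishes the
# kinds of MLFLOW_TRACKING_URI values described above.
SQLITE_PREFIX = "sqlite:///"


def describe_tracking_uri(uri: str) -> tuple[str, str]:
    """Return ("remote", uri) for http(s) URLs, ("sqlite", path) for
    sqlite URIs, or ("local", uri) for anything else."""
    if uri.startswith(("http://", "https://")):
        return ("remote", uri)
    if uri.startswith(SQLITE_PREFIX):
        # Everything after "sqlite:///" is the database file path, so the
        # fourth slash in "sqlite:////tmp/mlruns.db" starts an absolute path.
        return ("sqlite", uri[len(SQLITE_PREFIX):])
    return ("local", uri)


if __name__ == "__main__":
    print(describe_tracking_uri("sqlite:////tmp/mlruns.db"))
```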

2. Viewing Logs

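trainer.project_name corresponds to Experiments in mlflow, so in the mlflow view, select the matching project name and then click the "Traces" tab to view the traces. trainer.experiment_name corresponds to the experiment_name tag, and step, sample_index, rollout_n, and so on are recorded as tags that you can use for filtering and viewing.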

For example, searching for "tags.step = '1'" shows all trajectories of step 1.
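
Filter strings of this form can be assembled from tag values. The helper build_trace_filter below is hypothetical and not part of verl or mlflow; joining several conditions with AND is an assumption about the search syntax, so check it against the mlflow version you use.

```python
# Hypothetical helper, not part of verl or mlflow. It formats tag conditions
# in the same "tags.<key> = '<value>'" form as the search example above.
# Joining conditions with AND is an assumption about the search syntax.


def build_trace_filter(**tags) -> str:
    """Build a filter string from tag names and values."""
    return " AND ".join(f"tags.{key} = '{value}'" for key, value in tags.items())


if __name__ == "__main__":
    print(build_trace_filter(step=1, sample_index=0))
```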

https://github.com/eric-haibin-lin/verl-community/blob/main/docs/mlflow_trace_list.png?raw=true

Open one of the trajectories to see each function call within it.

When token2text is enabled, prompt_text and response_text are automatically added to the output of ToolAgentLoop.run, which makes the content easy to inspect.

https://github.com/eric-haibin-lin/verl-community/blob/main/docs/mlflow_trace_view.png?raw=true

Note:

mlflow does not support comparing multiple traces.
rollout_trace cannot associate mlflow traces with a run, so trace content does not appear in the mlflow run logs.