Trainer Interfaces

Last updated: August 6, 2025 (API docstrings are auto-generated).

The trainer drives the training loop. Introducing new trainer classes is encouraged as new training paradigms emerge.

verl.trainer.ppo.ray_trainer.RayPPOTrainer

Distributed PPO trainer using Ray for scalable reinforcement learning.

Core APIs

class verl.trainer.ppo.ray_trainer.RayPPOTrainer(config, tokenizer, role_worker_mapping: dict[~verl.trainer.ppo.utils.Role, type[~verl.single_controller.base.worker.Worker]], resource_pool_manager: ~verl.trainer.ppo.ray_trainer.ResourcePoolManager, ray_worker_group_cls: type[~verl.single_controller.ray.base.RayWorkerGroup] = <class 'verl.single_controller.ray.base.RayWorkerGroup'>, processor=None, reward_fn=None, val_reward_fn=None, train_dataset: ~torch.utils.data.dataset.Dataset | None = None, val_dataset: ~torch.utils.data.dataset.Dataset | None = None, collate_fn=None, train_sampler: ~torch.utils.data.sampler.Sampler | None = None, device_name=None)[source]

Distributed PPO trainer using Ray for scalable reinforcement learning.

This trainer orchestrates distributed PPO training across multiple nodes and GPUs, managing actor rollouts, critic training, and reward computation with the Ray backend. It supports multiple backends, including FSDP and Megatron for training and vLLM and SGLang for rollout generation.

__init__(config, tokenizer, role_worker_mapping: dict[~verl.trainer.ppo.utils.Role, type[~verl.single_controller.base.worker.Worker]], resource_pool_manager: ~verl.trainer.ppo.ray_trainer.ResourcePoolManager, ray_worker_group_cls: type[~verl.single_controller.ray.base.RayWorkerGroup] = <class 'verl.single_controller.ray.base.RayWorkerGroup'>, processor=None, reward_fn=None, val_reward_fn=None, train_dataset: ~torch.utils.data.dataset.Dataset | None = None, val_dataset: ~torch.utils.data.dataset.Dataset | None = None, collate_fn=None, train_sampler: ~torch.utils.data.sampler.Sampler | None = None, device_name=None)[source]

Initialize distributed PPO trainer with Ray backend. Note that this trainer runs on the driver process on a single CPU/GPU node.

Parameters:
  • config – Configuration object containing training parameters.

  • tokenizer – Tokenizer used for encoding and decoding text.

  • role_worker_mapping (dict[Role, WorkerType]) – Mapping from roles to worker classes.

  • resource_pool_manager (ResourcePoolManager) – Manager for Ray resource pools.

  • ray_worker_group_cls (RayWorkerGroup, optional) – Class for Ray worker groups. Defaults to RayWorkerGroup.

  • processor – Optional data processor, used for multimodal data.

  • reward_fn – Function for computing rewards during training.

  • val_reward_fn – Function for computing rewards during validation.

  • train_dataset (Optional[Dataset], optional) – Training dataset. Defaults to None.

  • val_dataset (Optional[Dataset], optional) – Validation dataset. Defaults to None.

  • collate_fn – Function to collate data samples into batches.

  • train_sampler (Optional[Sampler], optional) – Sampler for the training dataset. Defaults to None.

  • device_name (str, optional) – Device name for training (e.g., “cuda”, “cpu”). Defaults to None.

fit()[source]

The training loop of PPO. The driver process only needs to call the compute functions of the worker groups through RPC to construct the PPO dataflow. The lightweight advantage computation is done on the driver process.
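The dataflow above can be sketched as a toy loop in plain Python. The functions below are stubbed placeholders, not verl's actual worker-group API; they only illustrate the rollout → reward → advantage → update structure that the driver coordinates:

```python
# Conceptual sketch of the PPO driver dataflow in fit(). All worker
# calls are hypothetical stubs standing in for RPC calls to Ray
# worker groups; verl's real interfaces differ.

def rollout(prompts):
    # Stub: actor rollout workers would generate responses via RPC.
    return [p + " response" for p in prompts]

def compute_rewards(responses):
    # Stub: the reward function scores each full sequence.
    return [float(len(r)) for r in responses]

def compute_advantages(rewards, baseline):
    # The lightweight advantage computation runs on the driver itself.
    return [r - baseline for r in rewards]

def update_policy(advantages):
    # Stub: actor/critic workers would apply gradient updates via RPC.
    return sum(advantages) / len(advantages)

def ppo_step(prompts):
    responses = rollout(prompts)
    rewards = compute_rewards(responses)
    advantages = compute_advantages(rewards, baseline=sum(rewards) / len(rewards))
    return update_policy(advantages)
```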

init_workers()[source]

Initialize distributed training workers using Ray backend.

Creates:

  1. Ray resource pools from configuration

  2. Worker groups for each role (actor, critic, etc.)

Utils for tokenization.

verl.utils.tokenizer.hf_tokenizer(name_or_path, correct_pad_token=True, correct_gemma2=True, **kwargs)[source]

Create a Hugging Face pretrained tokenizer which correctly handles eos and pad tokens.

Parameters:
  • name_or_path (str) – The name or path of the tokenizer.

  • correct_pad_token (bool) – Whether to correct the pad token id.

  • correct_gemma2 (bool) – Whether to correct the gemma2 tokenizer.

Returns:

The pretrained tokenizer.

Return type:

transformers.PreTrainedTokenizer
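Many pretrained tokenizers define an eos token but no pad token, which breaks batch padding. A common correction is to fall back to the eos token as the pad token when none is defined. The sketch below illustrates that idea with a hypothetical stub tokenizer; it is not verl's actual implementation of correct_pad_token:

```python
class StubTokenizer:
    # Minimal stand-in for a Hugging Face tokenizer's pad/eos attributes.
    def __init__(self, eos_token, eos_token_id, pad_token=None, pad_token_id=None):
        self.eos_token = eos_token
        self.eos_token_id = eos_token_id
        self.pad_token = pad_token
        self.pad_token_id = pad_token_id

def correct_pad_token(tokenizer):
    # If no pad token is defined, fall back to the eos token so batch
    # padding works. This mirrors the idea behind correct_pad_token=True;
    # the real verl logic may handle more cases.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
    return tokenizer
```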

Core functions to implement PPO algorithms. The function implemented in this file should be used by trainer with different distributed strategies to implement PPO-like algorithms.

verl.trainer.ppo.core_algos.agg_loss(loss_mat: Tensor, loss_mask: Tensor, loss_agg_mode: str)[source]

Aggregate the loss matrix into a scalar.

Parameters:
  • loss_mat (torch.Tensor) – shape (bs, response_length)

  • loss_mask (torch.Tensor) – shape (bs, response_length)

  • loss_agg_mode (str) – Method used to aggregate the loss matrix into a scalar.

Returns:

aggregated loss

Return type:

a scalar torch.Tensor
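The aggregation semantics can be illustrated in plain Python over nested lists (the real function operates on torch tensors). The mode names below follow the commonly seen "token-mean" and "seq-mean-token-mean" conventions; treat them as assumptions, not the library's exhaustive list of choices:

```python
def agg_loss_sketch(loss_mat, loss_mask, mode="token-mean"):
    # Pure-Python sketch of aggregating a (bs, response_length) loss
    # matrix into a scalar, using only unmasked tokens.
    if mode == "token-mean":
        # Mean over all unmasked tokens, pooled across the whole batch.
        total = sum(l * m for row_l, row_m in zip(loss_mat, loss_mask)
                    for l, m in zip(row_l, row_m))
        count = sum(m for row in loss_mask for m in row)
        return total / count
    if mode == "seq-mean-token-mean":
        # Per-sequence token mean first, then mean across sequences,
        # so short and long responses weigh equally.
        seq_means = []
        for row_l, row_m in zip(loss_mat, loss_mask):
            n = sum(row_m)
            seq_means.append(sum(l * m for l, m in zip(row_l, row_m)) / n)
        return sum(seq_means) / len(seq_means)
    raise ValueError(f"unknown mode: {mode}")
```

Note how the two modes differ: "token-mean" weights every token equally, while "seq-mean-token-mean" weights every sequence equally regardless of its length.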

verl.trainer.ppo.core_algos.compute_policy_loss(old_log_prob, log_prob, advantages, response_mask, cliprange=None, cliprange_low=None, cliprange_high=None, clip_ratio_c=3.0, loss_agg_mode: str = 'token-mean')[source]

Compute the clipped policy objective and related metrics for PPO.

Adapted from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1122

Parameters:
  • old_log_prob (torch.Tensor) – Log-probabilities of actions under the old policy, shape (batch_size, response_length).

  • log_prob (torch.Tensor) – Log-probabilities of actions under the current policy, shape (batch_size, response_length).

  • advantages (torch.Tensor) – Advantage estimates for each action, shape (batch_size, response_length).

  • response_mask (torch.Tensor) – Mask indicating which tokens to include in the loss, shape (batch_size, response_length).

  • cliprange (float, optional) – Clipping parameter ε for standard PPO. See https://arxiv.org/abs/1707.06347. Defaults to None (must be provided).

  • cliprange_low (float, optional) – Lower clip range for dual-clip PPO. Defaults to same as cliprange.

  • cliprange_high (float, optional) – Upper clip range for dual-clip PPO. Defaults to same as cliprange.

  • clip_ratio_c (float, optional) – Lower bound of the ratio for dual-clip PPO. See https://arxiv.org/pdf/1912.09729. Defaults to 3.0.

  • loss_agg_mode (str, optional) – Aggregation mode for agg_loss. Defaults to “token-mean”.
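For a single token, the clipped objective reduces to a few lines of scalar math. The sketch below shows the standard clipping term only (no dual-clip lower bound, no masking or aggregation) and is a simplification of what compute_policy_loss does per token:

```python
import math

def ppo_token_loss(old_logp, logp, advantage, cliprange=0.2):
    # Clipped PPO objective for a single token: a scalar sketch of
    # the per-token math inside compute_policy_loss.
    ratio = math.exp(logp - old_logp)
    clipped = max(1.0 - cliprange, min(ratio, 1.0 + cliprange))
    # Negate because we minimize the loss but maximize the objective.
    return -min(ratio * advantage, clipped * advantage)
```

With identical policies the ratio is 1 and the loss is simply the negated advantage; when the ratio drifts outside [1 - ε, 1 + ε], clipping caps the incentive to move further.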

verl.trainer.ppo.core_algos.kl_penalty(logprob: FloatTensor, ref_logprob: FloatTensor, kl_penalty) FloatTensor[source]

Compute the KL divergence given logprob and ref_logprob. Optionally uses a straight-through trick to combine the k2 estimator with another KL penalty computation method for unbiased KL gradient estimation. See http://joschu.net/blog/kl-approx.html for details.

Parameters:
  • logprob – Log-probabilities under the current policy.

  • ref_logprob – Log-probabilities under the reference policy.

  • kl_penalty (str) – Which KL estimator to use.

Returns:

kl_estimate

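The estimators from the linked blog post are easy to write down for scalars (the real function operates on FloatTensors). The estimator names below ("kl", "mse", "low_var_kl") are assumptions about the accepted kl_penalty strings, not a confirmed list:

```python
import math

def kl_estimate(logprob, ref_logprob, kind="low_var_kl"):
    # Per-token KL estimators from http://joschu.net/blog/kl-approx.html,
    # sketched for scalar log-probabilities.
    log_ratio = ref_logprob - logprob
    if kind == "kl":
        # k1 = -log r: unbiased but high variance.
        return -log_ratio
    if kind == "mse":
        # k2 = (log r)^2 / 2: low variance but biased.
        return 0.5 * log_ratio ** 2
    if kind == "low_var_kl":
        # k3 = (r - 1) - log r: unbiased, low variance, always >= 0.
        return math.exp(log_ratio) - 1.0 - log_ratio
    raise ValueError(f"unknown estimator: {kind}")
```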

verl.trainer.ppo.reward.compute_reward(data: DataProto, reward_fn: AbstractRewardManager) tuple[Tensor, dict[str, Any]][source]

Compute reward for a batch of data.

Parameters:
  • data – DataProto object containing the input data.

  • reward_fn – Reward function to compute the reward.

Returns:

Tuple of reward tensor and extra info dictionary.

verl.trainer.ppo.reward.load_reward_manager(config: DictConfig, tokenizer: Any, num_examine: int, **reward_kwargs: Any) AbstractRewardManager[source]

Load and initialize a reward manager based on the configuration.

Parameters:
  • config – PPO trainer configuration object containing reward_model fields.

  • tokenizer – Tokenizer object used for processing text.

  • num_examine – Number of samples to examine.

  • **reward_kwargs – Additional keyword arguments for the reward manager.

Returns:

An instance of the specified reward manager class.

class verl.workers.reward_manager.NaiveRewardManager(tokenizer, num_examine, compute_score=None, reward_fn_key='data_source')[source]

The default reward manager, which scores each sample with a rule-based or user-provided compute_score function.

class verl.workers.reward_manager.DAPORewardManager(tokenizer, num_examine, compute_score=None, reward_fn_key='data_source', max_resp_len=None, overlong_buffer_cfg=None)[source]

The reward manager for DAPO, which additionally supports a soft penalty for overlong responses via overlong_buffer_cfg and max_resp_len.
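Reward managers produce a token-level reward tensor from sequence-level scores; a common convention is to place each sequence's scalar score at its final response token and zero elsewhere. The sketch below illustrates that convention with plain Python lists; the names, shapes, and placement rule are illustrative assumptions, not verl's exact code:

```python
def naive_reward_sketch(responses, response_lengths, compute_score):
    # Build a token-level reward "tensor" (nested lists): each row gets
    # its scalar sequence score at the final valid token, zero elsewhere.
    # This last-token placement is an assumed convention for sketching.
    max_len = max(response_lengths)
    reward = [[0.0] * max_len for _ in responses]
    for i, (resp, n) in enumerate(zip(responses, response_lengths)):
        reward[i][n - 1] = compute_score(resp)
    return reward
```

Placing the reward on the last token makes the sequence-level signal compatible with token-level advantage estimation downstream.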