Trainer Interface
Last updated: August 6, 2025 (API docstrings are auto-generated).
The trainer drives the training loop. Adding new trainer classes is encouraged when new training paradigms emerge.
Core APIs
- class verl.trainer.ppo.ray_trainer.RayPPOTrainer(config, tokenizer, role_worker_mapping: dict[Role, type[Worker]], resource_pool_manager: ResourcePoolManager, ray_worker_group_cls: type[RayWorkerGroup] = RayWorkerGroup, processor=None, reward_fn=None, val_reward_fn=None, train_dataset: Dataset | None = None, val_dataset: Dataset | None = None, collate_fn=None, train_sampler: Sampler | None = None, device_name=None)[source]
Distributed PPO trainer using Ray for scalable reinforcement learning.
This trainer orchestrates distributed PPO training across multiple nodes and GPUs, managing actor rollouts, critic training, and reward computation with the Ray backend. It supports multiple training and rollout backends, including FSDP, Megatron, vLLM, and SGLang integration.
- __init__(config, tokenizer, role_worker_mapping: dict[Role, type[Worker]], resource_pool_manager: ResourcePoolManager, ray_worker_group_cls: type[RayWorkerGroup] = RayWorkerGroup, processor=None, reward_fn=None, val_reward_fn=None, train_dataset: Dataset | None = None, val_dataset: Dataset | None = None, collate_fn=None, train_sampler: Sampler | None = None, device_name=None)[source]
Initialize distributed PPO trainer with Ray backend. Note that this trainer runs on the driver process on a single CPU/GPU node.
- Parameters:
config – Configuration object containing training parameters.
tokenizer – Tokenizer used for encoding and decoding text.
role_worker_mapping (dict[Role, WorkerType]) – Mapping from roles to worker classes.
resource_pool_manager (ResourcePoolManager) – Manager for Ray resource pools.
ray_worker_group_cls (RayWorkerGroup, optional) – Class for Ray worker groups. Defaults to RayWorkerGroup.
processor – Optional data processor, used for multimodal data.
reward_fn – Function for computing rewards during training.
val_reward_fn – Function for computing rewards during validation.
train_dataset (Optional[Dataset], optional) – Training dataset. Defaults to None.
val_dataset (Optional[Dataset], optional) – Validation dataset. Defaults to None.
collate_fn – Function to collate data samples into batches.
train_sampler (Optional[Sampler], optional) – Sampler for the training dataset. Defaults to None.
device_name (str, optional) – Device name for training (e.g., “cuda”, “cpu”). Defaults to None.
Utils for tokenization.
- verl.utils.tokenizer.hf_tokenizer(name_or_path, correct_pad_token=True, correct_gemma2=True, **kwargs)[source]
Create a Hugging Face pretrained tokenizer that correctly handles the eos and pad tokens.
- Parameters:
name_or_path (str) – The name or path of the pretrained tokenizer.
correct_pad_token (bool) – Whether to correct the pad token id.
correct_gemma2 (bool) – Whether to correct the gemma2 tokenizer.
- Returns:
The pretrained tokenizer.
- Return type:
transformers.PreTrainedTokenizer
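For intuition, the pad-token correction can be sketched as follows. This is a minimal, self-contained illustration with a hypothetical stand-in tokenizer class; the real logic lives in `verl.utils.tokenizer.hf_tokenizer` and operates on actual `transformers` tokenizers.

```python
class DummyTokenizer:
    """Hypothetical stand-in for transformers.PreTrainedTokenizer."""
    def __init__(self, eos_token, eos_token_id, pad_token=None, pad_token_id=None):
        self.eos_token = eos_token
        self.eos_token_id = eos_token_id
        self.pad_token = pad_token
        self.pad_token_id = pad_token_id

def correct_pad_token(tokenizer):
    """If the tokenizer has no pad token, reuse the eos token so batch
    padding works (a common convention for decoder-only models)."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
    return tokenizer

tok = correct_pad_token(DummyTokenizer(eos_token="</s>", eos_token_id=2))
print(tok.pad_token, tok.pad_token_id)  # → </s> 2
```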
Core functions implementing PPO algorithms. The functions in this file should be used by trainers with different distributed strategies to implement PPO-like algorithms.
- verl.trainer.ppo.core_algos.agg_loss(loss_mat: Tensor, loss_mask: Tensor, loss_agg_mode: str)[source]
Aggregate the loss matrix into a scalar.
- Parameters:
loss_mat (torch.Tensor) – Loss matrix of shape (bs, response_length).
loss_mask (torch.Tensor) – Loss mask of shape (bs, response_length).
loss_agg_mode (str) – Method used to aggregate the loss matrix into a scalar.
- Returns:
The aggregated loss, a scalar.
- Return type:
torch.Tensor
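For intuition, masked loss aggregation can be sketched in plain Python. The real `agg_loss` operates on torch tensors; the mode names `"token-mean"` and `"seq-mean-token-mean"` are assumed here for illustration, and other modes are omitted.

```python
# Pure-Python sketch of masked loss aggregation (illustrative only).
def agg_loss(loss_mat, loss_mask, loss_agg_mode="token-mean"):
    if loss_agg_mode == "token-mean":
        # Average over all unmasked tokens in the whole batch.
        total = sum(l * m for row_l, row_m in zip(loss_mat, loss_mask)
                    for l, m in zip(row_l, row_m))
        count = sum(m for row in loss_mask for m in row)
        return total / count
    if loss_agg_mode == "seq-mean-token-mean":
        # Per-sequence token mean, then mean over sequences.
        seq_means = []
        for row_l, row_m in zip(loss_mat, loss_mask):
            s = sum(l * m for l, m in zip(row_l, row_m))
            n = sum(row_m)
            seq_means.append(s / n)
        return sum(seq_means) / len(seq_means)
    raise ValueError(f"unknown loss_agg_mode: {loss_agg_mode}")

loss_mat  = [[1.0, 3.0], [2.0, 8.0]]
loss_mask = [[1,   1  ], [1,   0  ]]
print(agg_loss(loss_mat, loss_mask, "token-mean"))           # → 2.0
print(agg_loss(loss_mat, loss_mask, "seq-mean-token-mean"))  # → 2.0
```

Note that the two modes differ in general: token-mean weights every unmasked token equally across the batch, while seq-mean-token-mean gives every sequence equal weight regardless of its length.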
- verl.trainer.ppo.core_algos.compute_policy_loss(old_log_prob, log_prob, advantages, response_mask, cliprange=None, cliprange_low=None, cliprange_high=None, clip_ratio_c=3.0, loss_agg_mode: str = 'token-mean')[source]
Compute the clipped policy objective and related metrics for PPO.
Adapted from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1122
- Parameters:
old_log_prob (torch.Tensor) – Log-probabilities of actions under the old policy, shape (batch_size, response_length).
log_prob (torch.Tensor) – Log-probabilities of actions under the current policy, shape (batch_size, response_length).
advantages (torch.Tensor) – Advantage estimates for each action, shape (batch_size, response_length).
response_mask (torch.Tensor) – Mask indicating which tokens to include in the loss, shape (batch_size, response_length).
cliprange (float, optional) – Clipping parameter ε for standard PPO. See https://arxiv.org/abs/1707.06347. Defaults to None (must be provided).
cliprange_low (float, optional) – Lower clip range for dual-clip PPO. Defaults to same as cliprange.
cliprange_high (float, optional) – Upper clip range for dual-clip PPO. Defaults to same as cliprange.
clip_ratio_c (float, optional) – Lower bound of the ratio for dual-clip PPO. See https://arxiv.org/pdf/1912.09729. Defaults to 3.0.
loss_agg_mode (str, optional) – Aggregation mode for agg_loss. Defaults to “token-mean”.
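A scalar sketch of the clipped surrogate objective may help here. This illustration assumes a symmetric clip range and token-mean aggregation, and omits the dual-clip lower bound `clip_ratio_c` for brevity; the real implementation is fully tensorized.

```python
import math

def compute_policy_loss(old_log_prob, log_prob, advantages, response_mask,
                        cliprange=0.2):
    """Minimal sketch of the PPO clipped objective (illustrative only)."""
    losses, n_tokens = [], 0
    for olp_row, lp_row, adv_row, m_row in zip(old_log_prob, log_prob,
                                               advantages, response_mask):
        for olp, lp, adv, m in zip(olp_row, lp_row, adv_row, m_row):
            if not m:
                continue
            ratio = math.exp(lp - olp)  # importance ratio pi_new / pi_old
            clipped = max(min(ratio, 1 + cliprange), 1 - cliprange)
            # PPO maximizes min(ratio * A, clipped * A); as a loss, negate it.
            losses.append(-min(ratio * adv, clipped * adv))
            n_tokens += 1
    return sum(losses) / n_tokens  # "token-mean" aggregation

old_lp = [[math.log(0.5)]]
new_lp = [[math.log(0.9)]]   # ratio = 1.8, clipped to 1.2
adv    = [[1.0]]
mask   = [[1]]
print(round(compute_policy_loss(old_lp, new_lp, adv, mask), 2))  # → -1.2
```

The clipping keeps the objective from rewarding probability-ratio moves beyond ε away from 1, which stabilizes the policy update.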
- verl.trainer.ppo.core_algos.kl_penalty(logprob: FloatTensor, ref_logprob: FloatTensor, kl_penalty) → FloatTensor[source]
Compute the KL divergence given logprob and ref_logprob. Optionally uses a straight-through trick to attach the k2 gradient to other KL penalty estimators for unbiased KL gradient estimation. See http://joschu.net/blog/kl-approx.html for more details.
- Parameters:
logprob (torch.FloatTensor) – Log-probabilities under the current policy.
ref_logprob (torch.FloatTensor) – Log-probabilities under the reference policy.
kl_penalty – Which KL estimator to use.
- Returns:
kl_estimate – The estimated KL divergence.
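The per-sample estimators from the linked blog post can be sketched as below. The k1/k2/k3 names follow that post; how `kl_penalty` maps its string argument onto these estimators is not shown here and this function is illustrative only.

```python
import math

def kl_estimate(logprob, ref_logprob, mode):
    """Per-sample estimators of KL(policy || reference), evaluated on a
    sample drawn from the policy (illustrative sketch)."""
    log_ratio = ref_logprob - logprob  # log(q/p) for a sample x ~ p
    if mode == "k1":
        return -log_ratio              # unbiased, high variance
    if mode == "k2":
        return 0.5 * log_ratio ** 2    # low variance, biased
    if mode == "k3":
        # exp(log_ratio) - 1 - log_ratio: unbiased and low variance,
        # always non-negative.
        return math.exp(log_ratio) - 1 - log_ratio
    raise ValueError(f"unknown mode: {mode}")

lp, ref = math.log(0.8), math.log(0.6)
print(round(kl_estimate(lp, ref, "k1"), 4))  # → 0.2877
```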
- verl.trainer.ppo.reward.compute_reward(data: DataProto, reward_fn: AbstractRewardManager) → tuple[Tensor, dict[str, Any]][source]
Compute reward for a batch of data.
- Parameters:
data – DataProto object containing the input data.
reward_fn – Reward function used to compute the reward.
- Returns:
Tuple of reward tensor and extra info dictionary.
- verl.trainer.ppo.reward.load_reward_manager(config: DictConfig, tokenizer: Any, num_examine: int, **reward_kwargs: Any) → AbstractRewardManager[source]
Load and initialize a reward manager based on the configuration.
- Parameters:
config – PPO trainer configuration object containing reward_model fields.
tokenizer – Tokenizer object used for processing text.
num_examine – Number of samples to examine.
**reward_kwargs – Additional keyword arguments for the reward manager.
- Returns:
An instance of the specified reward manager class.
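As a hypothetical sketch of the contract these two functions suggest, a reward manager is called on a batch and returns a per-token reward tensor plus an extra-info dictionary. Plain Python lists stand in for torch tensors and `DataProto` here; the class name, `score_fn`, and the last-token reward placement are illustrative assumptions, not verl's actual implementation.

```python
class NaiveRewardManager:
    """Hypothetical minimal manager: scores each response and places the
    score on the final response token (a common sparse-reward convention)."""

    def __init__(self, score_fn, num_examine=0):
        self.score_fn = score_fn
        self.num_examine = num_examine  # how many decoded samples to inspect

    def __call__(self, batch):
        rewards, scores = [], []
        for response in batch["responses"]:
            score = self.score_fn(response)
            row = [0.0] * len(response)
            row[-1] = score  # sparse reward at the last token
            rewards.append(row)
            scores.append(score)
        # Per-token reward "tensor" plus an extra-info dictionary.
        return rewards, {"score": scores}

manager = NaiveRewardManager(score_fn=lambda resp: float(sum(resp) > 5))
rewards, info = manager({"responses": [[1, 2, 3], [1, 1, 1]]})
print(rewards)  # → [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
```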