

Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training

October 6, 2025
作者: Wei Xiong, Chenlu Ye, Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang
cs.AI

Abstract

Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates due to fixed and uniform sampling of responses across prompts. Prior work such as GVM-RAFT addresses this by dynamically allocating inference budget per prompt to minimize stochastic gradient variance under a budget constraint. Inspired by this insight, we propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs that continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process, and automatically stops sampling for a prompt once sufficient signal is collected. To stabilize updates, we form fixed-size groups with enforced reward diversity and compute advantage baselines using global statistics aggregated over the adaptive sampling phase. Empirical results across multiple model architectures and reasoning benchmarks show that Reinforce-Ada accelerates convergence and improves final performance compared to GRPO, especially when using the balanced sampling variant. Our work highlights the central role of variance-aware, adaptive data curation in enabling efficient and reliable reinforcement learning for reasoning-capable LLMs. Code is available at https://github.com/RLHFlow/Reinforce-Ada.
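The adaptive loop the abstract describes (interleaved estimation and sampling with online successive elimination, fixed-size groups with enforced reward diversity, and a baseline from global statistics) can be illustrated with a minimal sketch. This is not the authors' implementation: `sample_fn`, `reward_fn`, the stopping rule, and the diversity heuristic below are all simplified stand-ins for the paper's actual LLM rollouts, verifier, and allocation policy.

```python
import random
import statistics

def reinforce_ada_sampling(prompts, sample_fn, reward_fn,
                           group_size=4, max_rounds=8):
    """Hedged sketch of the adaptive sampling phase described in the abstract.

    sample_fn(prompt) draws one response; reward_fn(prompt, response) scores
    it (e.g. 1.0 for a correct answer, 0.0 otherwise). Both are hypothetical
    stand-ins for the paper's rollout and verification machinery.
    """
    pools = {p: [] for p in prompts}   # (response, reward) pairs per prompt
    active = set(prompts)

    for _ in range(max_rounds):
        if not active:
            break
        for p in list(active):
            resp = sample_fn(p)
            pools[p].append((resp, reward_fn(p, resp)))
            rewards = [rw for _, rw in pools[p]]
            # Successive elimination: stop sampling a prompt once it has
            # yielded diverse rewards (enough signal for a nonzero
            # advantage) and can fill a fixed-size group.
            if len(set(rewards)) > 1 and len(rewards) >= group_size:
                active.discard(p)

    # Advantage baseline from global statistics over the whole phase.
    all_rewards = [rw for pool in pools.values() for _, rw in pool]
    mean = statistics.mean(all_rewards)
    std = statistics.pstdev(all_rewards) or 1.0

    groups = {}
    for p, pool in pools.items():
        # Enforce reward diversity: seed each group with one above-mean
        # and one at-or-below-mean sample when available, then fill up.
        pos = [x for x in pool if x[1] > mean]
        neg = [x for x in pool if x[1] <= mean]
        picked = pos[:1] + neg[:1]
        for item in pool:
            if len(picked) >= group_size:
                break
            if item not in picked:
                picked.append(item)
        groups[p] = [(resp, (rw - mean) / std) for resp, rw in picked]
    return groups
```

In a real trainer the returned groups would feed a GRPO-style policy-gradient update; the point of the sketch is only the control flow: sampling effort concentrates on prompts that have not yet produced diverse rewards, and prompts with sufficient signal are retired early.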