Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training
October 6, 2025
Authors: Wei Xiong, Chenlu Ye, Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang
cs.AI
Abstract
Reinforcement learning applied to large language models (LLMs) for reasoning
tasks is often bottlenecked by unstable gradient estimates due to fixed and
uniform sampling of responses across prompts. Prior work such as GVM-RAFT
addresses this by dynamically allocating inference budget per prompt to
minimize stochastic gradient variance under a budget constraint. Inspired by
this insight, we propose Reinforce-Ada, an adaptive sampling framework for
online RL post-training of LLMs that continuously reallocates sampling effort
to the prompts with the greatest uncertainty or learning potential. Unlike
conventional two-stage allocation methods, Reinforce-Ada interleaves estimation
and sampling in an online successive elimination process, and automatically
stops sampling for a prompt once sufficient signal is collected. To stabilize
updates, we form fixed-size groups with enforced reward diversity and compute
advantage baselines using global statistics aggregated over the adaptive
sampling phase. Empirical results across multiple model architectures and
reasoning benchmarks show that Reinforce-Ada accelerates convergence and
improves final performance compared to GRPO, especially when using the balanced
sampling variant. Our work highlights the central role of variance-aware,
adaptive data curation in enabling efficient and reliable reinforcement
learning for reasoning-capable LLMs. Code is available at
https://github.com/RLHFlow/Reinforce-Ada.
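
The abstract compresses the algorithm into a few clauses, so the sketch below unpacks one plausible reading in plain Python: prompts are sampled in successive rounds, a prompt is eliminated from the active pool once it has yielded enough signal, fixed-size groups are then formed with enforced reward diversity (at least one positive and one negative response), and advantages are normalized with global statistics aggregated over the whole adaptive phase. Everything here is an illustrative assumption rather than the repository's API: `sample_response`, `pass_rate`, the concrete stopping rule, and the default budgets are all placeholders; consult https://github.com/RLHFlow/Reinforce-Ada for the actual implementation.

```python
import random
import statistics

def sample_response(prompt):
    """Hypothetical stand-in for generating a response and scoring it.

    Returns a scalar reward (1.0 correct / 0.0 incorrect). The real
    framework would call the LLM policy and a verifier here."""
    return float(random.random() < prompt["pass_rate"])

def adaptive_sample(prompts, group_size=8, rounds=4, samples_per_round=4):
    """Online successive-elimination loop: interleave estimation and
    sampling, retiring a prompt once it has produced enough signal
    (here: both reward outcomes observed and at least group_size
    samples; a simple stand-in for the paper's stopping criterion)."""
    pool = {p["id"]: [] for p in prompts}    # all rewards seen per prompt
    active = {p["id"]: p for p in prompts}   # prompts still being sampled
    for _ in range(rounds):
        for pid in list(active):
            pool[pid].extend(sample_response(active[pid])
                             for _ in range(samples_per_round))
            rewards = pool[pid]
            if len(set(rewards)) > 1 and len(rewards) >= group_size:
                del active[pid]              # sufficient signal: stop sampling
    return pool

def grouped_advantages(pool, group_size=8):
    """Form fixed-size groups with enforced reward diversity, then compute
    advantages against a baseline built from *global* statistics: the mean
    and std over every response collected for the prompt during adaptive
    sampling, not just the downsampled group."""
    groups = {}
    for pid, rewards in pool.items():
        pos = [r for r in rewards if r > 0]
        neg = [r for r in rewards if r <= 0]
        if not pos or not neg:
            continue                         # no learning signal; drop prompt
        # Guarantee one of each outcome, fill the rest at random.
        group = [random.choice(pos), random.choice(neg)]
        rest = list(rewards)
        random.shuffle(rest)
        group += rest[: group_size - 2]
        mu = statistics.mean(rewards)        # global baseline
        sd = statistics.pstdev(rewards) or 1.0
        groups[pid] = [(r - mu) / sd for r in group]
    return groups

if __name__ == "__main__":
    prompts = [{"id": i, "pass_rate": p} for i, p in enumerate([0.1, 0.5, 0.9])]
    print(grouped_advantages(adaptive_sample(prompts)))
```

The design point mirrored here is that the baseline uses every response collected for a prompt during the adaptive phase, not only the fixed-size group passed to the policy update, which is one way to read the abstract's "global statistics aggregated over the adaptive sampling phase."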