BroRL: Scaling Reinforcement Learning via Broadened Exploration
October 1, 2025
Authors: Jian Hu, Mingjie Liu, Ximing Lu, Fang Wu, Zaid Harchaoui, Shizhe Diao, Yejin Choi, Pavlo Molchanov, Jun Yang, Jan Kautz, Yi Dong
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key
ingredient for unlocking complex reasoning capabilities in large language
models. Recent work ProRL has shown promise in scaling RL by increasing the
number of training steps. However, performance plateaus after thousands of
steps, with clear diminishing returns from allocating more computation to
additional training. In this work, we investigate a complementary paradigm for
scaling RL, BroRL: increasing the number of rollouts per example to hundreds to
exhaustively Broaden exploration, which yields continuous performance gains
beyond the saturation point observed in ProRL when scaling the number of
training steps. Our approach is motivated by a mass balance equation analysis
allowing us to characterize the rate of change in probability mass for correct
and incorrect tokens during the reinforcement process. We show that under a
one-step RL assumption, sampled rollout tokens always contribute to
correct-mass expansion, while unsampled tokens outside rollouts may lead to
gains or losses depending on their distribution and the net reward balance.
Importantly, as the number of rollouts per example N increases, the effect of
unsampled terms diminishes, ensuring overall correct-mass expansion. To
validate our theoretical analysis, we conduct simulations under more relaxed
conditions and find that a sufficiently large rollout size N, corresponding to
ample exploration, guarantees an increase in the probability mass of all correct
tokens. Empirically, BroRL revives models saturated after 3K ProRL training
steps and demonstrates robust, continuous improvement, achieving
state-of-the-art results for the 1.5B model across diverse benchmarks.
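
The abstract's central claim, that a sufficiently large rollout count N per example makes a one-step update reliably expand the probability mass on correct tokens, can be illustrated with a small Monte Carlo sketch. The Python snippet below is not the paper's implementation or its mass balance analysis; the toy vocabulary size, the +1/-1 verifiable reward scheme, the group-mean baseline, and the learning rate are all assumptions made here purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

VOCAB = 50        # assumed toy vocabulary size
N_CORRECT = 5     # assumed number of "correct" tokens
LR = 0.5          # assumed learning rate for the single update step
TRIALS = 2000     # Monte Carlo repetitions per rollout size


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def correct_mass_delta(n_rollouts):
    """Average change in total probability mass on correct tokens after one
    REINFORCE-style update built from n_rollouts sampled tokens."""
    deltas = []
    for _ in range(TRIALS):
        logits = rng.normal(size=VOCAB)
        p = softmax(logits)
        correct = np.zeros(VOCAB, dtype=bool)
        correct[:N_CORRECT] = True

        # Sample N rollout tokens and assign +1 / -1 verifiable rewards,
        # centered with a group-mean baseline (an assumption of this sketch).
        samples = rng.choice(VOCAB, size=n_rollouts, p=p)
        rewards = np.where(correct[samples], 1.0, -1.0)
        rewards = rewards - rewards.mean()

        # Policy-gradient estimate: (1/N) * sum_i r_i * d log p(a_i) / d logits
        grad = np.zeros(VOCAB)
        for a, r in zip(samples, rewards):
            one_hot = np.zeros(VOCAB)
            one_hot[a] = 1.0
            grad += r * (one_hot - p)
        grad /= n_rollouts

        new_p = softmax(logits + LR * grad)
        deltas.append(new_p[correct].sum() - p[correct].sum())

    deltas = np.array(deltas)
    return deltas.mean(), (deltas > 0).mean()


for n in (4, 16, 64, 256):
    mean_delta, frac_up = correct_mass_delta(n)
    print(f"N={n:4d}  mean change in correct mass={mean_delta:+.4f}  "
          f"fraction of trials improved={frac_up:.2f}")

In this toy setup, a small N often samples no correct token at all, so the centered rewards vanish and no correct-mass gain is realized in that trial; as N grows, correct tokens are sampled almost surely and the update expands correct mass in nearly every trial. This is the qualitative behavior the abstract attributes to broadened exploration, shown here only under the simplified assumptions listed above.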