RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments
November 10, 2025
作者: Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, Chenyang Zhao, Yulia Tsvetkov, Simon Shaolei Du, Natasha Jaques, Hao Peng, Pang Wei Koh, Hannaneh Hajishirzi
cs.AI
Abstract
We introduce Reinforcement Learning (RL) with Adaptive Verifiable
Environments (RLVE), an approach using verifiable environments that
procedurally generate problems and provide algorithmically verifiable rewards,
to scale up RL for language models (LMs). RLVE enables each verifiable
environment to dynamically adapt its problem difficulty distribution to the
policy model's capabilities as training progresses. In contrast, static data
distributions often lead to vanishing learning signals when problems are either
too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a
large-scale suite of 400 verifiable environments carefully developed through
manual environment engineering. Using RLVE-Gym, we show that environment
scaling, i.e., expanding the collection of training environments, consistently
improves generalizable reasoning capabilities. RLVE with joint training across
all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement
across six reasoning benchmarks, starting from one of the strongest 1.5B
reasoning LMs. By comparison, continuing this LM's original RL training yields
only a 0.49% average absolute gain despite using over 3x more compute. We
release our code publicly.
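
To make the core mechanism concrete, below is a minimal, hypothetical Python sketch of an adaptive verifiable environment. The class name AdaptiveEnv, the toy summation task, and the success-rate-based difficulty update are illustrative assumptions for exposition only, not the authors' released implementation.

    import random

    class AdaptiveEnv:
        """Procedurally generates problems, verifies answers algorithmically,
        and adapts problem difficulty to the policy's recent success rate
        (a stand-in for RLVE's adaptive difficulty idea)."""

        def __init__(self, min_difficulty=1, max_difficulty=20, window=64):
            self.difficulty = min_difficulty
            self.min_difficulty = min_difficulty
            self.max_difficulty = max_difficulty
            self.window = window
            self.recent = []  # rolling record of 0/1 rewards

        def generate_problem(self):
            # Toy task: sum a list whose length scales with difficulty.
            nums = [random.randint(0, 9) for _ in range(self.difficulty)]
            prompt = f"Compute the sum of {nums}."
            answer = sum(nums)
            return prompt, answer

        def reward(self, model_output, answer):
            # Algorithmically verifiable reward: exact match on the final number.
            try:
                return 1.0 if int(model_output.strip()) == answer else 0.0
            except ValueError:
                return 0.0

        def update_difficulty(self, r):
            # Keep problems near the edge of the policy's ability so the
            # learning signal does not vanish as training progresses.
            self.recent.append(r)
            if len(self.recent) >= self.window:
                rate = sum(self.recent) / len(self.recent)
                if rate > 0.8 and self.difficulty < self.max_difficulty:
                    self.difficulty += 1
                elif rate < 0.2 and self.difficulty > self.min_difficulty:
                    self.difficulty -= 1
                self.recent.clear()

In this sketch, a single environment instance would be sampled during RL rollouts: generate_problem supplies a prompt and ground-truth answer, reward scores the policy's output, and update_difficulty shifts the generator toward harder or easier problems depending on the recent pass rate.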