SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning
January 8, 2026
Authors: Caijun Xu, Changyi Xiao, Zhongyuan Peng, Xinrun Wang, Yixin Cao
cs.AI
Abstract
Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability, or when training is dominated by a narrow set of recurring problem patterns. To jointly address these issues, we propose SCALER (Synthetic sCalable Adaptive Learning Environment for Reasoning), a framework that sustains effective learning signals through adaptive environment design. SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model's capability frontier and maintain distributional diversity. This co-adaptation prevents reward sparsity, mitigates overfitting to narrow task patterns, and supports sustained improvement throughout training. Extensive experiments show that SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.
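As a rough illustration of the co-adaptation idea described in the abstract, the sketch below shows one way a training loop might raise or lower per-environment difficulty and swap saturated environments out of the active set. The abstract does not specify the actual rule, so everything here is an assumption made for illustration: the `Environment` class, the integer difficulty scale, the success-rate thresholds (0.3 and 0.7), the 16-sample window, and the reserve-pool replacement policy are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Hypothetical synthetic environment with an integer difficulty knob."""
    name: str
    difficulty: int = 1
    max_difficulty: int = 10  # assumed upper bound, not from the paper
    recent: list = field(default_factory=list)  # recent binary rewards (0 or 1)

    def record(self, reward: int, window: int = 16) -> None:
        """Store a verified reward, keeping only the last `window` entries."""
        self.recent.append(reward)
        del self.recent[:-window]

    def success_rate(self) -> float:
        """Mean recent reward; 0.5 (neutral) when nothing has been recorded."""
        return sum(self.recent) / len(self.recent) if self.recent else 0.5


def adjust_difficulty(env: Environment, low: float = 0.3, high: float = 0.7) -> None:
    """Move difficulty toward the band where rewards stay informative.

    Assumed rule: step up when the model succeeds too often, step down when
    it fails too often, and reset the window after every change.
    """
    rate = env.success_rate()
    if rate > high and env.difficulty < env.max_difficulty:
        env.difficulty += 1
        env.recent.clear()
    elif rate < low and env.difficulty > 1:
        env.difficulty -= 1
        env.recent.clear()


def curate(active: list, reserve: list, high: float = 0.7) -> list:
    """Replace environments that are solved at maximum difficulty.

    Assumed policy: a saturated environment is swapped for the next one in
    `reserve` (which is consumed in order); others stay in the active set.
    """
    curated = []
    for env in active:
        saturated = env.difficulty == env.max_difficulty and env.success_rate() > high
        if saturated and reserve:
            curated.append(reserve.pop(0))
        else:
            curated.append(env)
    return curated
```

In a training loop under these assumptions, each rollout's verified reward would be passed to `record`, then `adjust_difficulty` would run per environment and `curate` would run periodically over the active set. The thresholds bracket a middle band of success rates because rewards that are almost always 0 or almost always 1 carry little learning signal.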