SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning
January 8, 2026
Authors: Caijun Xu, Changyi Xiao, Zhongyuan Peng, Xinrun Wang, Yixin Cao
cs.AI
Abstract
Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability, or when training is dominated by a narrow set of recurring problem patterns. To jointly address these issues, we propose SCALER (Synthetic sCalable Adaptive Learning Environment for Reasoning), a framework that sustains effective learning signals through adaptive environment design. SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model's capability frontier and maintain distributional diversity. This co-adaptation prevents reward sparsity, mitigates overfitting to narrow task patterns, and supports sustained improvement throughout training. Extensive experiments show that SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.
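The abstract does not specify how difficulty is adjusted or how the active environment set is curated, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes an integer difficulty level per environment, pass-rate thresholds of 0.2 and 0.8, and a reserve pool of unused environments; the names `EnvState`, `adjust_difficulty`, and `curate_active_set` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EnvState:
    """Hypothetical per-environment record; not taken from the paper."""
    name: str
    difficulty: int = 1
    min_difficulty: int = 1
    max_difficulty: int = 10


def adjust_difficulty(env, pass_rate, low=0.2, high=0.8):
    """Nudge difficulty toward the model's capability frontier.

    Assumed rule: step up when the model solves most instances,
    step down when rewards become sparse, otherwise hold.
    """
    if pass_rate > high:
        env.difficulty = min(env.difficulty + 1, env.max_difficulty)
    elif pass_rate < low:
        env.difficulty = max(env.difficulty - 1, env.min_difficulty)
    return env.difficulty


def curate_active_set(active, reserve, pass_rates, high=0.8):
    """Replace saturated environments with fresh ones from a reserve pool.

    An environment counts as saturated (an assumption of this sketch)
    when it sits at maximum difficulty and is still mostly solved.
    """
    curated = []
    for env in active:
        solved = pass_rates.get(env.name, 0.0) > high
        saturated = env.difficulty == env.max_difficulty and solved
        if saturated and reserve:
            curated.append(reserve.pop(0))  # rotate in an unseen environment
        else:
            curated.append(env)
    return curated
```

In a training loop, one might call `adjust_difficulty` for each environment after every batch of rollouts and `curate_active_set` at a coarser interval; the thresholds and step size are tunable assumptions.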