JustRL: Scaling a 1.5B LLM with a Simple RL Recipe
December 18, 2025
Authors: Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, Ning Ding, Zhiyuan Liu
cs.AI
Abstract
Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: Is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.
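The recipe described in the abstract is, at its core, one training loop whose hyperparameters never change. The sketch below illustrates only that shape: a single stage, fixed settings, uniform prompt sampling, and a plain correctness reward with no length shaping. The trainer interface, hyperparameter names, and values are illustrative assumptions and are not taken from the JustRL paper or its released code.

```python
# Minimal sketch of "single-stage training with fixed hyperparameters".
# All names and values below are illustrative assumptions, not the paper's settings.
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedHyperparams:
    learning_rate: float = 1e-6      # held constant: no schedule or decay
    rollouts_per_prompt: int = 8     # held constant: no curriculum over difficulty
    max_response_tokens: int = 8192  # generation cap only; no explicit length penalty
    steps: int = 4000                # one uninterrupted stage


def train(policy, prompts, verifier, hp: FixedHyperparams) -> None:
    """One stage, one loop: sample, score with a correctness verifier, update.

    `policy`, `prompts`, and `verifier` are placeholders; the point is the
    absence of stage switches, hyperparameter schedules, or mid-run interventions.
    """
    for _ in range(hp.steps):
        batch = prompts.sample()                           # uniform sampling, no curriculum
        rollouts = policy.generate(
            batch,
            n=hp.rollouts_per_prompt,
            max_tokens=hp.max_response_tokens,
        )
        rewards = [verifier(r) for r in rollouts]          # 0/1 correctness, no length shaping
        policy.update(rollouts, rewards, lr=hp.learning_rate)  # same update rule every step
```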