JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

December 18, 2025
Authors: Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, Ning Ding, Zhiyuan Liu
cs.AI

Abstract

Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: Is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.
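
To make the recipe concrete, below is a minimal, self-contained sketch of what single-stage training with fixed hyperparameters could look like, assuming a GRPO-style group-relative baseline and a binary exact-match verifier. All names, values, and helper functions are illustrative placeholders, not the authors' released JustRL code or configuration.

```python
"""Illustrative sketch of a single-stage RL loop with fixed hyperparameters.

Everything below is hypothetical: the values, names, and stubbed rollouts are
stand-ins, not the authors' released JustRL configuration or implementation.
"""
from dataclasses import dataclass
import random
import statistics


@dataclass(frozen=True)
class RLConfig:
    # One set of hyperparameters, fixed for the whole run: no stage switches,
    # no learning-rate or KL schedules, no curriculum over problem difficulty.
    learning_rate: float = 1e-6    # illustrative value
    group_size: int = 8            # rollouts sampled per prompt
    batch_prompts: int = 256       # prompts per policy update
    total_steps: int = 4000        # single uninterrupted stage
    # Deliberately absent: explicit length penalty, multi-stage pipeline.


def binary_reward(answer: str, reference: str) -> float:
    """Simple exact-match verifier: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Center each rollout's reward on its group mean (group-relative baseline)."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


def train(cfg: RLConfig) -> None:
    for step in range(cfg.total_steps):
        # In a real system these rewards come from verifying rollouts sampled
        # from the current policy; random stand-ins keep the sketch runnable.
        rewards = [float(random.random() < 0.3) for _ in range(cfg.group_size)]
        advantages = group_relative_advantages(rewards)
        # A policy-gradient update weighted by `advantages` would go here,
        # using the same cfg.learning_rate at every step of the run.
        _ = advantages


if __name__ == "__main__":
    train(RLConfig(total_steps=3))  # tiny run just to show the loop executes
```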