
Reinforcement Mid-Training

September 29, 2025
Authors: Yijun Tian, Shaoyu Chen, Zhichao Xu, Yawei Wang, Jinhe Bi, Peng Han, Wei Wang
cs.AI

Abstract

The development of state-of-the-art large language models is commonly understood as a two-stage process involving pre-training and post-training. We point out the need for an additional intermediate stage called reinforcement mid-training with potential for strong performance gains. In this paper, we formally define the problem and identify three key challenges: (1) inefficient training due to excessive reasoning steps, (2) disregard of the imbalanced token entropy distribution, and (3) underutilization of token information. To address these challenges, we propose RMT, a framework for efficient, adaptive, and unified reinforcement mid-training with various innovative components. In particular, we first introduce a dynamic token budget mechanism that constrains unnecessary reasoning steps and mitigates model overthinking. Next, we design a curriculum-based adaptive sampling method that fosters a progressive learning trajectory from easy to hard tokens. Finally, we present a dual training strategy that combines reinforcement learning with next-token prediction, ensuring targeted learning on key tokens and full exploitation of all token information. Extensive experiments demonstrate the superiority of RMT over state-of-the-art methods, achieving up to +64.91% performance improvement with only 21% of the reasoning length in language modeling. We also show that checkpoints obtained after reinforcement mid-training can benefit the subsequent post-training, yielding up to +18.76% improvement in the mathematical domain.
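The abstract does not spell out the RMT objective in detail. As a rough illustration only, the sketch below shows one way a dual objective could combine a policy-gradient (RL) term restricted to high-entropy "key" tokens with a next-token-prediction term over all tokens. The function name, the entropy threshold, and the weighting scheme are assumptions for illustration, not the paper's actual formulation.

```python
# Hypothetical sketch of a dual training objective: a next-token-prediction (NTP)
# loss over all tokens plus a policy-gradient (RL) loss applied only to
# high-entropy "key" tokens. All names, the entropy threshold, and the weighting
# are assumptions; RMT's actual objective may differ.
import torch
import torch.nn.functional as F

def dual_training_loss(logits, target_ids, advantages,
                       entropy_threshold=2.0, rl_weight=1.0):
    """logits: (B, T, V); target_ids: (B, T); advantages: (B, T) per-token reward signal."""
    log_probs = F.log_softmax(logits, dim=-1)                                  # (B, T, V)
    token_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)    # (B, T)

    # NTP term: standard cross-entropy over every token, exploiting all token information.
    ntp_loss = -token_logp.mean()

    # Predictive entropy at each position, used to identify high-entropy "key" tokens.
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(-1)                                     # (B, T)

    # RL term: policy-gradient update restricted to the key tokens.
    key_mask = (entropy > entropy_threshold).float()
    rl_loss = -(advantages * token_logp * key_mask).sum() / key_mask.sum().clamp(min=1.0)

    return ntp_loss + rl_weight * rl_loss
```

In this sketch, the entropy mask plays the role of targeting learning on key tokens while the NTP term keeps every token contributing to the update; the actual mechanism and hyperparameters in RMT are described in the paper itself.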