

Reinforcement Mid-Training

September 29, 2025
Authors: Yijun Tian, Shaoyu Chen, Zhichao Xu, Yawei Wang, Jinhe Bi, Peng Han, Wei Wang
cs.AI

Abstract

The development of state-of-the-art large language models is commonly understood as a two-stage process involving pre-training and post-training. We point out the need for an additional intermediate stage called reinforcement mid-training with potential for strong performance gains. In this paper, we formally define the problem and identify three key challenges: (1) inefficient training due to excessive reasoning steps, (2) disregard of the imbalanced token entropy distribution, and (3) underutilization of token information. To address these challenges, we propose RMT, a framework for efficient, adaptive, and unified reinforcement mid-training with various innovative components. In particular, we first introduce a dynamic token budget mechanism that constrains unnecessary reasoning steps and mitigates model overthinking. Next, we design a curriculum-based adaptive sampling method that fosters a progressive learning trajectory from easy to hard tokens. Finally, we present a dual training strategy that combines reinforcement learning with next-token prediction, ensuring targeted learning on key tokens and full exploitation of all token information. Extensive experiments demonstrate the superiority of RMT over state-of-the-art methods, achieving up to +64.91% performance improvement with only 21% of the reasoning length in language modeling. We also show that checkpoints obtained after reinforcement mid-training can benefit the subsequent post-training, yielding up to +18.76% improvement in the mathematical domain.
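Of the three components described above, the dual training strategy is the most concrete to illustrate. The sketch below is a hypothetical, minimal rendering of the idea and not the authors' implementation: tokens are split by predictive entropy into high-entropy "key" tokens that receive a policy-gradient-style reinforcement signal, while standard next-token prediction is applied to every token so that no token information is discarded. The function name, the entropy threshold, and the specific REINFORCE-style estimator are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def dual_training_loss(logits, targets, advantages,
                       entropy_threshold=2.0, rl_weight=1.0, ntp_weight=1.0):
    """Hypothetical dual objective: RL on high-entropy tokens + NTP on all tokens.

    logits:     [batch, seq_len, vocab] model outputs
    targets:    [batch, seq_len] ground-truth next tokens
    advantages: [batch, seq_len] per-token advantages supplied by the RL algorithm
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability of each observed next token.
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Per-token predictive entropy; high-entropy tokens are treated as "key" tokens.
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    key_mask = (entropy > entropy_threshold).float()

    # REINFORCE-style term restricted to key tokens (assumed estimator, illustration only).
    rl_loss = -(advantages * token_log_probs * key_mask).sum() / key_mask.sum().clamp(min=1.0)

    # Standard next-token prediction over every token, so all token information is used.
    ntp_loss = -token_log_probs.mean()

    return rl_weight * rl_loss + ntp_weight * ntp_loss
```

In a full pipeline, the dynamic token budget would additionally cap the number of reasoning tokens generated per example, and the curriculum-based sampler would schedule training from easy (low-entropy) to hard (high-entropy) tokens; both are omitted from this sketch for brevity.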