On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
December 8, 2025
Authors: Charlie Zhang, Graham Neubig, Xiang Yue
cs.AI
Abstract
Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL's effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model's edge of competence: tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared with RL alone, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies.
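To make the setup the abstract describes more concrete, the sketch below illustrates two of its ingredients: a toy synthetic task built by composing explicit atomic operations (yielding a parseable step-by-step trace and varied surface wordings) and the standard unbiased pass@k estimator, one common way to compute a metric such as pass@128. The operation set, templates, and helper names (ATOMIC_OPS, make_task, pass_at_k) are illustrative assumptions, not the paper's actual benchmark or evaluation code.

```python
import random
from math import comb

# Illustrative atomic operations; a task of depth d chains d of them in order.
ATOMIC_OPS = {
    "add3": lambda x: x + 3,
    "double": lambda x: x * 2,
    "negate": lambda x: -x,
}

# Different surface contexts for the same underlying composition
# (a stand-in for the paper's "contextual generalization" axis).
TEMPLATES = [
    "Start with {x} and apply, in order: {ops}. What is the result?",
    "Let v = {x}. After the steps {ops}, what is v?",
]

def make_task(depth: int, seed: int = 0):
    """Sample a composition of `depth` atomic ops; return (prompt, answer, trace)."""
    rng = random.Random(seed)
    ops = [rng.choice(list(ATOMIC_OPS)) for _ in range(depth)]
    x = rng.randint(-10, 10)
    value, trace = x, []
    for name in ops:
        value = ATOMIC_OPS[name](value)
        trace.append(f"{name} -> {value}")  # parseable step-by-step trace
    prompt = rng.choice(TEMPLATES).format(x=x, ops=", ".join(ops))
    return prompt, value, trace

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

if __name__ == "__main__":
    prompt, answer, trace = make_task(depth=4, seed=7)
    print(prompt)
    print("answer:", answer, "| trace:", trace)
    # e.g. 256 samples per problem, 5 of them correct -> estimated pass@128
    print("pass@128 ≈", pass_at_k(n=256, c=5, k=128))
```

Increasing `depth` beyond what training data cover corresponds to the extrapolative axis, while swapping templates corresponds to the contextual axis; sampling many completions and reporting pass@128 (rather than single-sample accuracy) is what lets the authors distinguish genuine capability gains from improved sampling of abilities the model already had.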