

On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

December 8, 2025
Authors: Charlie Zhang, Graham Neubig, Xiang Yue
cs.AI

Abstract

Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL's effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model's edge of competence, i.e., tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared with RL alone, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies.
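The abstract quantifies true capability gains with pass@128. For reference, the sketch below shows the standard unbiased pass@k estimator (Chen et al., 2021) that such metrics are commonly computed with; the `pass_at_k` helper and the sample counts in the example are illustrative assumptions, not the paper's own evaluation code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples drawn per problem
    c: number of those samples that solve the problem
    k: attempt budget being scored (e.g., k = 128)
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    # 1 - P(a random k-subset of the n samples contains no correct solution)
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 256 samples per task, 12 correct -> estimated pass@128
print(pass_at_k(n=256, c=12, k=128))
```

The estimate is averaged over all evaluation problems; a model shows a "true capability gain" in the paper's sense when this large-budget metric improves, not merely pass@1.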