PRISM: Demystifying Retention and Interaction in Mid-Training
March 17, 2026
Authors: Bharat Runwal, Ashish Agrawal, Anurag Roy, Rameswar Panda
cs.AI
Abstract
We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks, while preserving general performance. The full PRISM-to-RL pipeline improves the macro-average across six reasoning benchmarks from under 12 to 29-42 (a 3-4x improvement), whereas RL applied directly to most base models is substantially less effective, with AIME scores near zero. Data composition matters most at mid-training, not at RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces differences of less than 2 points. Mechanistically, mid-training densely restructures over 90% of model weights, while RL makes sparse, front-loaded refinements to approximately 5% of parameters. Representation analysis with centered kernel alignment (CKA) confirms that RL consistently preserves mid-training's representational geometry (CKA above 0.998) across architectures. Crucially, RL applies identical weight changes regardless of starting point, yet succeeds only on mid-trained models, consistent with mid-training placing the model in a configuration from which RL can effectively improve performance. Our results demonstrate that retention-aware mid-training reliably enhances reasoning, and we provide practical guidance for designing robust mid-training pipelines.
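The representation analysis above compares activations before and after RL using centered kernel alignment. As a minimal illustration of how such a comparison works (this is a generic linear-CKA sketch, not the paper's code; the function name `linear_cka` and the matrix shapes are assumptions), CKA between two sets of activations on the same inputs can be computed as:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices.

    X, Y: arrays of shape (n_samples, n_features) holding activations of
    two models (e.g. before and after RL) on the same inputs; the feature
    dimensions may differ. Returns a similarity in [0, 1], where 1 means
    the representational geometries match up to rotation and scaling.
    """
    # Center each feature dimension across samples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return numerator / denominator
```

Because linear CKA is invariant to orthogonal transformations and isotropic scaling of the feature space, a value above 0.998, as reported here, indicates that RL leaves the mid-trained representational geometry essentially unchanged.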