
PRISM: Demystifying Retention and Interaction in Mid-Training

March 17, 2026
作者: Bharat Runwal, Ashish Agrawal, Anurag Roy, Rameswar Panda
cs.AI

Abstract

We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance. The full PRISM to RL pipeline improves macro-average across six reasoning benchmarks from under 12 to 29-42 (a 3-4x improvement), whereas RL applied directly to most of the base models remains substantially less effective, with AIME scores near zero. Data composition matters most at mid-training, not RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces less than 2 point differences. Mechanistically, mid-training densely restructures over 90% of model weights, while RL makes sparse, front-loaded refinements to approximately 5% of parameters. Representation analysis (CKA) confirms that RL consistently preserves mid-training's representational geometry (over 0.998 CKA) across architectures. Crucially, RL applies identical weight changes regardless of starting point, yet only succeeds on mid-trained models, consistent with mid-training placing the model in a configuration from which RL can effectively improve performance. Our results demonstrate that retention-aware mid-training is highly effective for reliable reasoning enhancement and provide practical guidance for designing robust mid-training pipelines.
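The abstract reports that RL preserves mid-training's representational geometry with CKA above 0.998. As an illustration only, here is a minimal sketch of linear CKA, a common variant of the metric (the paper's exact CKA formulation is not specified here, so the choice of the linear variant and the function name are assumptions):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim).

    Both representations are mean-centered per feature; the result lies in
    [0, 1], with 1 indicating identical representational geometry up to
    orthogonal transformation and isotropic scaling.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return float(numerator / denominator)

# Sanity check on synthetic activations: a representation compared
# with itself gives CKA = 1; a random linear distortion typically lowers it.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 64))
print(round(linear_cka(X, X), 4))  # 1.0
print(linear_cka(X, X @ rng.standard_normal((64, 64))) < 1.0)
```

In this framing, a CKA above 0.998 between pre- and post-RL activations on the same inputs would indicate that RL leaves the layer's representational geometry essentially unchanged, consistent with the paper's claim that RL refines rather than restructures the mid-trained model.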
PDF (March 20, 2026)