
PRISM: Demystifying Retention and Interaction in Mid-Training

March 17, 2026
作者: Bharat Runwal, Ashish Agrawal, Anurag Roy, Rameswar Panda
cs.AI

Abstract

We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance. The full PRISM to RL pipeline improves macro-average across six reasoning benchmarks from under 12 to 29-42 (a 3-4x improvement), whereas RL applied directly to most of the base models remains substantially less effective, with AIME scores near zero. Data composition matters most at mid-training, not RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces less than 2 point differences. Mechanistically, mid-training densely restructures over 90% of model weights, while RL makes sparse, front-loaded refinements to approximately 5% of parameters. Representation analysis (CKA) confirms that RL consistently preserves mid-training's representational geometry (over 0.998 CKA) across architectures. Crucially, RL applies identical weight changes regardless of starting point, yet only succeeds on mid-trained models, consistent with mid-training placing the model in a configuration from which RL can effectively improve performance. Our results demonstrate that retention-aware mid-training is highly effective for reliable reasoning enhancement and provide practical guidance for designing robust mid-training pipelines.
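The abstract reports that RL preserves mid-training's representational geometry with CKA above 0.998. As an illustration only, here is a minimal sketch of linear CKA, a common variant of the metric (the paper's exact CKA formulation is not specified here, so the choice of the linear variant and the function name are assumptions):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim).

    Both representations are mean-centered per feature; the result lies in
    [0, 1], with 1 indicating identical representational geometry up to
    orthogonal transformation and isotropic scaling.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return float(numerator / denominator)

# Sanity check on synthetic activations: a representation compared
# with itself gives CKA = 1; a random linear distortion typically lowers it.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 64))
print(round(linear_cka(X, X), 4))  # 1.0
print(linear_cka(X, X @ rng.standard_normal((64, 64))) < 1.0)
```

In this framing, a CKA above 0.998 between pre- and post-RL activations on the same inputs would indicate that RL leaves the layer's representational geometry essentially unchanged, consistent with the paper's claim that RL refines rather than restructures the mid-trained model.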
PDF (March 20, 2026)