PRISM: Demystifying Retention and Interaction in Mid-Training

Abstract

We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance. The full PRISM-to-RL pipeline raises the macro-average score across six reasoning benchmarks from under 12 to 29-42 (a 3-4x improvement), whereas RL applied directly to most of the base models is substantially less effective, with AIME scores near zero. Data composition matters most at mid-training, not at the RL stage: including science data during mid-training unlocks GPQA-Diamond gains of +17 to +28 points during RL, while changing the RL data mix produces differences of less than 2 points. Mechanistically, mid-training densely restructures over 90% of model weights, while RL makes sparse, front-loaded refinements to approximately 5% of parameters. Representation analysis (CKA) confirms that RL consistently preserves the representational geometry acquired during mid-training (CKA above 0.998) across architectures. Crucially, RL applies identical weight changes regardless of starting point, yet succeeds only on mid-trained models, which is consistent with mid-training placing the model in a configuration from which RL can effectively improve performance. Our results demonstrate that retention-aware mid-training is highly effective for reliable reasoning enhancement and provide practical guidance for designing robust mid-training pipelines.
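The two mechanistic measurements mentioned in the abstract, the share of parameters that change and the CKA similarity between representations, can be illustrated with a minimal NumPy sketch. The abstract does not state which CKA variant or which change criterion the study uses, so the following are assumptions: `linear_cka` implements the standard linear CKA formula, and `changed_fraction` uses a hypothetical relative-change threshold (`rel_tol`) to decide whether a weight counts as changed. Both function names are illustrative and do not come from the paper.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (n_samples, n_features).

    Assumption: the linear-kernel variant; the study may use a different one.
    """
    # Center each feature column so the Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def changed_fraction(before: np.ndarray, after: np.ndarray,
                     rel_tol: float = 1e-3) -> float:
    """Fraction of weights whose relative change exceeds rel_tol.

    Assumption: rel_tol is a hypothetical criterion chosen for illustration.
    """
    delta = np.abs(after - before)
    scale = np.abs(before) + 1e-12  # guard against division by zero
    return float(np.mean(delta / scale > rel_tol))
```

In this sketch, `linear_cka` would be applied to activations of the same layer collected on the same inputs from two checkpoints (for example, before and after RL), and `changed_fraction` to the corresponding weight tensors. Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, so a value close to 1 indicates preserved representational geometry rather than identical activations.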
