Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training

Abstract

The remarkable capabilities of modern large reasoning models are largely unlocked through post-training techniques such as supervised fine-tuning (SFT) and reinforcement learning. However, the architectural mechanisms behind these improvements remain largely opaque. In this work, we use circuit analysis to demonstrate that post-training for complex reasoning sparks the emergence of novel, functionally specialized attention heads. These heads collectively support structured reasoning and computation. Our comparative analysis across the Qwen family and DeepSeek-distilled models reveals that these emergent heads evolve differently under different training regimes. Distillation and SFT foster a cumulative addition of stable reasoning heads. In contrast, group relative policy optimization (GRPO) operates in a dynamic search mode: relatively few attention heads are iteratively activated, evaluated, and pruned, with their survival closely tracking fluctuations in the task reward signal. Furthermore, we find that controllable think on/off models do not possess dedicated thinking heads. Instead, turning off explicit reasoning triggers a broader but less efficient set of compensatory heads. Through ablation and qualitative analyses, we connect these circuit-level dynamics to a crucial performance trade-off: strengthened heads enable sophisticated problem-solving strategies for difficult problems but can also introduce over-thinking failure modes, such as calculation errors or logical loops, on simpler tasks. These findings link circuit-level dynamics to macro-level performance and identify an inherent tension in which gains in complex reasoning come at the cost of reliability in elementary computations. More broadly, our work points to future directions for training policy design, emphasizing the need to balance the development of effective reasoning strategies with the assurance of reliable, flawless execution.
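As a rough illustration of what a head-level ablation involves, the sketch below zeroes out individual heads in a toy causal multi-head attention layer and measures how much the layer output changes. The abstract does not describe the exact ablation or circuit-analysis procedure, so everything here is an assumption made for illustration: the use of zero-ablation, the toy layer, and the hypothetical names `multi_head_attention` and `head_ablation_effects`.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, ablate=()):
    """Toy causal multi-head self-attention (illustrative, not a real model).

    x:            (seq, d_model) token representations
    w_q/w_k/w_v:  (n_heads, d_model, d_head) per-head projections
    w_o:          (n_heads, d_head, d_model) per-head output projections
    ablate:       indices of heads whose contribution is set to zero
    """
    seq = x.shape[0]
    n_heads, _, d_head = w_q.shape
    # True above the diagonal marks future positions that must not be attended to.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    out = np.zeros_like(x)
    for h in range(n_heads):
        if h in ablate:
            continue  # zero-ablation: this head adds nothing to the output
        q, k, v = x @ w_q[h], x @ w_k[h], x @ w_v[h]
        scores = (q @ k.T) / np.sqrt(d_head)
        scores = np.where(future, -np.inf, scores)
        out += softmax(scores) @ v @ w_o[h]
    return out


def head_ablation_effects(x, w_q, w_k, w_v, w_o):
    """Return, per head, the norm of the output change when that head is removed."""
    base = multi_head_attention(x, w_q, w_k, w_v, w_o)
    effects = []
    for h in range(w_q.shape[0]):
        ablated = multi_head_attention(x, w_q, w_k, w_v, w_o, ablate=(h,))
        effects.append(float(np.linalg.norm(base - ablated)))
    return effects


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 8))                       # 5 tokens, d_model = 8
    w_q, w_k, w_v = (rng.normal(size=(4, 8, 2)) for _ in range(3))
    w_o = rng.normal(size=(4, 2, 8))
    for h, effect in enumerate(head_ablation_effects(x, w_q, w_k, w_v, w_o)):
        print(f"head {h}: output change {effect:.3f}")
```

In a real study the quantity compared before and after ablation would typically be task performance rather than a raw output norm, and the heads would belong to a trained model rather than random weights.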
