Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training
September 30, 2025
Authors: Yein Park, Minbyul Jeong, Jaewoo Kang
cs.AI
Abstract
The remarkable capabilities of modern large reasoning models are largely
unlocked through post-training techniques such as supervised fine-tuning (SFT) and
reinforcement learning. However, the architectural mechanisms behind such
improvements remain largely opaque. In this work, we use circuit analysis to
demonstrate that post-training for complex reasoning sparks the emergence of
novel, functionally specialized attention heads. These heads collectively
support structured reasoning and computation. Our comparative analysis across
Qwen model families and a DeepSeek-distilled model reveals that these emergent heads
evolve differently under different training regimes. Distillation and SFT
foster a cumulative addition of stable reasoning heads. In contrast, group
relative policy optimization operates in a dynamic search mode: relatively few
attention heads are iteratively activated, evaluated, and pruned, with their
survival closely tracking fluctuations in the task reward signal. Furthermore,
we find that controllable think on/off models do not possess dedicated thinking
heads. Instead, turning off explicit reasoning triggers a broader but less
efficient set of compensatory heads. Through ablation and qualitative analyses,
we connect these circuit-level dynamics to a crucial performance trade-off:
strengthened heads enable sophisticated problem-solving strategies for
difficult problems but can also introduce over-thinking failure modes, such as
calculation errors or logical loops on simpler tasks. These findings link
circuit-level dynamics to macro-level performance and identify an inherent
tension: gains in complex reasoning can come at the cost of reliable
elementary computation.
More broadly, our work points to future directions for training policy design,
emphasizing the need to balance the development of effective reasoning
strategies with the assurance of reliable, flawless execution.
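
As a rough illustration of the head-ablation probes the abstract alludes to, the following minimal, self-contained PyTorch sketch (ours, not the authors' code; the toy attention module, its dimensions, and the ablated head index are hypothetical) shows how zeroing a single attention head's output changes a layer's computation:

import torch
import torch.nn as nn

class ToyMultiHeadAttention(nn.Module):
    # Simplified multi-head self-attention, for illustration only.
    def __init__(self, d_model=64, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, ablate_heads=()):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each projection to (batch, heads, tokens, d_head).
        def split(z):
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        head_out = attn @ v  # (batch, heads, tokens, d_head)
        # Ablation: zero the chosen heads' outputs before they are mixed back,
        # mimicking the removal of a functionally specialized head.
        for h in ablate_heads:
            head_out[:, h] = 0.0
        return self.out(head_out.transpose(1, 2).reshape(b, t, d))

torch.manual_seed(0)
layer = ToyMultiHeadAttention()
x = torch.randn(2, 5, 64)
full = layer(x)                       # all heads active
ablated = layer(x, ablate_heads=[3])  # head 3 stands in for a "reasoning head"
print("mean |delta| after ablating head 3:", (full - ablated).abs().mean().item())

In the paper's setting, the analogous probe would presumably target heads that emerge during post-training and compare task performance with and without them, rather than raw output differences on random inputs.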