Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training
September 30, 2025
Authors: Yein Park, Minbyul Jeong, Jaewoo Kang
cs.AI
Abstract
The remarkable capabilities of modern large reasoning models are largely
unlocked through post-training techniques such as supervised fine-tuning (SFT) and
reinforcement learning. However, the architectural mechanisms behind such
improvements remain largely opaque. In this work, we use circuit analysis to
demonstrate that post-training for complex reasoning sparks the emergence of
novel, functionally specialized attention heads. These heads collectively
support structured reasoning and computation. Our comparative analysis across
Qwen model families and a DeepSeek-distilled model reveals that these emergent heads
evolve differently under different training regimes. Distillation and SFT
foster a cumulative addition of stable reasoning heads. In contrast, group
relative policy optimization operates in a dynamic search mode: relatively few
attention heads are iteratively activated, evaluated, and pruned, with their
survival closely tracking fluctuations in the task reward signal. Furthermore,
we find that controllable think on/off models do not possess dedicated thinking
heads. Instead, turning off explicit reasoning triggers a broader but less
efficient set of compensatory heads. Through ablation and qualitative analyses,
we connect these circuit-level dynamics to a crucial performance trade-off:
strengthened heads enable sophisticated problem-solving strategies for
difficult problems but can also introduce over-thinking failure modes, such as
calculation errors or logical loops on simpler tasks. These findings link
circuit-level dynamics to macro-level performance and identify an inherent
tension: gains in complex reasoning can come at the cost of reliable
elementary computation.
More broadly, our work points to future directions for training policy design,
emphasizing the need to balance the development of effective reasoning
strategies with the assurance of reliable, flawless execution.
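
As a rough illustration of the head-ablation probes the abstract alludes to, the following minimal, self-contained PyTorch sketch (ours, not the authors' code; the toy attention module, its dimensions, and the ablated head index are hypothetical) shows how zeroing a single attention head's output changes a layer's computation:

import torch
import torch.nn as nn

class ToyMultiHeadAttention(nn.Module):
    # Simplified multi-head self-attention, for illustration only.
    def __init__(self, d_model=64, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, ablate_heads=()):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each projection to (batch, heads, tokens, d_head).
        def split(z):
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        head_out = attn @ v  # (batch, heads, tokens, d_head)
        # Ablation: zero the chosen heads' outputs before they are mixed back,
        # mimicking the removal of a functionally specialized head.
        for h in ablate_heads:
            head_out[:, h] = 0.0
        return self.out(head_out.transpose(1, 2).reshape(b, t, d))

torch.manual_seed(0)
layer = ToyMultiHeadAttention()
x = torch.randn(2, 5, 64)
full = layer(x)                       # all heads active
ablated = layer(x, ablate_heads=[3])  # head 3 stands in for a "reasoning head"
print("mean |delta| after ablating head 3:", (full - ablated).abs().mean().item())

In the paper's setting, the analogous probe would presumably target heads that emerge during post-training and compare task performance with and without them, rather than raw output differences on random inputs.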