Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training

September 30, 2025
作者: Yein Park, Minbyul Jeong, Jaewoo Kang
cs.AI

Abstract

The remarkable capabilities of modern large reasoning models are largely unlocked through post-training techniques such as supervised fine-tuning (SFT) and reinforcement learning. However, the architectural mechanisms behind such improvements remain largely opaque. In this work, we use circuit analysis to demonstrate that post-training for complex reasoning sparks the emergence of novel, functionally specialized attention heads. These heads collectively support structured reasoning and computation. Our comparative analysis across Qwen model families and DeepSeek-distilled models reveals that these emergent heads evolve differently under different training regimes. Distillation and SFT foster a cumulative addition of stable reasoning heads. In contrast, group relative policy optimization operates in a dynamic search mode: relatively few attention heads are iteratively activated, evaluated, and pruned, with their survival closely tracking fluctuations in the task reward signal. Furthermore, we find that controllable think-on/off models do not possess dedicated thinking heads. Instead, turning off explicit reasoning triggers a broader but less efficient set of compensatory heads. Through ablation and qualitative analyses, we connect these circuit-level dynamics to a crucial performance trade-off: strengthened heads enable sophisticated problem-solving strategies on difficult problems but can also introduce over-thinking failure modes, such as calculation errors or logical loops on simpler tasks. These findings connect circuit-level dynamics to macro-level performance, identifying an inherent tension in which complex reasoning comes at the cost of elementary computation. More broadly, our work points to future directions for training policy design, emphasizing the need to balance the development of effective reasoning strategies with the assurance of reliable, flawless execution.
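
The ablation analyses in the abstract operate at the level of individual attention heads. As a rough illustration of what head-level ablation can look like in practice, here is a minimal sketch assuming a HuggingFace Qwen2-style checkpoint; the model name, layer index, and head index are hypothetical placeholders, and this is not the authors' published pipeline. It exploits the fact that, in such architectures, the attention output projection `o_proj` receives the concatenation of per-head outputs, so zeroing one head's slice before `o_proj` removes that head's contribution to the residual stream.

```python
# Minimal sketch of single-head ablation (illustrative only; not the
# paper's exact method). Assumes a Qwen2-style HuggingFace model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # hypothetical checkpoint choice
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.eval()

def ablate_head(layer_idx: int, head_idx: int):
    """Zero out one attention head's output via a forward pre-hook on o_proj."""
    attn = model.model.layers[layer_idx].self_attn
    head_dim = attn.head_dim

    def pre_hook(module, args):
        # Input to o_proj: (batch, seq_len, num_heads * head_dim)
        hidden = args[0].clone()
        hidden[..., head_idx * head_dim : (head_idx + 1) * head_dim] = 0.0
        return (hidden,)

    return attn.o_proj.register_forward_pre_hook(pre_hook)

# Hypothetical layer/head indices; in practice these would come from a
# circuit-analysis pass that scores heads for reasoning behavior.
handle = ablate_head(layer_idx=12, head_idx=3)

inputs = tokenizer("What is 17 * 24?", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unablated model
```

Comparing outputs with and without such a hook, on simple arithmetic versus multi-step problems, is one way to probe the over-thinking trade-off the abstract describes.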