Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training

Abstract

The remarkable capabilities of modern large reasoning models are largely unlocked through post-training techniques such as supervised fine-tuning (SFT) and reinforcement learning. However, the architectural mechanisms behind these improvements remain largely opaque. In this work, we use circuit analysis to demonstrate that post-training for complex reasoning sparks the emergence of novel, functionally specialized attention heads. These heads collectively support structured reasoning and computation. Our comparative analysis across the Qwen families and a DeepSeek-distilled model reveals that these emergent heads evolve differently under different training regimes. Distillation and SFT foster a cumulative addition of stable reasoning heads. In contrast, group relative policy optimization (GRPO) operates in a dynamic search mode: relatively few attention heads are iteratively activated, evaluated, and pruned, with their survival closely tracking fluctuations in the task reward signal. Furthermore, we find that controllable think on/off models do not possess dedicated thinking heads. Instead, turning off explicit reasoning activates a broader but less efficient set of compensatory heads. Through ablation and qualitative analyses, we connect these circuit-level dynamics to a crucial performance trade-off: strengthened heads enable sophisticated problem-solving strategies for difficult problems, but they can also introduce over-thinking failure modes, such as calculation errors or logical loops, on simpler tasks. These findings link circuit-level dynamics to macro-level performance and identify an inherent tension in which complex reasoning comes at the cost of elementary computation. More broadly, our work points to future directions for training policy design, emphasizing the need to balance the development of effective reasoning strategies with the assurance of reliable, flawless execution.
