HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

Abstract

The quadratic complexity of attention is a critical bottleneck for long-context processing and has spurred interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct an interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we obtain high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs on long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches the long-context performance of a 3:1 layer-wise hybrid at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.
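
To make the idea of hybridizing along the head axis concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes details the abstract does not specify: causal softmax attention for FA heads, an `elu(x) + 1` feature map for LA heads, and per-head RMS normalization as a stand-in for the scale-normalized fusion module. The names `hybrid_heads`, `fa_heads`, and `rms_normalize` are hypothetical, and the choice of which head keeps FA is arbitrary here rather than interpretability-driven.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def full_attention_head(q, k, v):
    """Causal softmax attention for one head; q, k, v are lists of T vectors."""
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Scores against all positions up to t: quadratic cost in sequence length.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores)
        out.append([sum(w[s] * v[s][j] for s in range(t + 1))
                    for j in range(len(v[0]))])
    return out

def phi(x):
    # Assumed feature map elu(x) + 1, which keeps every feature positive.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_attention_head(q, k, v):
    """Causal linear attention for one head using a fixed-size running state."""
    d, dv = len(q[0]), len(v[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of phi(k) v^T
    z = [0.0] * d                       # running sum of phi(k)
    out = []
    for t in range(len(q)):
        fk, fq = phi(k[t]), phi(q[t])
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[t][j]
        denom = sum(fq[i] * z[i] for i in range(d))
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out

def rms_normalize(vec, eps=1e-6):
    # Assumed fusion step: bring each head output to a comparable scale.
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def hybrid_heads(q, k, v, fa_heads):
    """q, k, v are indexed [head][t][dim]; fa_heads is the set of heads kept as FA."""
    num_heads, T = len(q), len(q[0])
    per_head = []
    for h in range(num_heads):
        fn = full_attention_head if h in fa_heads else linear_attention_head
        per_head.append([rms_normalize(o) for o in fn(q[h], k[h], v[h])])
    # Concatenate the normalized head outputs along the head axis at each step.
    return [[x for h in range(num_heads) for x in per_head[h][t]]
            for t in range(T)]

# Toy usage: 8 heads with one FA head gives a 7:1 LA-to-FA ratio.
H, T, D = 8, 4, 2
q = [[[math.sin(h + t + i) for i in range(D)] for t in range(T)] for h in range(H)]
k = [[[math.cos(h - t + i) for i in range(D)] for t in range(T)] for h in range(H)]
v = [[[math.cos(0.3 * h + 0.7 * t + j) for j in range(D)] for t in range(T)] for h in range(H)]
out = hybrid_heads(q, k, v, fa_heads={0})
```

In this sketch only the FA head needs the full key and value history, while each LA head carries a state of fixed size, which is why the LA-to-FA head ratio governs how memory grows with context length. A real model would follow the concatenation with an output projection and would select `fa_heads` by the retrieval-critical criterion described in the abstract.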