HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

Abstract

The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we achieve high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs in long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches a 3:1 layer-wise hybrid's long-context performance at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.
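The head-axis hybridization described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not specify the linear-attention feature map, the form of the fusion module, or how retrieval-critical heads are scored, so the `elu(x) + 1` feature map, the per-head RMS normalization standing in for "scale-normalized fusion", and the hand-picked `fa_heads` set are all assumptions made for illustration, as are the function names.

```python
import math

def softmax_head(q, k, v):
    """Causal full (softmax) attention for one head; q, k, v are [time][dim] lists."""
    d = len(q[0])
    out = []
    for t in range(len(q)):
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)                      # subtract the max for numerical stability
        w = [math.exp(x - m) for x in scores]
        z = sum(w)
        out.append([sum(w[s] * v[s][j] for s in range(t + 1)) / z
                    for j in range(len(v[0]))])
    return out

def phi(x):
    """Assumed feature map elu(x) + 1, which is strictly positive."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_head(q, k, v):
    """Causal linear attention for one head with a constant-size running state."""
    d, dv = len(q[0]), len(v[0])
    S = [[0.0] * dv for _ in range(d)]       # running sum of phi(k_s) v_s^T
    z = [0.0] * d                            # running sum of phi(k_s)
    out = []
    for t in range(len(q)):
        fk, fq = phi(k[t]), phi(q[t])
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[t][j]
        denom = sum(fq[i] * z[i] for i in range(d))
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out

def rms_normalize(rows, eps=1e-6):
    """Assumed stand-in for scale-normalized fusion: unit RMS per head output."""
    out = []
    for r in rows:
        rms = math.sqrt(sum(x * x for x in r) / len(r) + eps)
        out.append([x / rms for x in r])
    return out

def hybrid_heads(q, k, v, fa_heads):
    """q, k, v: [head][time][dim]. Heads in fa_heads keep FA; the rest use LA.

    Returns [time][num_heads * dim]: normalized head outputs concatenated.
    """
    per_head = []
    for h in range(len(q)):
        attend = softmax_head if h in fa_heads else linear_head
        per_head.append(rms_normalize(attend(q[h], k[h], v[h])))
    return [[x for h in range(len(q)) for x in per_head[h][t]]
            for t in range(len(q[0]))]
```

Calling `hybrid_heads(q, k, v, fa_heads={0})` with eight heads corresponds to the 7:1 LA-to-FA ratio mentioned above. In this sketch the normalization puts both head types on a common scale before concatenation, so a downstream output projection would not see systematically different magnitudes from FA and LA heads.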