HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

Abstract

The quadratic complexity of attention is a critical bottleneck for long-context processing and has spurred interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct an interpretability analysis and observe that layers exhibit block-wise functional similarity, whereas individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. Using a three-stage transfer pipeline based on parameter reuse and distillation, we obtain high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs on long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, HydraHead at a 7:1 LA-to-FA ratio matches the long-context performance of a 3:1 layer-wise hybrid. Crucially, trained on only 15B tokens, HydraHead improves over the baseline by more than 69% at a context length of 512K, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.
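To make the idea of hybridizing along the head axis concrete, the following is a minimal pure-Python sketch, not the HydraHead implementation. It rests on several assumptions that the abstract does not specify: the linear-attention heads use an `elu(x) + 1` feature map, the "scale-normalized fusion" is approximated by a parameter-free per-head RMS normalization before concatenation, and the set of FA heads (`fa_heads`) is fixed by hand rather than chosen by an interpretability analysis. All function names are hypothetical. With 8 heads and a single FA head, the layout corresponds to a 7:1 LA-to-FA ratio, so only 1/8 of the heads would need a key-value cache that grows with context length.

```python
import math
import random


def full_attention(q, k, v):
    """Causal softmax attention for one head; q, k, v are lists of T vectors."""
    d = len(q[0])
    out = []
    for t in range(len(q)):
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(x - m) for x in scores]
        z = sum(w)
        out.append([sum(w[s] * v[s][j] for s in range(t + 1)) / z
                    for j in range(len(v[0]))])
    return out


def phi(x):
    """Assumed feature map elu(x) + 1, which keeps every feature positive."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def linear_attention(q, k, v):
    """Causal linear attention with a running state whose size is independent of T."""
    d, dv = len(q[0]), len(v[0])
    state = [[0.0] * dv for _ in range(d)]  # running sum of phi(k) v^T
    norm = [0.0] * d                        # running sum of phi(k)
    out = []
    for t in range(len(q)):
        fk, fq = phi(k[t]), phi(q[t])
        for i in range(d):
            norm[i] += fk[i]
            for j in range(dv):
                state[i][j] += fk[i] * v[t][j]
        denom = sum(fq[i] * norm[i] for i in range(d))
        out.append([sum(fq[i] * state[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out


def rms_normalize(x, eps=1e-6):
    """Rescale a vector to (approximately) unit root-mean-square."""
    rms = math.sqrt(sum(xi * xi for xi in x) / len(x) + eps)
    return [xi / rms for xi in x]


def hybrid_heads(q, k, v, fa_heads):
    """q, k, v have shape [H][T][d]; returns [T][H*d] fused head outputs."""
    num_heads, seq_len = len(q), len(q[0])
    per_head = []
    for h in range(num_heads):
        attn = full_attention if h in fa_heads else linear_attention
        # Assumed fusion step: normalize each head so FA and LA outputs share a scale.
        per_head.append([rms_normalize(row) for row in attn(q[h], k[h], v[h])])
    return [[x for h in range(num_heads) for x in per_head[h][t]]
            for t in range(seq_len)]


def random_heads(num_heads, seq_len, dim):
    return [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
            for _ in range(num_heads)]


random.seed(0)
H, T, d = 8, 5, 4
q, k, v = random_heads(H, T, d), random_heads(H, T, d), random_heads(H, T, d)
fa_heads = {0}  # hypothetical choice: 1 FA head out of 8, i.e. a 7:1 LA-to-FA ratio
out = hybrid_heads(q, k, v, fa_heads)
```

In this sketch the per-head normalization is what lets softmax-weighted and kernel-weighted outputs be concatenated on a comparable scale; a trained model would more plausibly use learned gains or a gating mechanism, which the abstract does not detail.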