Rethinking the Role of Efficient Attention in Hybrid Architectures

Abstract

Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we systematically analyze hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how fast long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. Second, at the mechanistic level, we show that long-range retrieval is carried mainly by full attention, whereas efficient attention shapes the optimization trajectory of the full-attention layers. This explains a counter-intuitive phenomenon we call Large-Window Laziness: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE (no positional encoding) to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.
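To make the architecture in the last finding concrete, the following is a minimal sketch of a hybrid layer layout in which SWA layers keep a positional encoding and full-attention layers use NoPE. It is illustrative only: the names `LayerSpec`, `build_hybrid`, and `can_attend`, the interleaving ratio `full_every`, and the assumption that SWA layers retain a positional encoding are hypothetical choices for this example, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LayerSpec:
    kind: str                 # "full" or "swa"
    window: Optional[int]     # SWA window size in tokens; None for full attention
    use_pos_encoding: bool    # False means NoPE in this layer


def build_hybrid(num_layers: int, full_every: int, window: int) -> List[LayerSpec]:
    """Hypothetical layout: every `full_every`-th layer is full attention
    with NoPE; all other layers are SWA with a positional encoding."""
    layers = []
    for i in range(num_layers):
        if (i + 1) % full_every == 0:
            layers.append(LayerSpec("full", None, use_pos_encoding=False))
        else:
            layers.append(LayerSpec("swa", window, use_pos_encoding=True))
    return layers


def can_attend(spec: LayerSpec, query_pos: int, key_pos: int) -> bool:
    """Causal visibility of `key_pos` from `query_pos` within one layer."""
    if key_pos > query_pos:
        return False              # no attention to future tokens
    if spec.kind == "full":
        return True               # full attention sees the whole prefix
    return query_pos - key_pos < spec.window  # SWA sees only the last `window` tokens


layers = build_hybrid(num_layers=4, full_every=4, window=3)
```

In this sketch, a key that lies outside the window of an SWA layer is directly visible only in the full-attention layer, which is consistent with the finding that long-range retrieval is carried mainly by full attention.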