Rethinking the Role of Efficient Attention in Hybrid Architectures

June 13, 2026
Authors: Ziqing Qiao, Yinuo Xu, Chaojun Xiao, Zhou Su, Zihan Zhou, Yingfa Chen, Xiaoyue Xu, Xu Han, Zhiyuan Liu
cs.AI

Abstract

Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we conduct a systematic analysis across hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how fast long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. Second, mechanistically, we show that long-range retrieval is mainly carried by full attention, whereas efficient attention shapes its optimization trajectory. This explains a counter-intuitive phenomenon we call Large-Window Laziness: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.
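
To make the final design point concrete, the following is a minimal sketch of what a small-window SWA hybrid with NoPE on only the full-attention layers could look like as a layer layout. It is not the authors' implementation: the names `LayerSpec`, `build_hybrid`, and `attention_mask`, the one-full-layer-in-four ratio, the window size, and the use of RoPE on the SWA layers are all assumptions chosen for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LayerSpec:
    """Hypothetical description of one attention layer in a hybrid stack."""
    kind: str               # "full" or "swa"
    window: Optional[int]   # SWA window size; None for full attention
    use_rope: bool          # False means NoPE (no positional encoding)


def build_hybrid(num_layers: int, full_every: int, window: int) -> List[LayerSpec]:
    """Place one full-attention layer after every (full_every - 1) SWA layers.

    Assumption: SWA layers keep a positional encoding (RoPE here), while
    full-attention layers use NoPE, following the abstract's description.
    """
    layers = []
    for i in range(num_layers):
        if (i + 1) % full_every == 0:
            layers.append(LayerSpec(kind="full", window=None, use_rope=False))
        else:
            layers.append(LayerSpec(kind="swa", window=window, use_rope=True))
    return layers


def attention_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    With window=None this is a plain causal mask (full attention); otherwise
    each query sees only itself and the (window - 1) preceding positions.
    """
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


if __name__ == "__main__":
    for idx, spec in enumerate(build_hybrid(num_layers=8, full_every=4, window=2)):
        print(idx, spec)
```

In this layout only the full-attention layers can reach tokens outside the window, which matches the abstract's claim that long-range retrieval is mainly carried by full attention.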