Rethinking the Role of Efficient Attention in Hybrid Architectures
June 13, 2026
Authors: Ziqing Qiao, Yinuo Xu, Chaojun Xiao, Zhou Su, Zihan Zhou, Yingfa Chen, Xiaoyue Xu, Xu Han, Zhiyuan Liu
cs.AI
Abstract
Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we conduct a systematic analysis of hybrid architectures from three perspectives: scaling behavior, mechanistic analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how fast long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. Second, mechanistically, we show that long-range retrieval is mainly carried by full attention, whereas efficient attention shapes its optimization trajectory. This explains a counter-intuitive phenomenon we call Large-Window Laziness: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE (no positional encoding) to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.
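To make the layout described in the abstract concrete, the following is a minimal sketch of a hybrid layer stack in which sliding-window layers keep a positional encoding and full-attention layers use NoPE (no positional encoding). It is not the authors' implementation. The names `LayerSpec`, `causal_mask`, and `build_hybrid`, the ratio of one full-attention layer per four layers, and the window size of 128 are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LayerSpec:
    """Hypothetical description of one attention layer in a hybrid stack."""
    kind: str                   # "full" or "swa"
    window: Optional[int]       # SWA window size; None for full attention
    use_pos_encoding: bool      # False means NoPE (no positional encoding)


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """Return mask[i][j], True when query position i may attend to key j.

    With window=None this is an ordinary causal mask (full attention).
    With a window, each query sees only itself and the window-1 keys before it.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i                      # causal constraint
            if window is not None:
                visible = visible and (i - j < window)  # locality constraint
            row.append(visible)
        mask.append(row)
    return mask


def build_hybrid(num_layers: int, full_every: int, window: int) -> List[LayerSpec]:
    """Assumed layout: every `full_every`-th layer is full attention with NoPE;
    all other layers are small-window SWA that keep a positional encoding."""
    layers = []
    for idx in range(num_layers):
        if (idx + 1) % full_every == 0:
            layers.append(LayerSpec("full", None, use_pos_encoding=False))
        else:
            layers.append(LayerSpec("swa", window, use_pos_encoding=True))
    return layers


if __name__ == "__main__":
    for spec in build_hybrid(num_layers=8, full_every=4, window=128):
        print(spec)
```

In this sketch only the full-attention layers can attend across the whole sequence, which mirrors the abstract's claim that long-range retrieval is carried mainly by full attention while the SWA layers handle local context.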