Rethinking the Role of Efficient Attention in Hybrid Architectures

Abstract

Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we systematically analyze hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how quickly long-context capability emerges, while different hybrids eventually converge to comparable long-context performance given sufficient training. Second, mechanistically, we show that long-range retrieval is carried mainly by the full-attention layers, whereas the efficient-attention modules shape the optimization trajectory of those layers. This explains a counter-intuitive phenomenon we call Large-Window Laziness: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE (no positional encoding) to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.
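To make the architectural terms concrete, the following is a minimal sketch, not the configuration used in this work. It contrasts a full causal attention mask with a sliding-window mask and lays out a hypothetical hybrid stack in which only the full-attention layers use NoPE. The function names, the `full_every` interleaving ratio, and the use of RoPE on the SWA layers are all illustrative assumptions.

```python
def causal_mask(seq_len):
    """Full causal attention: query i may attend to every key j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Causal SWA: query i attends only to the `window` most recent keys,
    itself included, so keys at distance >= window are masked out."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def layer_plan(num_layers, full_every, window):
    """Hypothetical hybrid layout (an assumption, not taken from the paper):
    every `full_every`-th layer is full attention with NoPE, and all other
    layers are SWA with a positional encoding (RoPE assumed here)."""
    plan = []
    for idx in range(num_layers):
        if (idx + 1) % full_every == 0:
            plan.append({"layer": idx, "kind": "full", "pos": "nope", "window": None})
        else:
            plan.append({"layer": idx, "kind": "swa", "pos": "rope", "window": window})
    return plan


if __name__ == "__main__":
    full = causal_mask(6)
    swa = sliding_window_mask(6, window=2)
    # The last query can see the first key directly only under full attention.
    print("full attention reaches key 0:", full[5][0])
    print("SWA (window=2) reaches key 0:", swa[5][0])
    for layer in layer_plan(num_layers=4, full_every=4, window=2):
        print(layer)
```

In this sketch, a single SWA layer cannot connect a query to a key that lies outside its window, which is consistent with the observation that long-range retrieval falls mainly to the full-attention layers.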