Rethinking the Role of Efficient Attention in Hybrid Architectures

Abstract

Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. How these efficient modules shape model capabilities, however, remains poorly understood. To address this gap, we systematically analyze hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how quickly long-context capability emerges, while different hybrids eventually converge to comparable long-context performance given sufficient training. Second, mechanistically, we show that long-range retrieval is carried mainly by full attention, whereas efficient attention shapes the optimization trajectory of that retrieval. This explains a counter-intuitive phenomenon we call Large-Window Laziness: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE (no positional encoding) to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.
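To make the architectural terms concrete, the following minimal sketch contrasts a full causal attention mask with a causal sliding-window mask, and lays out a hybrid layer plan in which only the full-attention layers drop positional encoding. It is an illustration under stated assumptions, not the configuration used in this work: the interleaving ratio (`full_every`), the window size, and all names (`LayerSpec`, `build_hybrid_plan`, `use_positional_encoding`) are hypothetical, and the abstract does not specify the layer layout or the kind of positional encoding that the SWA layers keep.

```python
from dataclasses import dataclass
from typing import List, Optional


def causal_mask(seq_len: int) -> List[List[bool]]:
    """Full causal attention: query i may attend to every key j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Causal SWA: query i attends only to the last `window` keys, itself included."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def visible_keys(mask: List[List[bool]], query: int) -> List[int]:
    """Key positions that the given query position is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[query]) if allowed]


@dataclass(frozen=True)
class LayerSpec:
    kind: str                      # "full" or "swa"
    window: Optional[int]          # None for full attention
    use_positional_encoding: bool  # False means NoPE


def build_hybrid_plan(num_layers: int, full_every: int, window: int) -> List[LayerSpec]:
    """Hypothetical layout: every `full_every`-th layer is full attention with NoPE;
    all other layers are SWA and keep their positional encoding."""
    plan = []
    for idx in range(num_layers):
        if (idx + 1) % full_every == 0:
            plan.append(LayerSpec("full", None, use_positional_encoding=False))
        else:
            plan.append(LayerSpec("swa", window, use_positional_encoding=True))
    return plan


if __name__ == "__main__":
    print(visible_keys(causal_mask(6), 4))
    print(visible_keys(sliding_window_mask(6, 2), 4))
    for spec in build_hybrid_plan(num_layers=8, full_every=4, window=128):
        print(spec)
```

In this sketch, a query in an SWA layer can reach only the most recent `window` positions, so anything farther back has to be retrieved through the full-attention layers, which is the division of labor the abstract describes.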