Parallax: Parameterized Local Linear Attention for Language Modeling

Abstract

Large Language Models (LLMs) have become the central paradigm in artificial intelligence, yet the core computational primitive of attention has remained structurally unchanged. Local Linear Attention (LLA) is an attention mechanism derived from nonparametric statistics in the test-time regression framework. In contrast to prior research on efficient attention variants, LLA upgrades the local constant estimate in softmax attention to a local linear estimate, yielding provably superior bias-variance tradeoffs for associative memory. However, LLA has not been scaled up in LLM pretraining due to computational cost and numerical stability concerns. We introduce Parallax, a parameterized Local Linear Attention that scales to LLMs. Parallax eliminates the numerical solver in LLA and learns an extra query-like projector that probes the KV covariance. We place Parallax within a family of attention mechanisms connected by the bandwidth, the probe construction, and the affine structure. We propose a hardware-aware algorithm that increases arithmetic intensity relative to FlashAttention, shifting attention into a more compute-bound regime. Our prototype decode kernel matches or outperforms FlashAttention 2/3 across diverse batch sizes and context lengths. We pretrain Parallax at 0.6B and 1.7B scales and find consistent perplexity improvements throughout pretraining, with gains that transfer to downstream benchmarks. The advantage persists under both parameter-matched and compute-matched controls, demonstrating a Pareto improvement. We perform careful pretraining ablations and identify a novel phenomenon whereby the Muon optimizer unlocks the capacity of Parallax. To our knowledge, this is the first empirical demonstration of strong architecture-optimizer co-design for attention mechanisms.
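
The step from a local constant to a local linear estimate can be illustrated with textbook kernel regression. The sketch below is not the LLA or Parallax algorithm from this work: it is a generic NumPy illustration for a single query, and the function names, the `bandwidth` parameter, and the small `ridge` term are assumptions introduced here. The local constant estimator is the softmax-weighted mean of the values (single-query softmax attention), while the local linear estimator fits a weighted least-squares affine model around the query and returns its intercept, which requires solving a small linear system.

```python
import numpy as np


def kernel_weights(q, K, bandwidth=1.0):
    """Softmax-style kernel weights, w_i proportional to exp(q . k_i / bandwidth)."""
    scores = K @ q / bandwidth
    scores = scores - scores.max()  # subtract the max for numerical stability
    w = np.exp(scores)
    return w / w.sum()


def local_constant(q, K, V, bandwidth=1.0):
    """Local constant (Nadaraya-Watson) estimate: the weighted mean of the values."""
    w = kernel_weights(q, K, bandwidth)
    return w @ V


def local_linear(q, K, V, bandwidth=1.0, ridge=1e-8):
    """Local linear estimate: weighted least-squares fit of
    V ~ a + (k - q) B around the query; the prediction at q is the intercept a.
    The ridge term is an assumption added here to keep the solve well-posed."""
    w = kernel_weights(q, K, bandwidth)
    n = K.shape[0]
    X = np.hstack([np.ones((n, 1)), K - q])   # design matrix, shape (n, d + 1)
    XtW = X.T * w                             # weighted transpose, shape (d + 1, n)
    A = XtW @ X + ridge * np.eye(X.shape[1])  # regularised normal equations
    coef = np.linalg.solve(A, XtW @ V)        # shape (d + 1, d_v)
    return coef[0]                            # intercept row = estimate at q
```

When the values are an exactly affine function of the keys, the local linear estimate recovers the target at the query up to the ridge perturbation, whereas the local constant estimate is generally biased whenever the weighted mean of the keys differs from the query. The explicit linear solve in `local_linear` is the kind of numerical solver step that, according to the abstract, Parallax eliminates.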