Higher-order Linear Attention

Abstract

The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any n × n matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that exactly reproduces the activations of a serial recurrence. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project page: https://github.com/yifanzhang-pro/HLA.
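To make the idea of constant-size prefix statistics concrete, here is a minimal sketch in plain Python. The abstract does not spell out the statistics, so the specific form used here is an assumption made for illustration: two running sums, S (the sum of k_i k_i^T) and C (the sum of q_j v_j^T), with the per-token output q_t^T S C. The function names (`hla2_streaming`, `hla2_reference`) are hypothetical and not taken from the project repository. The sketch covers only an unnormalized second-order prefix form; it does not implement the strictly causal masked variant or the chunk-parallel scan.

```python
import random

def outer(a, b):
    """Outer product a b^T as a list of rows."""
    return [[x * y for y in b] for x in a]

def add_inplace(M, N):
    """M += N for equally shaped nested lists."""
    for row_m, row_n in zip(M, N):
        for j, val in enumerate(row_n):
            row_m[j] += val

def vec_mat(v, M):
    """Row vector times matrix: v^T M."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hla2_streaming(Q, K, V):
    """Stream over tokens keeping two fixed-size summaries (assumed form)."""
    d, dv = len(Q[0]), len(V[0])
    S = [[0.0] * d for _ in range(d)]    # running sum of k_i k_i^T, shape d x d
    C = [[0.0] * dv for _ in range(d)]   # running sum of q_j v_j^T, shape d x dv
    outputs = []
    for q, k, v in zip(Q, K, V):
        add_inplace(S, outer(k, k))
        add_inplace(C, outer(q, v))
        # q^T S C: the per-token cost depends on d and dv, not on the position t.
        outputs.append(vec_mat(vec_mat(q, S), C))
    return outputs

def hla2_reference(Q, K, V):
    """Explicit double sum over the prefix; used only to check the streaming form."""
    dv = len(V[0])
    outputs = []
    for t in range(len(Q)):
        o = [0.0] * dv
        for i in range(t + 1):
            w_i = dot(Q[t], K[i])
            for j in range(t + 1):
                w = w_i * dot(K[i], Q[j])
                for c in range(dv):
                    o[c] += w * V[j][c]
        outputs.append(o)
    return outputs

def max_abs_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))

# Small random example: sequence length n, key/query width d, value width dv.
rng = random.Random(0)
n, d, dv = 6, 3, 2
Q = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
K = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
V = [[rng.uniform(-1, 1) for _ in range(dv)] for _ in range(n)]
err = max_abs_diff(hla2_streaming(Q, K, V), hla2_reference(Q, K, V))
print(f"max abs difference: {err:.2e}")
```

Under these assumptions, the streaming loop and the explicit double sum should agree up to floating-point rounding. The streaming version stores only a d × d and a d × dv matrix regardless of sequence length, which is the sense in which no n × n matrix is materialized.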