Sessa: Selective State Space Attention

Abstract

Modern sequence modeling is dominated by two families: Transformers, whose self-attention can access arbitrary elements of the visible sequence, and structured state-space models, which propagate information through an explicit recurrent state. These mechanisms face different limitations on long contexts: when attention is diffuse, the influence of individual tokens is diluted across the effective support, while recurrent state propagation can lose long-range sensitivity unless information is actively preserved. As a result, both mechanisms struggle to preserve and selectively retrieve information over long contexts. We propose Sessa, a decoder that places attention inside a recurrent feedback path. This creates many attention-based paths through which past tokens can influence future states, rather than relying on a single attention read or a single recurrent chain. We prove that, under explicit assumptions and matched regimes, Sessa admits power-law memory tails O(ℓ^{-β}) for 0 < β < 1, with slower decay than in the corresponding Transformer and Mamba-style baselines. We further give an explicit construction that achieves this power-law rate. Under the same assumptions, Sessa is the only model class among those considered that realizes flexible selective retrieval, including profiles whose influence does not decay with distance. Consistent with this theoretical advantage, across matched experiments, Sessa achieves the strongest performance on long-context benchmarks while remaining competitive with Transformer and Mamba-style baselines on short-context language modeling.
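As a rough numerical illustration of the decay regimes contrasted above, and not the paper's construction or proof, the sketch below compares three toy influence profiles at lag ℓ: a uniform-attention dilution 1/ℓ, a geometric decay a^ℓ standing in for a recurrent state that does not actively preserve information, and a power-law tail ℓ^{-β} with 0 < β < 1. The function names, the decay factor a = 0.99, and β = 0.5 are assumptions chosen purely for illustration.

```python
# Illustrative only: toy influence profiles, not the models analyzed in the paper.
# All names and default parameters here are hypothetical.


def diffuse_attention(lag: int) -> float:
    """Weight of one token when attention is spread uniformly over `lag` tokens."""
    return 1.0 / lag


def geometric_state(lag: int, a: float = 0.99) -> float:
    """Influence surviving `lag` steps of a scalar recurrence h_t = a * h_{t-1} + x_t."""
    return a ** lag


def power_law_tail(lag: int, beta: float = 0.5) -> float:
    """Power-law memory tail lag^(-beta), with 0 < beta < 1."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return lag ** (-beta)


def compare(lags=(10, 100, 1_000, 10_000)):
    """Return one (lag, 1/lag, a^lag, lag^-beta) tuple per requested lag."""
    return [
        (lag, diffuse_attention(lag), geometric_state(lag), power_law_tail(lag))
        for lag in lags
    ]


if __name__ == "__main__":
    print(f"{'lag':>6} {'1/lag':>12} {'a^lag':>12} {'lag^-beta':>12}")
    for lag, attn, geo, power in compare():
        print(f"{lag:>6} {attn:>12.3e} {geo:>12.3e} {power:>12.3e}")
```

With these assumed parameters, the geometric profile is the largest of the three at short lags, and the power-law tail overtakes both others only at long range, which is the regime the abstract's claim concerns.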

