Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling

Abstract

The Transformer architecture, built on the self-attention mechanism, has become the de facto standard for sequence modeling tasks. However, self-attention scales quadratically with sequence length (O(N^2)), which creates a significant bottleneck when processing long contexts. In this paper, we propose the Gated Associative Memory (GAM) network, a novel, fully parallel architecture for sequence modeling whose cost is linear (O(N)) in sequence length. The GAM block replaces the self-attention layer with two parallel pathways: a causal convolution that efficiently captures local, position-dependent context, and a parallel associative memory retrieval mechanism that models global, content-based patterns. A gating mechanism fuses the two pathways dynamically, allowing the model to combine local and global information flexibly for each token. We implement GAM from scratch and conduct a rigorous comparative analysis against a standard Transformer model and a modern linear-time baseline (Mamba) on the WikiText-2 benchmark, as well as against the Transformer on the TinyStories dataset. Our experiments show that GAM trains consistently faster than both baselines and achieves superior or competitive final validation perplexity on all datasets, establishing it as a promising and efficient alternative for sequence modeling.
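The abstract names the three ingredients of a GAM block but not their exact form, so the following is only a minimal sketch under stated assumptions: the local pathway is taken to be a depthwise causal convolution, the global pathway a softmax lookup over a fixed-size bank of memory slots, and the fusion a single sigmoid gate per channel. All names (`gam_block`, `kernel`, `memory`, `gate_w`) are hypothetical, and normalization, projections, residual connections, and feed-forward layers are left out. Plain Python lists are used for readability; a practical implementation would use batched tensor operations.

```python
import math
import random


def causal_conv(x, kernel):
    """Local pathway (assumed depthwise): output at t sees only x[t-k+1..t]."""
    n, d, k = len(x), len(x[0]), len(kernel)
    out = [[0.0] * d for _ in range(n)]
    for t in range(n):
        for lag in range(k):
            if t - lag >= 0:  # never look at future positions
                for c in range(d):
                    out[t][c] += kernel[lag][c] * x[t - lag][c]
    return out


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_retrieval(x, memory):
    """Global pathway (assumed): each token queries a fixed bank of slots,
    not the other tokens, so the cost per token does not grow with N."""
    d = len(x[0])
    out = []
    for token in x:
        scores = [sum(a * b for a, b in zip(token, slot)) for slot in memory]
        weights = softmax(scores)
        out.append([sum(w * slot[c] for w, slot in zip(weights, memory))
                    for c in range(d)])
    return out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gam_block(x, kernel, memory, gate_w):
    """Fuse both pathways with a per-token, per-channel gate (assumed form)."""
    local = causal_conv(x, kernel)
    global_ = memory_retrieval(x, memory)
    out = []
    for token, loc, glo in zip(x, local, global_):
        gate = [sigmoid(sum(w * v for w, v in zip(row, token))) for row in gate_w]
        out.append([g * l + (1.0 - g) * m for g, l, m in zip(gate, loc, glo)])
    return out


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_tokens, dim, kernel_size, n_slots = 6, 4, 3, 5  # illustrative sizes only
x = rand_matrix(n_tokens, dim, rng)
kernel = rand_matrix(kernel_size, dim, rng)
memory = rand_matrix(n_slots, dim, rng)
gate_w = rand_matrix(dim, dim, rng)

y = gam_block(x, kernel, memory, gate_w)
print(len(y), len(y[0]))  # 6 4
```

In this sketch, each token costs on the order of `kernel_size * dim + n_slots * dim + dim * dim` operations regardless of sequence length, which is where the linear O(N) total comes from. No position depends on another position's output, so all tokens could be computed in parallel.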
