Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling
August 30, 2025
Author: Rishiraj Acharya
cs.AI
Abstract
The Transformer architecture, underpinned by the self-attention mechanism,
has become the de facto standard for sequence modeling tasks. However, its core
computational primitive scales quadratically with sequence length (O(N^2)),
creating a significant bottleneck for processing long contexts. In this paper,
we propose the Gated Associative Memory (GAM) network, a novel, fully parallel
architecture for sequence modeling that exhibits linear complexity (O(N)) with
respect to sequence length. The GAM block replaces the self-attention layer
with two parallel pathways: a causal convolution to efficiently capture local,
position-dependent context, and a parallel associative memory retrieval
mechanism to model global, content-based patterns. These pathways are
dynamically fused using a gating mechanism, allowing the model to flexibly
combine local and global information for each token. We implement GAM from
scratch and conduct a rigorous comparative analysis against a standard
Transformer model and a modern linear-time baseline (Mamba) on the WikiText-2
benchmark, as well as against the Transformer on the TinyStories dataset. Our
experiments demonstrate that GAM is consistently faster, outperforming both
baselines in training speed, and achieves a superior or competitive final
validation perplexity across all datasets, establishing it as a promising and
efficient alternative for sequence modeling.
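To make the architecture described in the abstract concrete, the following is a minimal PyTorch sketch of a GAM-style block, based only on the description above. The class name GAMBlockSketch, the depthwise causal convolution, the number of memory slots, and the sigmoid-gated fusion are illustrative assumptions; the paper's actual implementation may differ in layer sizes, memory parameterization, and fusion details.

# Minimal sketch of a GAM-style block, assuming PyTorch. All design details
# below (depthwise causal conv, learned key/value memory slots, sigmoid gate)
# are assumptions inferred from the abstract, not the paper's exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAMBlockSketch(nn.Module):
    def __init__(self, d_model: int, kernel_size: int = 4, num_slots: int = 64):
        super().__init__()
        self.kernel_size = kernel_size
        # Local pathway: depthwise causal convolution over the sequence.
        self.local_conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)
        # Global pathway: a learned associative memory of key/value slots,
        # queried in parallel for every token (cost O(N * num_slots)).
        self.mem_keys = nn.Parameter(torch.randn(num_slots, d_model))
        self.mem_values = nn.Parameter(torch.randn(num_slots, d_model))
        self.query_proj = nn.Linear(d_model, d_model)
        # Gate that mixes the two pathways per token and per channel.
        self.gate = nn.Linear(2 * d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # Local, position-dependent context: left-pad so each position only
        # sees earlier tokens (causal).
        x_ch = x.transpose(1, 2)                          # (B, D, N)
        x_ch = F.pad(x_ch, (self.kernel_size - 1, 0))     # causal left padding
        local = self.local_conv(x_ch).transpose(1, 2)     # (B, N, D)
        # Global, content-based retrieval from the associative memory.
        q = self.query_proj(x)                            # (B, N, D)
        scores = q @ self.mem_keys.t()                    # (B, N, num_slots)
        global_ctx = F.softmax(scores, dim=-1) @ self.mem_values  # (B, N, D)
        # Gated fusion of local and global information for each token.
        g = torch.sigmoid(self.gate(torch.cat([local, global_ctx], dim=-1)))
        fused = g * local + (1.0 - g) * global_ctx
        return self.out_proj(fused)

Under these assumptions, GAMBlockSketch(256)(torch.randn(2, 128, 256)) returns a tensor of shape (2, 128, 256). Both pathways are computed in parallel over all positions, and the memory is queried against a fixed number of slots rather than against every other token, so the cost grows linearly with sequence length, in contrast to the O(N^2) pairwise interactions of self-attention.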