Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling
August 30, 2025
Author: Rishiraj Acharya
cs.AI
Abstract
The Transformer architecture, underpinned by the self-attention mechanism,
has become the de facto standard for sequence modeling tasks. However, its core
computational primitive scales quadratically with sequence length (O(N^2)),
creating a significant bottleneck for processing long contexts. In this paper,
we propose the Gated Associative Memory (GAM) network, a novel, fully parallel
architecture for sequence modeling that exhibits linear complexity (O(N)) with
respect to sequence length. The GAM block replaces the self-attention layer
with two parallel pathways: a causal convolution to efficiently capture local,
position-dependent context, and a parallel associative memory retrieval
mechanism to model global, content-based patterns. These pathways are
dynamically fused using a gating mechanism, allowing the model to flexibly
combine local and global information for each token. We implement GAM from
scratch and conduct a rigorous comparative analysis against a standard
Transformer model and a modern linear-time baseline (Mamba) on the WikiText-2
benchmark, as well as against the Transformer on the TinyStories dataset. Our
experiments demonstrate that GAM is consistently faster, outperforming both
baselines in training speed, and achieves a superior or competitive final
validation perplexity across all datasets, establishing it as a promising and
efficient alternative for sequence modeling.
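To make the architecture described in the abstract concrete, the following is a minimal PyTorch sketch of a GAM-style block, based only on the description above. The class name GAMBlockSketch, the depthwise causal convolution, the number of memory slots, and the sigmoid-gated fusion are illustrative assumptions; the paper's actual implementation may differ in layer sizes, memory parameterization, and fusion details.

# Minimal sketch of a GAM-style block, assuming PyTorch. All design details
# below (depthwise causal conv, learned key/value memory slots, sigmoid gate)
# are assumptions inferred from the abstract, not the paper's exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAMBlockSketch(nn.Module):
    def __init__(self, d_model: int, kernel_size: int = 4, num_slots: int = 64):
        super().__init__()
        self.kernel_size = kernel_size
        # Local pathway: depthwise causal convolution over the sequence.
        self.local_conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)
        # Global pathway: a learned associative memory of key/value slots,
        # queried in parallel for every token (cost O(N * num_slots)).
        self.mem_keys = nn.Parameter(torch.randn(num_slots, d_model))
        self.mem_values = nn.Parameter(torch.randn(num_slots, d_model))
        self.query_proj = nn.Linear(d_model, d_model)
        # Gate that mixes the two pathways per token and per channel.
        self.gate = nn.Linear(2 * d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # Local, position-dependent context: left-pad so each position only
        # sees earlier tokens (causal).
        x_ch = x.transpose(1, 2)                          # (B, D, N)
        x_ch = F.pad(x_ch, (self.kernel_size - 1, 0))     # causal left padding
        local = self.local_conv(x_ch).transpose(1, 2)     # (B, N, D)
        # Global, content-based retrieval from the associative memory.
        q = self.query_proj(x)                            # (B, N, D)
        scores = q @ self.mem_keys.t()                    # (B, N, num_slots)
        global_ctx = F.softmax(scores, dim=-1) @ self.mem_values  # (B, N, D)
        # Gated fusion of local and global information for each token.
        g = torch.sigmoid(self.gate(torch.cat([local, global_ctx], dim=-1)))
        fused = g * local + (1.0 - g) * global_ctx
        return self.out_proj(fused)

Under these assumptions, GAMBlockSketch(256)(torch.randn(2, 128, 256)) returns a tensor of shape (2, 128, 256). Both pathways are computed in parallel over all positions, and the memory is queried against a fixed number of slots rather than against every other token, so the cost grows linearly with sequence length, in contrast to the O(N^2) pairwise interactions of self-attention.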