

Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling

August 30, 2025
Author: Rishiraj Acharya
cs.AI

Abstract

The Transformer architecture, underpinned by the self-attention mechanism, has become the de facto standard for sequence modeling tasks. However, its core computational primitive scales quadratically with sequence length (O(N^2)), creating a significant bottleneck for processing long contexts. In this paper, we propose the Gated Associative Memory (GAM) network, a novel, fully parallel architecture for sequence modeling that exhibits linear complexity (O(N)) with respect to sequence length. The GAM block replaces the self-attention layer with two parallel pathways: a causal convolution to efficiently capture local, position-dependent context, and a parallel associative memory retrieval mechanism to model global, content-based patterns. These pathways are dynamically fused by a gating mechanism, allowing the model to flexibly combine local and global information for each token. We implement GAM from scratch and conduct a rigorous comparative analysis against a standard Transformer and a modern linear-time baseline (Mamba) on the WikiText-2 benchmark, as well as against the Transformer on the TinyStories dataset. Our experiments demonstrate that GAM is consistently faster, outperforming both baselines in training speed, and achieves superior or competitive final validation perplexity on all datasets, establishing it as a promising and efficient alternative for sequence modeling.
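To make the architecture description concrete, below is a minimal PyTorch sketch of a GAM-style block under stated assumptions: the abstract gives no implementation details, so the class name GAMBlock, the depthwise causal convolution, the fixed-size learned memory bank (num_slots), and the per-channel sigmoid gate are illustrative choices, not the authors' actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAMBlock(nn.Module):
    """Hypothetical sketch of a Gated Associative Memory block.

    Two parallel O(N) pathways replace self-attention:
      1. a causal depthwise convolution for local, position-dependent context;
      2. retrieval from a fixed-size learned memory bank for global,
         content-based patterns.
    A learned gate fuses the two pathways per token.
    """

    def __init__(self, d_model: int, num_slots: int = 64, kernel_size: int = 4):
        super().__init__()
        # Local pathway: depthwise conv, padded so no token sees the future.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)
        # Global pathway: learned memory keys/values, queried per token.
        self.mem_keys = nn.Parameter(torch.randn(num_slots, d_model) * d_model ** -0.5)
        self.mem_values = nn.Parameter(torch.randn(num_slots, d_model) * d_model ** -0.5)
        self.query = nn.Linear(d_model, d_model)
        # Gate: per-token, per-channel mixture of the two pathways.
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        # Causal conv over time; trim the right-side overhang from the padding.
        local = self.conv(x.transpose(1, 2))[..., :seq_len].transpose(1, 2)
        # Soft retrieval over num_slots memory entries: O(N * num_slots),
        # i.e. linear in sequence length, unlike O(N^2) self-attention.
        scores = self.query(x) @ self.mem_keys.t()                # (B, N, slots)
        global_ctx = F.softmax(scores, dim=-1) @ self.mem_values  # (B, N, D)
        # Dynamic fusion: sigmoid gate blends local and global information.
        g = torch.sigmoid(self.gate(x))
        return g * local + (1.0 - g) * global_ctx

# Example usage:
# block = GAMBlock(d_model=256)
# y = block(torch.randn(2, 128, 256))  # -> (2, 128, 256)
```

Because the memory bank has a fixed number of slots, the retrieval step costs O(N · num_slots) rather than O(N^2), which is consistent with the linear scaling claimed in the abstract; the paper's actual memory-retrieval and fusion details may differ.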