
Memory Caching: RNNs with Growing Memory

February 27, 2026
Authors: Ali Behrouz, Zeman Li, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni
cs.AI

Abstract

Transformers have been established as the de-facto backbones for most recent advances in sequence modeling, mainly due to their growing memory capacity, which scales with the context length. While this property is plausible for retrieval tasks, it causes quadratic complexity, which has motivated recent studies to explore viable subquadratic recurrent alternatives. Despite showing promising preliminary results in diverse domains, such recurrent architectures underperform Transformers in recall-intensive tasks, which is often attributed to their fixed-size memory. In this paper, we introduce Memory Caching (MC), a simple yet effective technique that enhances recurrent models by caching checkpoints of their memory states (a.k.a. hidden states). Memory Caching allows the effective memory capacity of RNNs to grow with sequence length, offering a flexible trade-off that interpolates between the fixed memory (i.e., O(L) complexity) of RNNs and the growing memory (i.e., O(L^2) complexity) of Transformers. We propose four variants of MC, including gated aggregation and sparse selective mechanisms, and discuss their implications for both linear and deep memory modules. Our experimental results on language modeling and long-context understanding tasks show that MC enhances the performance of recurrent models, supporting its effectiveness. The results of in-context recall tasks indicate that while Transformers achieve the best accuracy, our MC variants show competitive performance, substantially close the gap with Transformers, and perform better than state-of-the-art recurrent models.
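To make the core idea concrete, below is a minimal, hypothetical sketch of memory caching applied to a toy recurrent model. It assumes a plain tanh RNN, a fixed checkpoint interval, and a softmax-style gated aggregation over cached states at readout; the function name `mc_rnn_forward`, the parameter `checkpoint_every`, and the gate matrix `W_gate` are illustrative choices, not the paper's exact formulation or any of its four MC variants.

```python
import numpy as np

# Illustrative sketch of Memory Caching (MC) on a toy linear RNN.
# The checkpoint interval, gating form, and readout are assumptions
# made for exposition, not the paper's exact method.

def mc_rnn_forward(x, W_h, W_x, W_gate, checkpoint_every=16):
    """Run a simple RNN while caching hidden-state checkpoints, so the
    effective memory grows with sequence length (roughly L / checkpoint_every states)."""
    L, _ = x.shape
    d = W_h.shape[0]
    h = np.zeros(d)
    cache = []                                   # cached memory-state checkpoints
    outputs = []
    for t in range(L):
        h = np.tanh(W_h @ h + W_x @ x[t])        # standard recurrent update
        if (t + 1) % checkpoint_every == 0:
            cache.append(h.copy())               # checkpoint the memory state
        if cache:
            # Gated aggregation over cached states: score each checkpoint
            # against the current state, then mix them with softmax weights.
            C = np.stack(cache)                  # (num_checkpoints, d)
            scores = C @ (W_gate @ h)
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            h_read = h + weights @ C             # current state + retrieved memory
        else:
            h_read = h
        outputs.append(h_read)
    return np.stack(outputs)

# Toy usage
rng = np.random.default_rng(0)
d, d_in, L = 8, 4, 64
y = mc_rnn_forward(rng.normal(size=(L, d_in)),
                   rng.normal(size=(d, d)) * 0.1,
                   rng.normal(size=(d, d_in)) * 0.1,
                   rng.normal(size=(d, d)) * 0.1)
print(y.shape)  # (64, 8)
```

With a checkpoint every k steps, the cache holds about L/k states, so the model's usable memory grows with sequence length while remaining far smaller than the O(L^2) cost of full attention over the context.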