ATLAS: Learning to Optimally Memorize the Context at Test Time
May 29, 2025
Authors: Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni
cs.AI
Abstract
Transformers have been established as the most popular backbones in sequence
modeling, mainly due to their effectiveness in in-context retrieval tasks and
the ability to learn at scale. Their quadratic memory and time complexity,
however, bounds their applicability to longer sequences and so has motivated
researchers to explore effective alternative architectures such as modern
recurrent neural networks (a.k.a. long-term recurrent memory modules). Despite
their recent success in diverse downstream tasks, they struggle in tasks that
require long-context understanding and extrapolation to longer sequences. We
observe that these shortcomings come from three disjoint aspects in their
design: (1) a limited memory capacity that is bounded by the memory
architecture and the feature mapping of the input; (2) the online nature of
the update, i.e., optimizing the memory only with respect to the last input;
and (3) less expressive management of their fixed-size memory. To enhance all
three of these
aspects, we present ATLAS, a long-term memory module with high capacity that
learns to memorize the context by optimizing the memory based on the current
and past tokens, overcoming the online nature of long-term memory models.
Building on this insight, we present a new family of Transformer-like
architectures, called DeepTransformers, that are strict generalizations of the
original Transformer architecture. Our experimental results on language
modeling, common-sense reasoning, recall-intensive, and long-context
understanding tasks show that ATLAS surpasses the performance of Transformers
and recent linear recurrent models. ATLAS further improves the long-context
performance of Titans, achieving +80% accuracy at the 10M context length of
the BABILong benchmark.
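To make the abstract's central distinction concrete, below is a minimal, hypothetical sketch (not the paper's actual algorithm or code) contrasting an online memory update, which fits a test-time memory only to the most recent token, with a window-based update that optimizes the memory over the current and past tokens. The linear associative memory `M`, the squared-error objective, the learning rate, and the window size are all illustrative assumptions.

```python
import numpy as np

def online_update(M, k, v, lr=0.1):
    # One gradient step on 0.5 * ||M @ k - v||^2 for the latest (key, value) only.
    grad = np.outer(M @ k - v, k)
    return M - lr * grad

def window_update(M, keys, values, lr=0.1):
    # One gradient step on the average loss over a window of recent (key, value) pairs,
    # i.e., the memory is optimized with respect to current *and* past tokens.
    grad = np.zeros_like(M)
    for k, v in zip(keys, values):
        grad += np.outer(M @ k - v, k)
    return M - lr * grad / len(keys)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    M_online = np.zeros((d, d))   # memory updated with the last token only
    M_window = np.zeros((d, d))   # memory updated over a sliding window of tokens
    win_k, win_v = [], []
    for _ in range(32):
        k, v = rng.normal(size=d), rng.normal(size=d)
        win_k, win_v = (win_k + [k])[-4:], (win_v + [v])[-4:]  # illustrative window of 4
        M_online = online_update(M_online, k, v)
        M_window = window_update(M_window, win_k, win_v)
    # Recall error for an older key that is still inside the window.
    print(np.linalg.norm(M_online @ win_k[0] - win_v[0]),
          np.linalg.norm(M_window @ win_k[0] - win_v[0]))
```

The actual ATLAS memory and its optimizer are far more expressive than this linear toy; the sketch only illustrates the shift from optimizing memory on the last input alone to optimizing it over a window of current and past tokens, which the abstract identifies as the key departure from purely online long-term memory models.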