ATLAS: Learning to Optimally Memorize the Context at Test Time
May 29, 2025
Authors: Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni
cs.AI
Abstract
Transformers have been established as the most popular backbones in sequence
modeling, mainly due to their effectiveness in in-context retrieval tasks and
their ability to learn at scale. Their quadratic memory and time complexity,
however, bounds their applicability to longer sequences, which has motivated
researchers to explore effective alternative architectures such as modern
recurrent neural networks (a.k.a. long-term recurrent memory modules). Despite
their recent success in diverse downstream tasks, these models struggle in
tasks that require long-context understanding and extrapolation to longer
sequences. We observe that these shortcomings come from three disjoint aspects
of their design: (1) limited memory capacity, bounded by the architecture of
the memory and the feature mapping of the input; (2) the online nature of the
update, i.e., optimizing the memory only with respect to the last input; and
(3) less expressive management of their fixed-size memory. To enhance all
three of these aspects, we present ATLAS, a high-capacity long-term memory
module that learns to memorize the context by optimizing the memory based on
the current and past tokens, overcoming the online nature of long-term memory
models. Building on this insight, we present a new family of Transformer-like
architectures, called DeepTransformers, that are strict generalizations of the
original Transformer architecture. Our experimental results on language
modeling, common-sense reasoning, recall-intensive, and long-context
understanding tasks show that ATLAS surpasses the performance of Transformers
and recent linear recurrent models. ATLAS further improves the long-context
performance of Titans, achieving +80% accuracy at 10M context length on the
BABILong benchmark.
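
To make the abstract's central distinction concrete, below is a minimal NumPy sketch contrasting an online memory update, which optimizes the memory only against the most recent token, with a windowed update that optimizes it against current and past tokens, in the spirit of ATLAS. The linear associative memory, the squared-error objective, and the function names (`online_update`, `windowed_update`) are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def online_update(M, k, v, lr=0.1):
    """Online update: the memory matrix M is optimized only with respect
    to the most recent key/value pair (k, v)."""
    # Gradient of 0.5 * ||M @ k - v||^2 with respect to M
    grad = np.outer(M @ k - v, k)
    return M - lr * grad

def windowed_update(M, keys, values, lr=0.1):
    """Context-aware update (illustrative): the memory is optimized with
    respect to a window of current AND past key/value pairs, so earlier
    tokens keep shaping the memory instead of influencing only one step."""
    grad = np.zeros_like(M)
    for k, v in zip(keys, values):
        grad += np.outer(M @ k - v, k)
    return M - lr * grad / len(keys)

# Toy usage: d-dimensional keys/values, memory stored as a d x d matrix.
d, window = 8, 4
rng = np.random.default_rng(0)
M = np.zeros((d, d))
keys = [rng.standard_normal(d) for _ in range(window)]
values = [rng.standard_normal(d) for _ in range(window)]

M_online = online_update(M, keys[-1], values[-1])  # sees only the last token
M_window = windowed_update(M, keys, values)        # sees the whole window
```

The point of the contrast is only the scope of the objective: both updates share the same fixed-size memory, but the windowed variant descends on a loss accumulated over several tokens rather than the last one alone.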