ATLAS: Learning to Optimally Memorize the Context at Test Time

Abstract

Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in in-context retrieval tasks and their ability to learn at scale. Their quadratic memory and time complexity, however, limits their applicability to longer sequences and has motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a. long-term recurrent memory modules). Despite their recent success in diverse downstream tasks, these recurrent models struggle in tasks that require long-context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects of their design: (1) limited memory capacity, bounded by the architecture of the memory and the feature mapping of the input; (2) the online nature of the update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To enhance all three aspects, we present ATLAS, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks show that ATLAS surpasses the performance of Transformers and recent linear recurrent models. ATLAS further improves the long-context performance of Titans, achieving +80% accuracy at a 10M context length on the BABILong benchmark.
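To make aspect (2) concrete, the following toy sketch contrasts an online update, which fits the memory to the latest token only, with an update that fits the memory to a short window of current and past tokens. It is an illustration under stated assumptions and not the ATLAS update rule: the memory is assumed to be a plain matrix trained by gradient descent on a squared recall loss, and the names `memorize`, `grad_step`, `recall_error`, and `demo_stream`, as well as the window size and learning rate, are hypothetical choices made for this example.

```python
# Illustrative sketch only: a linear associative memory M trained by plain
# gradient descent. This is NOT the ATLAS update rule from the paper.

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def grad_step(M, pairs, lr):
    """One gradient step on sum over (k, v) in pairs of 0.5 * ||M k - v||^2."""
    rows, cols = len(M), len(M[0])
    G = [[0.0] * cols for _ in range(rows)]
    for k, v in pairs:
        err = [p - t for p, t in zip(matvec(M, k), v)]
        for i in range(rows):
            for j in range(cols):
                G[i][j] += err[i] * k[j]  # gradient of the loss is err * k^T
    return [[M[i][j] - lr * G[i][j] for j in range(cols)] for i in range(rows)]


def memorize(stream, dim, window, lr=0.5):
    """Process the stream token by token; each step fits the last `window` pairs.

    window=1 is the online case (only the latest pair is used).
    """
    M = [[0.0] * dim for _ in range(dim)]
    for t in range(len(stream)):
        start = max(0, t - window + 1)
        M = grad_step(M, stream[start:t + 1], lr)
    return M


def recall_error(M, stream):
    """Mean squared recall error of M over every pair in the stream."""
    total = 0.0
    for k, v in stream:
        total += sum((p - t) ** 2 for p, t in zip(matvec(M, k), v))
    return total / len(stream)


def demo_stream():
    """Hypothetical toy data: one-hot keys, each mapped to a one-hot value."""
    dim = 3
    eye = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    return [(eye[j], eye[(j + 1) % dim]) for j in range(dim)]


if __name__ == "__main__":
    stream = demo_stream()
    online = memorize(stream, dim=3, window=1)
    windowed = memorize(stream, dim=3, window=2)
    print("online  :", recall_error(online, stream))
    print("windowed:", recall_error(windowed, stream))
```

On this toy stream the online memory ends with a mean recall error of 0.25, while the window of two tokens ends at 0.125, because earlier pairs receive a second gradient step before they leave the window. The gap depends entirely on the toy setup (orthogonal keys, fixed learning rate) and says nothing about the magnitude of the gains reported for ATLAS.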
