Linear Transformers are Versatile In-Context Learners

Abstract

Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step. However, their capability in handling more complex problems remains unexplored. In this paper, we prove that any linear transformer maintains an implicit linear model and can be interpreted as performing a variant of preconditioned gradient descent. We also investigate the use of linear transformers in a challenging scenario where the training data is corrupted with different levels of noise. Remarkably, we demonstrate that for this problem linear transformers discover an intricate and highly effective optimization algorithm, matching or surpassing the performance of many reasonable baselines. We reverse-engineer this algorithm and show that it is a novel approach incorporating momentum and adaptive rescaling based on noise levels. Our findings show that even linear transformers possess the surprising ability to discover sophisticated optimization strategies.
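
As a minimal sketch of the kind of correspondence the abstract refers to, the following pure-Python example builds a single linear attention layer (no softmax) whose update to the query token's label slot equals the prediction obtained after one gradient descent step from zero weights on in-context least-squares data. The token layout `(x_i, y_i)`, the matrices `P` and `Q`, and all function names are illustrative assumptions in the spirit of earlier constructions in the literature. They are not the algorithm analyzed in this paper, and the sketch covers neither preconditioning nor noisy data.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in M]


def linear_attention(tokens, n_context, P, Q):
    """Apply e_j <- e_j + sum_i (P e_i) * (e_i . Q e_j) over the context tokens.

    No softmax is applied to the scores, which is what makes the layer linear.
    """
    context = tokens[:n_context]
    projected = [matvec(P, e_i) for e_i in context]
    updated = []
    for e_j in tokens:
        q = matvec(Q, e_j)
        new = list(e_j)
        for e_i, p_i in zip(context, projected):
            score = dot(e_i, q)
            new = [a + score * b for a, b in zip(new, p_i)]
        updated.append(new)
    return updated


def gd_step_prediction(xs, ys, x_query, lr):
    """Predict at x_query after one gradient step from w = 0 on half the mean squared error."""
    n, d = len(xs), len(x_query)
    w = [0.0] * d
    residuals = [dot(w, x) - y for x, y in zip(xs, ys)]
    grad = [sum(r * x[k] for r, x in zip(residuals, xs)) / n for k in range(d)]
    w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return dot(w, x_query)


def demo(seed=0, n=16, d=4, lr=0.1):
    """Return (attention prediction, gradient-descent prediction) for random noiseless data."""
    rng = random.Random(seed)
    w_true = [rng.gauss(0, 1) for _ in range(d)]
    xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    ys = [dot(w_true, x) for x in xs]
    x_query = [rng.gauss(0, 1) for _ in range(d)]

    # Tokens are (x_i, y_i); the query token carries 0 in the label slot.
    tokens = [x + [y] for x, y in zip(xs, ys)] + [x_query + [0.0]]

    # Q keeps only the x-part of a token; P copies the label slot, scaled by lr / n.
    Q = [[1.0 if r == c and r < d else 0.0 for c in range(d + 1)] for r in range(d + 1)]
    P = [[lr / n if r == d and c == d else 0.0 for c in range(d + 1)] for r in range(d + 1)]

    out = linear_attention(tokens, n, P, Q)
    return out[n][d], gd_step_prediction(xs, ys, x_query, lr)
```

Calling `demo()` returns the two predictions, which agree up to floating-point rounding: with this choice of `P` and `Q`, the layer writes `(lr / n) * sum_i y_i * (x_i . x_query)` into the query's label slot, which is exactly the one-step gradient descent prediction.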
