Linear Transformers are Versatile In-Context Learners
February 21, 2024
Authors: Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge
cs.AI
Abstract
Recent research has demonstrated that transformers, particularly linear
attention models, implicitly execute gradient-descent-like algorithms on data
provided in-context during their forward inference step. However, their
capability in handling more complex problems remains unexplored. In this paper,
we prove that any linear transformer maintains an implicit linear model and can
be interpreted as performing a variant of preconditioned gradient descent. We
also investigate the use of linear transformers in a challenging scenario where
the training data is corrupted with different levels of noise. Remarkably, we
demonstrate that for this problem linear transformers discover an intricate and
highly effective optimization algorithm, surpassing or matching in performance
many reasonable baselines. We reverse-engineer this algorithm and show that it
is a novel approach incorporating momentum and adaptive rescaling based on
noise levels. Our findings show that even linear transformers possess the
surprising ability to discover sophisticated optimization strategies.
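
To make the implicit-linear-model claim concrete, the following is a minimal NumPy sketch of the well-known correspondence between a single linear-attention read-out and one step of gradient descent on in-context least-squares data, starting from zero weights. The variable names (`X`, `y`, `x_q`, `eta`) and the single-head, identity-projection setup are illustrative assumptions rather than the paper's exact construction; the preconditioned variant discussed in the abstract would insert a learned matrix between the query and key features.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 32                       # feature dimension, number of in-context examples
w_true = rng.normal(size=d)

# In-context regression data (x_i, y_i) and a query point x_q.
X = rng.normal(size=(n, d))
y = X @ w_true
x_q = rng.normal(size=d)

eta = 0.1                          # step size, assumed folded into the attention weights

# One gradient-descent step on the in-context least-squares loss, starting
# from w = 0: w_1 = eta * X^T y, so the prediction at x_q is eta * x_q^T X^T y.
w_gd = eta * X.T @ y
pred_gd = x_q @ w_gd

# A single linear-attention read-out (no softmax): the query token attends to
# the context tokens with unnormalized scores x_q . x_i and values y_i.
scores = X @ x_q                   # (n,) attention scores
pred_attn = eta * scores @ y       # eta * sum_i (x_q . x_i) * y_i

# The two predictions coincide exactly.
assert np.allclose(pred_gd, pred_attn)
print(pred_gd, pred_attn)
```

Stacking such layers corresponds to taking further (preconditioned) gradient steps, which is the implicit optimization the abstract refers to. The noisy-data setting and the momentum and adaptive-rescaling behavior described above emerge only after training the attention weights, which this sketch does not attempt.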