Linear Transformers are Versatile In-Context Learners
February 21, 2024
Authors: Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge
cs.AI
Abstract
Recent research has demonstrated that transformers, particularly linear
attention models, implicitly execute gradient-descent-like algorithms on data
provided in-context during their forward inference step. However, their
capability in handling more complex problems remains unexplored. In this paper,
we prove that any linear transformer maintains an implicit linear model and can
be interpreted as performing a variant of preconditioned gradient descent. We
also investigate the use of linear transformers in a challenging scenario where
the training data is corrupted with different levels of noise. Remarkably, we
demonstrate that for this problem linear transformers discover an intricate and
highly effective optimization algorithm, surpassing or matching in performance
many reasonable baselines. We reverse-engineer this algorithm and show that it
is a novel approach incorporating momentum and adaptive rescaling based on
noise levels. Our findings show that even linear transformers possess the
surprising ability to discover sophisticated optimization strategies.
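
To make the implicit-linear-model claim concrete, the following is a minimal NumPy sketch of the well-known correspondence between a single linear-attention read-out and one step of gradient descent on in-context least-squares data, starting from zero weights. The variable names (`X`, `y`, `x_q`, `eta`) and the single-head, identity-projection setup are illustrative assumptions rather than the paper's exact construction; the preconditioned variant discussed in the abstract would insert a learned matrix between the query and key features.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 32                       # feature dimension, number of in-context examples
w_true = rng.normal(size=d)

# In-context regression data (x_i, y_i) and a query point x_q.
X = rng.normal(size=(n, d))
y = X @ w_true
x_q = rng.normal(size=d)

eta = 0.1                          # step size, assumed folded into the attention weights

# One gradient-descent step on the in-context least-squares loss, starting
# from w = 0: w_1 = eta * X^T y, so the prediction at x_q is eta * x_q^T X^T y.
w_gd = eta * X.T @ y
pred_gd = x_q @ w_gd

# A single linear-attention read-out (no softmax): the query token attends to
# the context tokens with unnormalized scores x_q . x_i and values y_i.
scores = X @ x_q                   # (n,) attention scores
pred_attn = eta * scores @ y       # eta * sum_i (x_q . x_i) * y_i

# The two predictions coincide exactly.
assert np.allclose(pred_gd, pred_attn)
print(pred_gd, pred_attn)
```

Stacking such layers corresponds to taking further (preconditioned) gradient steps, which is the implicit optimization the abstract refers to. The noisy-data setting and the momentum and adaptive-rescaling behavior described above emerge only after training the attention weights, which this sketch does not attempt.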