Linear Transformers are Versatile In-Context Learners
February 21, 2024
Authors: Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge
cs.AI
Abstract
Recent research has demonstrated that transformers, particularly linear
attention models, implicitly execute gradient-descent-like algorithms on data
provided in-context during their forward inference step. However, their
capability in handling more complex problems remains unexplored. In this paper,
we prove that any linear transformer maintains an implicit linear model and can
be interpreted as performing a variant of preconditioned gradient descent. We
also investigate the use of linear transformers in a challenging scenario where
the training data is corrupted with different levels of noise. Remarkably, we
demonstrate that for this problem, linear transformers discover an intricate and
highly effective optimization algorithm, surpassing or matching the performance
of many reasonable baselines. We reverse-engineer this algorithm and show that it
is a novel approach incorporating momentum and adaptive rescaling based on
noise levels. Our findings show that even linear transformers possess the
surprising ability to discover sophisticated optimization strategies.
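As a rough illustration of the correspondence the abstract describes (not code from the paper), the sketch below shows, for noiseless in-context linear regression, how a single linear-attention readout produces the same query prediction as one step of preconditioned gradient descent on an implicit linear model starting from zero weights. The preconditioner `P`, step size `eta`, and data-generating setup are illustrative assumptions, not values from the paper.

```python
# Minimal numpy sketch: one linear-attention readout vs. one preconditioned GD step.
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 4                       # context length, input dimension
X = rng.normal(size=(n, d))        # in-context inputs x_i
w_true = rng.normal(size=d)
y = X @ w_true                     # in-context targets y_i (noiseless for simplicity)
x_q = rng.normal(size=d)           # query input

eta = 0.1                          # hypothetical step size
P = np.eye(d)                      # hypothetical preconditioner (identity => plain GD)

# (1) One preconditioned GD step on L(w) = 0.5 * sum_i (x_i . w - y_i)^2, from w = 0.
grad_at_zero = -X.T @ y            # gradient of L at w = 0
w1 = -eta * P @ grad_at_zero       # w1 = eta * P @ X.T @ y
pred_gd = x_q @ w1

# (2) A single linear-attention readout: keys are the x_i, the query is x_q,
# values are the targets y_i; P plays the role of the merged query/key weight matrix.
scores = (x_q @ P) @ X.T           # unnormalized linear-attention scores x_q^T P x_i
pred_attn = eta * scores @ y       # attention output for the query token

assert np.allclose(pred_gd, pred_attn)
print(pred_gd, pred_attn)          # the two predictions coincide
```

In this toy setting the identity between the two predictions is exact; the paper's analysis concerns what multi-layer linear transformers learn to do in such settings, including under noisy targets, which this sketch does not attempt to reproduce.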