One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention
July 7, 2023
Authors: Arvind Mahankali, Tatsunori B. Hashimoto, Tengyu Ma
cs.AI
Abstract
Recent works have empirically analyzed in-context learning and shown that
transformers trained on synthetic linear regression tasks can learn to
implement ridge regression, which is the Bayes-optimal predictor, given
sufficient capacity [Akyürek et al., 2023], while one-layer transformers with
linear self-attention and no MLP layer will learn to implement one step of
gradient descent (GD) on a least-squares linear regression objective [von
Oswald et al., 2022]. However, the theory behind these observations remains
poorly understood. We theoretically study transformers with a single layer of
linear self-attention, trained on synthetic noisy linear regression data.
First, we mathematically show that when the covariates are drawn from a
standard Gaussian distribution, the one-layer transformer which minimizes the
pre-training loss will implement a single step of GD on the least-squares
linear regression objective. Then, we find that changing the distribution of
the covariates and weight vector to a non-isotropic Gaussian distribution has a
strong impact on the learned algorithm: the global minimizer of the
pre-training loss now implements a single step of pre-conditioned
GD. However, if only the distribution of the responses is changed, then this
does not have a large effect on the learned algorithm: even when the response
comes from a more general family of nonlinear functions, the global
minimizer of the pre-training loss still implements a single step of GD on a
least-squares linear regression objective.
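
The following is a minimal numerical sketch of the equivalence the abstract describes: a single linear self-attention head (no softmax, no MLP) whose weights are chosen so that its prediction on a query token matches one step of gradient descent, from zero initialization, on the least-squares linear regression objective over the in-context examples. The specific weight construction, variable names, and step size eta below are illustrative assumptions in the spirit of the von Oswald et al. observation, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 5, 20, 0.1  # illustrative dimension, context length, step size

# Synthetic noisy linear regression task (illustrative setup):
# covariates ~ N(0, I), responses y_i = w*^T x_i + noise.
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))                    # in-context examples
y = X @ w_star + 0.1 * rng.normal(size=n)
x_query = rng.normal(size=d)

# One step of GD from w = 0 on the least-squares objective
#   L(w) = 1/(2n) * sum_i (w^T x_i - y_i)^2,
# which gives w_1 = (eta / n) * sum_i y_i x_i.
w_gd = (eta / n) * X.T @ y
pred_gd = x_query @ w_gd

# A single linear self-attention head acting on tokens z_i = (x_i, y_i),
# with query token (x_query, 0). The key/query product is (eta/n) * I on
# the x-coordinates and the value map reads out y_i, so the head's output
# at the query position equals the one-step-GD prediction above.
Z = np.hstack([X, y[:, None]])                 # context tokens, shape (n, d+1)
z_query = np.concatenate([x_query, [0.0]])     # query token
W_KQ = np.zeros((d + 1, d + 1))
W_KQ[:d, :d] = (eta / n) * np.eye(d)           # attend via x-coordinates only
w_V = np.zeros(d + 1)
w_V[d] = 1.0                                   # value map reads out y_i
attn_scores = Z @ W_KQ @ z_query               # linear (un-normalized) attention
pred_lsa = (Z @ w_V) @ attn_scores             # scalar prediction at query token

print(pred_gd, pred_lsa)                       # the two predictions coincide
assert np.isclose(pred_gd, pred_lsa)
```

Under this construction the two predictions agree exactly for any draw of the data; the paper's question is which such weight configuration globally minimizes the pre-training loss under different data distributions.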