One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention

July 7, 2023
Authors: Arvind Mahankali, Tatsunori B. Hashimoto, Tengyu Ma
cs.AI

Abstract

Recent works have empirically analyzed in-context learning and shown that transformers trained on synthetic linear regression tasks can learn to implement ridge regression, which is the Bayes-optimal predictor, given sufficient capacity [Akyürek et al., 2023], while one-layer transformers with linear self-attention and no MLP layer will learn to implement one step of gradient descent (GD) on a least-squares linear regression objective [von Oswald et al., 2022]. However, the theory behind these observations remains poorly understood. We theoretically study transformers with a single layer of linear self-attention, trained on synthetic noisy linear regression data. First, we mathematically show that when the covariates are drawn from a standard Gaussian distribution, the one-layer transformer which minimizes the pre-training loss will implement a single step of GD on the least-squares linear regression objective. Then, we find that changing the distribution of the covariates and weight vector to a non-isotropic Gaussian distribution has a strong impact on the learned algorithm: the global minimizer of the pre-training loss now implements a single step of pre-conditioned GD. However, if only the distribution of the responses is changed, then this does not have a large effect on the learned algorithm: even when the response comes from a more general family of nonlinear functions, the global minimizer of the pre-training loss still implements a single step of GD on a least-squares linear regression objective.
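
To make the central claim concrete, here is a minimal numerical sketch (not the paper's code) of how a single layer of linear self-attention can express one step of GD on the least-squares objective. The weight construction follows the style of von Oswald et al. [2022]; the variable names, learning rate eta, and problem sizes are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch: one step of GD on least-squares linear regression
# versus one layer of linear self-attention (no softmax, no MLP).
import numpy as np

rng = np.random.default_rng(0)
d, n, eta, noise = 5, 20, 0.1, 0.1   # illustrative dimensions and learning rate

# Synthetic noisy linear regression prompt: (x_i, y_i) pairs plus a query x_q.
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))                  # covariates ~ N(0, I_d)
y = X @ w_star + noise * rng.normal(size=n)  # noisy responses
x_q = rng.normal(size=d)                     # query covariate

# (1) One step of GD from w_0 = 0 on L(w) = (1/2n) * sum_i (w^T x_i - y_i)^2
#     gives w_1 = (eta / n) * sum_i y_i x_i, hence this prediction at x_q:
pred_gd = (eta / n) * x_q @ (X.T @ y)

# (2) Linear self-attention over context tokens e_i = [x_i; y_i]
#     and a query token e_q = [x_q; 0].
E = np.hstack([X, y[:, None]])       # (n, d+1) context tokens
e_q = np.concatenate([x_q, [0.0]])   # (d+1,) query token

W_KQ = np.zeros((d + 1, d + 1))      # merged key/query matrix reads the x-part
W_KQ[:d, :d] = np.eye(d)
W_V = np.zeros((d + 1, d + 1))       # value matrix reads the (scaled) y-part
W_V[d, d] = eta / n

# Attention update at the query position: sum_i (W_V e_i) * <W_KQ e_i, e_q>.
attn_out = (W_V @ E.T) @ (E @ W_KQ.T @ e_q)

# The last coordinate of the attention output matches the one-step-GD prediction.
print(pred_gd, attn_out[d])
assert np.allclose(pred_gd, attn_out[d])
```

In this sketch, the non-isotropic case described in the abstract corresponds to replacing the identity block in W_KQ with a preconditioning matrix, so that the attention output becomes one step of pre-conditioned GD; the paper characterizes which preconditioner the global minimizer of the pre-training loss actually implements.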