One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention

Abstract

Recent works have empirically analyzed in-context learning and shown that transformers trained on synthetic linear regression tasks can learn to implement ridge regression, which is the Bayes-optimal predictor, given sufficient capacity [Akyürek et al., 2023]. In contrast, one-layer transformers with linear self-attention and no MLP layer learn to implement one step of gradient descent (GD) on a least-squares linear regression objective [von Oswald et al., 2022]. However, the theory behind these observations remains poorly understood. We theoretically study transformers with a single layer of linear self-attention, trained on synthetic noisy linear regression data. First, we mathematically show that when the covariates are drawn from a standard Gaussian distribution, the one-layer transformer which minimizes the pre-training loss implements a single step of GD on the least-squares linear regression objective. Then, we find that changing the distribution of the covariates and weight vector to a non-isotropic Gaussian distribution has a strong impact on the learned algorithm: the global minimizer of the pre-training loss now implements a single step of pre-conditioned GD. However, if only the distribution of the responses is changed, the learned algorithm is largely unaffected: even when the response comes from a more general family of nonlinear functions, the global minimizer of the pre-training loss still implements a single step of GD on a least-squares linear regression objective.
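The correspondence between one GD step and linear self-attention can be illustrated numerically. The sketch below is not the paper's construction or parameterization; it assumes GD starts from a zero weight vector, uses a hypothetical step size `eta`, and models linear self-attention in its simplest form (dot-product scores between the query covariate and each in-context covariate, no softmax, responses as values). All function and variable names are illustrative.

```python
import numpy as np

def one_step_gd_prediction(X, y, x_query, eta):
    """Prediction at x_query after one GD step from w = 0
    on the least-squares objective 0.5 * sum((X @ w - y) ** 2)."""
    w0 = np.zeros(X.shape[1])
    grad = X.T @ (X @ w0 - y)      # gradient at w0; equals -X.T @ y when w0 = 0
    w1 = w0 - eta * grad           # single gradient descent step
    return float(w1 @ x_query)

def linear_attention_prediction(X, y, x_query, eta):
    """Softmax-free attention readout: dot-product scores against the
    query covariate, with the in-context responses as values."""
    scores = X @ x_query           # one unnormalized score per in-context example
    return float(eta * (scores @ y))

# Synthetic noisy linear regression prompt (assumed sizes and noise level).
rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.standard_normal((n, d))            # covariates from a standard Gaussian
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)
x_query = rng.standard_normal(d)
eta = 1.0 / n                              # hypothetical step size

gd_pred = one_step_gd_prediction(X, y, x_query, eta)
attn_pred = linear_attention_prediction(X, y, x_query, eta)
print(np.isclose(gd_pred, attn_pred))
```

Both functions compute `eta * sum_i y_i * (x_i . x_query)`, so the two predictions agree up to floating-point error. In the pre-conditioned case mentioned in the abstract, the dot product would be replaced by a bilinear form `x_i.T @ P @ x_query` for some matrix `P`, which this sketch does not attempt to derive.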
