Uncovering mesa-optimization algorithms in Transformers
September 11, 2023
Authors: Johannes von Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Blaise Agüera y Arcas, Max Vladymyrov, Razvan Pascanu, João Sacramento
cs.AI
Abstract
Transformers have become the dominant model in deep learning, but the reason
for their superior performance is poorly understood. Here, we hypothesize that
the strong performance of Transformers stems from an architectural bias towards
mesa-optimization, a learned process running within the forward pass of a model
consisting of the following two steps: (i) the construction of an internal
learning objective, and (ii) its corresponding solution found through
optimization. To test this hypothesis, we reverse-engineer a series of
autoregressive Transformers trained on simple sequence modeling tasks,
uncovering underlying gradient-based mesa-optimization algorithms driving the
generation of predictions. Moreover, we show that the learned forward-pass
optimization algorithm can be immediately repurposed to solve supervised
few-shot tasks, suggesting that mesa-optimization might underlie the in-context
learning capabilities of large language models. Finally, we propose a novel
self-attention layer, the mesa-layer, that explicitly and efficiently solves
optimization problems specified in context. We find that this layer can lead to
improved performance in synthetic and preliminary language modeling
experiments, adding weight to our hypothesis that mesa-optimization is an
important operation hidden within the weights of trained Transformers.
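To make the mesa-optimization idea concrete, below is a minimal numerical sketch, not the paper's implementation: given in-context input/target pairs, the forward pass is hypothesized to (i) form an internal least-squares objective and (ii) reduce it with a few gradient steps before predicting on a query token. All variable names, dimensions, the step count, and the learning rate are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of forward-pass ("mesa") gradient descent on an
# in-context least-squares objective. Sizes and hyperparameters are assumptions.
rng = np.random.default_rng(0)
d_in, d_out, T = 4, 2, 32

W_true = rng.normal(size=(d_out, d_in))   # ground-truth linear map generating the context
X = rng.normal(size=(T, d_in))            # in-context inputs x_1..x_T
Y = X @ W_true.T                          # in-context targets y_1..y_T
x_query = rng.normal(size=d_in)           # query token to predict on

# (i) internal objective: L(W) = (1/2T) * sum_t ||W x_t - y_t||^2
# (ii) solve it with a few gradient steps inside the "forward pass"
W = np.zeros((d_out, d_in))               # implicit fast weights, initialized at zero
lr = 0.2
for _ in range(20):
    grad = (W @ X.T - Y.T) @ X / T        # gradient of the squared-error objective
    W -= lr * grad

prediction = W @ x_query
target = W_true @ x_query
print("query error:", np.linalg.norm(prediction - target))
```

The mesa-layer proposed in the paper is described as solving such an in-context optimization problem explicitly and efficiently rather than by iterative gradient steps; the loop above only illustrates the general two-step scheme stated in the abstract.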