Uncovering mesa-optimization algorithms in Transformers
September 11, 2023
Authors: Johannes von Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Blaise Agüera y Arcas, Max Vladymyrov, Razvan Pascanu, João Sacramento
cs.AI
Abstract
Transformers have become the dominant model in deep learning, but the reason
for their superior performance is poorly understood. Here, we hypothesize that
the strong performance of Transformers stems from an architectural bias towards
mesa-optimization, a learned process running within the forward pass of a model
consisting of the following two steps: (i) the construction of an internal
learning objective, and (ii) its corresponding solution found through
optimization. To test this hypothesis, we reverse-engineer a series of
autoregressive Transformers trained on simple sequence modeling tasks,
uncovering underlying gradient-based mesa-optimization algorithms driving the
generation of predictions. Moreover, we show that the learned forward-pass
optimization algorithm can be immediately repurposed to solve supervised
few-shot tasks, suggesting that mesa-optimization might underlie the in-context
learning capabilities of large language models. Finally, we propose a novel
self-attention layer, the mesa-layer, that explicitly and efficiently solves
optimization problems specified in context. We find that this layer can lead to
improved performance in synthetic and preliminary language modeling
experiments, adding weight to our hypothesis that mesa-optimization is an
important operation hidden within the weights of trained Transformers.
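To make the mesa-optimization idea concrete, below is a minimal numerical sketch, not the paper's implementation: given in-context input/target pairs, the forward pass is hypothesized to (i) form an internal least-squares objective and (ii) reduce it with a few gradient steps before predicting on a query token. All variable names, dimensions, the step count, and the learning rate are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of forward-pass ("mesa") gradient descent on an
# in-context least-squares objective. Sizes and hyperparameters are assumptions.
rng = np.random.default_rng(0)
d_in, d_out, T = 4, 2, 32

W_true = rng.normal(size=(d_out, d_in))   # ground-truth linear map generating the context
X = rng.normal(size=(T, d_in))            # in-context inputs x_1..x_T
Y = X @ W_true.T                          # in-context targets y_1..y_T
x_query = rng.normal(size=d_in)           # query token to predict on

# (i) internal objective: L(W) = (1/2T) * sum_t ||W x_t - y_t||^2
# (ii) solve it with a few gradient steps inside the "forward pass"
W = np.zeros((d_out, d_in))               # implicit fast weights, initialized at zero
lr = 0.2
for _ in range(20):
    grad = (W @ X.T - Y.T) @ X / T        # gradient of the squared-error objective
    W -= lr * grad

prediction = W @ x_query
target = W_true @ x_query
print("query error:", np.linalg.norm(prediction - target))
```

The mesa-layer proposed in the paper is described as solving such an in-context optimization problem explicitly and efficiently rather than by iterative gradient steps; the loop above only illustrates the general two-step scheme stated in the abstract.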