Uncovering mesa-optimization algorithms in Transformers
September 11, 2023
Authors: Johannes von Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Blaise Agüera y Arcas, Max Vladymyrov, Razvan Pascanu, João Sacramento
cs.AI
Abstract
Transformers have become the dominant model in deep learning, but the reason
for their superior performance is poorly understood. Here, we hypothesize that
the strong performance of Transformers stems from an architectural bias towards
mesa-optimization, a learned process, running within the forward pass of a model,
that consists of the following two steps: (i) the construction of an internal
learning objective, and (ii) its corresponding solution found through
optimization. To test this hypothesis, we reverse-engineer a series of
autoregressive Transformers trained on simple sequence modeling tasks,
uncovering underlying gradient-based mesa-optimization algorithms driving the
generation of predictions. Moreover, we show that the learned forward-pass
optimization algorithm can be immediately repurposed to solve supervised
few-shot tasks, suggesting that mesa-optimization might underlie the in-context
learning capabilities of large language models. Finally, we propose a novel
self-attention layer, the mesa-layer, that explicitly and efficiently solves
optimization problems specified in context. We find that this layer can lead to
improved performance in synthetic and preliminary language modeling
experiments, adding weight to our hypothesis that mesa-optimization is an
important operation hidden within the weights of trained Transformers.
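
The two steps named in the abstract can be made concrete with a small numerical sketch. The following is a minimal illustration, not the authors' code: the toy in-context regression setup, the shapes, the learning rate `eta`, and starting the inner weights at zero are all illustrative assumptions. It checks that one gradient-descent step on an internal least-squares objective yields exactly the prediction of an unnormalized linear self-attention layer, which is the kind of forward-pass, gradient-based optimization the reverse-engineering described above points to.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's code): one gradient step on an
# internal least-squares objective L(W) = 1/2 * sum_i ||W x_i - y_i||^2,
# started from W = 0, coincides with an unnormalized linear attention readout.

rng = np.random.default_rng(0)
d_in, d_out, n_ctx = 4, 2, 8
eta = 0.1                              # illustrative inner learning rate

X = rng.normal(size=(n_ctx, d_in))     # in-context inputs  x_1..x_n
Y = rng.normal(size=(n_ctx, d_out))    # in-context targets y_1..y_n
x_q = rng.normal(size=(d_in,))         # query token

# (i) internal objective: L(W) = 1/2 * sum_i ||W x_i - y_i||^2
# (ii) one gradient step from W = 0 gives W_1 = eta * sum_i y_i x_i^T
W_1 = eta * Y.T @ X
pred_gd = W_1 @ x_q

# The same prediction, written as unnormalized linear attention:
# keys = x_i, values = eta * y_i, query = x_q.
attn_scores = X @ x_q                  # <x_i, x_q> for each context token
pred_attention = eta * Y.T @ attn_scores

assert np.allclose(pred_gd, pred_attention)
print(pred_gd)
```

The mesa-layer proposed in the abstract goes one step further: rather than taking a single implicit gradient step, it explicitly solves the optimization problem specified in context.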