Gated recurrent neural networks discover attention
September 4, 2023
Authors: Nicolas Zucchet, Seijin Kobayashi, Yassir Akram, Johannes von Oswald, Maxime Larcher, Angelika Steger, João Sacramento
cs.AI
Abstract
Recent architectural developments have enabled recurrent neural networks
(RNNs) to reach and even surpass the performance of Transformers on certain
sequence modeling tasks. These modern RNNs feature a prominent design pattern:
linear recurrent layers interconnected by feedforward paths with multiplicative
gating. Here, we show how RNNs equipped with these two design elements can
exactly implement (linear) self-attention, the main building block of
Transformers. By reverse-engineering a set of trained RNNs, we find that
gradient descent in practice discovers our construction. In particular, we
examine RNNs trained to solve simple in-context learning tasks on which
Transformers are known to excel and find that gradient descent instills in our
RNNs the same attention-based in-context learning algorithm used by
Transformers. Our findings highlight the importance of multiplicative
interactions in neural networks and suggest that certain RNNs might be
unexpectedly implementing attention under the hood.
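
To make the claimed correspondence concrete, the following is a minimal NumPy sketch, not the authors' exact construction: it contrasts causal linear self-attention (no softmax) with an equivalent formulation as a linear recurrence whose state update and readout rely on multiplicative interactions, the two design elements the abstract highlights. Function names, the identity recurrence, and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def linear_self_attention(Q, K, V):
    """Causal linear self-attention: y_t = (sum_{i<=t} v_i k_i^T) q_t."""
    T = Q.shape[0]
    Y = np.zeros((T, V.shape[1]))
    for t in range(T):
        scores = K[: t + 1] @ Q[t]       # unnormalized attention weights over positions 0..t
        Y[t] = scores @ V[: t + 1]       # weighted sum of values
    return Y

def linear_attention_as_rnn(Q, K, V):
    """The same computation written as a linear recurrence with multiplicative readout."""
    T, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d))               # recurrent state: running sum of outer products v_t k_t^T
    Y = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(V[t], K[t])     # linear state update (identity recurrence plus new outer product)
        Y[t] = S @ Q[t]                  # multiplicative interaction between state and query
    return Y

# Toy check that both formulations agree.
rng = np.random.default_rng(0)
T, d, d_v = 6, 4, 3
Q, K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d_v))
assert np.allclose(linear_self_attention(Q, K, V), linear_attention_as_rnn(Q, K, V))
```

The recurrent version keeps a fixed-size matrix state rather than attending over the full history, which is the sense in which a gated linear RNN can reproduce linear self-attention step by step.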