Gated recurrent neural networks discover attention
September 4, 2023
Authors: Nicolas Zucchet, Seijin Kobayashi, Yassir Akram, Johannes von Oswald, Maxime Larcher, Angelika Steger, João Sacramento
cs.AI
Abstract
Recent architectural developments have enabled recurrent neural networks
(RNNs) to reach and even surpass the performance of Transformers on certain
sequence modeling tasks. These modern RNNs feature a prominent design pattern:
linear recurrent layers interconnected by feedforward paths with multiplicative
gating. Here, we show how RNNs equipped with these two design elements can
exactly implement (linear) self-attention, the main building block of
Transformers. By reverse-engineering a set of trained RNNs, we find that
gradient descent in practice discovers our construction. In particular, we
examine RNNs trained to solve simple in-context learning tasks on which
Transformers are known to excel and find that gradient descent instills in our
RNNs the same attention-based in-context learning algorithm used by
Transformers. Our findings highlight the importance of multiplicative
interactions in neural networks and suggest that certain RNNs might be
unexpectedly implementing attention under the hood.
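
To make the claimed correspondence concrete, the following is a minimal NumPy sketch, not the authors' exact construction: it contrasts causal linear self-attention (no softmax) with an equivalent formulation as a linear recurrence whose state update and readout rely on multiplicative interactions, the two design elements the abstract highlights. Function names, the identity recurrence, and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def linear_self_attention(Q, K, V):
    """Causal linear self-attention: y_t = (sum_{i<=t} v_i k_i^T) q_t."""
    T = Q.shape[0]
    Y = np.zeros((T, V.shape[1]))
    for t in range(T):
        scores = K[: t + 1] @ Q[t]       # unnormalized attention weights over positions 0..t
        Y[t] = scores @ V[: t + 1]       # weighted sum of values
    return Y

def linear_attention_as_rnn(Q, K, V):
    """The same computation written as a linear recurrence with multiplicative readout."""
    T, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d))               # recurrent state: running sum of outer products v_t k_t^T
    Y = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(V[t], K[t])     # linear state update (identity recurrence plus new outer product)
        Y[t] = S @ Q[t]                  # multiplicative interaction between state and query
    return Y

# Toy check that both formulations agree.
rng = np.random.default_rng(0)
T, d, d_v = 6, 4, 3
Q, K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d_v))
assert np.allclose(linear_self_attention(Q, K, V), linear_attention_as_rnn(Q, K, V))
```

The recurrent version keeps a fixed-size matrix state rather than attending over the full history, which is the sense in which a gated linear RNN can reproduce linear self-attention step by step.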