Gated recurrent neural networks discover attention
September 4, 2023
Authors: Nicolas Zucchet, Seijin Kobayashi, Yassir Akram, Johannes von Oswald, Maxime Larcher, Angelika Steger, João Sacramento
cs.AI
Abstract
Recent architectural developments have enabled recurrent neural networks
(RNNs) to reach and even surpass the performance of Transformers on certain
sequence modeling tasks. These modern RNNs feature a prominent design pattern:
linear recurrent layers interconnected by feedforward paths with multiplicative
gating. Here, we show how RNNs equipped with these two design elements can
exactly implement (linear) self-attention, the main building block of
Transformers. By reverse-engineering a set of trained RNNs, we find that
gradient descent in practice discovers our construction. In particular, we
examine RNNs trained to solve simple in-context learning tasks on which
Transformers are known to excel and find that gradient descent instills in our
RNNs the same attention-based in-context learning algorithm used by
Transformers. Our findings highlight the importance of multiplicative
interactions in neural networks and suggest that certain RNNs might be
unexpectedly implementing attention under the hood.
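For intuition, the identity behind the claim can be illustrated with a minimal NumPy sketch. This is not the paper's exact construction; the dimensions, toy data, and variable names below are assumptions chosen for illustration. It shows that causal (unnormalized) linear self-attention can be reproduced by a linear recurrence that accumulates outer products of values and keys, combined with a multiplicative readout against the query, i.e. the two design elements named in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and data (illustrative only, not from the paper).
T, d_k, d_v = 6, 4, 3          # sequence length, key/query dim, value dim
Q = rng.normal(size=(T, d_k))  # queries q_t
K = rng.normal(size=(T, d_k))  # keys k_t
V = rng.normal(size=(T, d_v))  # values v_t

# 1) Causal linear self-attention: out_t = sum_{s<=t} (q_t . k_s) v_s
attn_out = np.zeros((T, d_v))
for t in range(T):
    scores = K[: t + 1] @ Q[t]        # unnormalized attention scores
    attn_out[t] = scores @ V[: t + 1]

# 2) The same computation as a recurrent network: a linear recurrence
#    accumulates the outer products v_s k_s^T, and a multiplicative
#    interaction contracts the state with the current query q_t.
S = np.zeros((d_v, d_k))              # recurrent state
rnn_out = np.zeros((T, d_v))
for t in range(T):
    S = S + np.outer(V[t], K[t])      # linear recurrence (identity decay)
    rnn_out[t] = S @ Q[t]             # multiplicative readout with q_t

# Both paths produce identical outputs.
assert np.allclose(attn_out, rnn_out)
print("max abs difference:", np.abs(attn_out - rnn_out).max())
```

The sketch omits everything the paper actually studies (gating parameterizations, trained feedforward paths, and the reverse-engineering of learned solutions); it only demonstrates the algebraic equivalence between linear self-attention and a gated linear recurrence.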