Gated recurrent neural networks discover attention
September 4, 2023
Authors: Nicolas Zucchet, Seijin Kobayashi, Yassir Akram, Johannes von Oswald, Maxime Larcher, Angelika Steger, João Sacramento
cs.AI
Abstract
Recent architectural developments have enabled recurrent neural networks
(RNNs) to reach and even surpass the performance of Transformers on certain
sequence modeling tasks. These modern RNNs feature a prominent design pattern:
linear recurrent layers interconnected by feedforward paths with multiplicative
gating. Here, we show how RNNs equipped with these two design elements can
exactly implement (linear) self-attention, the main building block of
Transformers. By reverse-engineering a set of trained RNNs, we find that
gradient descent in practice discovers our construction. In particular, we
examine RNNs trained to solve simple in-context learning tasks on which
Transformers are known to excel and find that gradient descent instills in our
RNNs the same attention-based in-context learning algorithm used by
Transformers. Our findings highlight the importance of multiplicative
interactions in neural networks and suggest that certain RNNs might be
unexpectedly implementing attention under the hood.
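For intuition, the identity behind the claim can be illustrated with a minimal NumPy sketch. This is not the paper's exact construction; the dimensions, toy data, and variable names below are assumptions chosen for illustration. It shows that causal (unnormalized) linear self-attention can be reproduced by a linear recurrence that accumulates outer products of values and keys, combined with a multiplicative readout against the query, i.e. the two design elements named in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and data (illustrative only, not from the paper).
T, d_k, d_v = 6, 4, 3          # sequence length, key/query dim, value dim
Q = rng.normal(size=(T, d_k))  # queries q_t
K = rng.normal(size=(T, d_k))  # keys k_t
V = rng.normal(size=(T, d_v))  # values v_t

# 1) Causal linear self-attention: out_t = sum_{s<=t} (q_t . k_s) v_s
attn_out = np.zeros((T, d_v))
for t in range(T):
    scores = K[: t + 1] @ Q[t]        # unnormalized attention scores
    attn_out[t] = scores @ V[: t + 1]

# 2) The same computation as a recurrent network: a linear recurrence
#    accumulates the outer products v_s k_s^T, and a multiplicative
#    interaction contracts the state with the current query q_t.
S = np.zeros((d_v, d_k))              # recurrent state
rnn_out = np.zeros((T, d_v))
for t in range(T):
    S = S + np.outer(V[t], K[t])      # linear recurrence (identity decay)
    rnn_out[t] = S @ Q[t]             # multiplicative readout with q_t

# Both paths produce identical outputs.
assert np.allclose(attn_out, rnn_out)
print("max abs difference:", np.abs(attn_out - rnn_out).max())
```

The sketch omits everything the paper actually studies (gating parameterizations, trained feedforward paths, and the reverse-engineering of learned solutions); it only demonstrates the algebraic equivalence between linear self-attention and a gated linear recurrence.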