Gated recurrent neural networks discover attention

Abstract

Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers. By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers. Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
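The claim that a linear recurrence with multiplicative gating can reproduce linear self-attention can be illustrated with a small numerical check. The sketch below is not the paper's construction. It shows only the standard identity behind it, under these assumptions: attention is causal and unnormalized (no softmax), the recurrent state is a matrix updated with an identity recurrence, and the function names `causal_linear_attention` and `gated_linear_recurrence` are hypothetical, chosen for this illustration.

```python
import numpy as np


def causal_linear_attention(Q, K, V):
    """Direct form: y_t = sum over s <= t of (q_t . k_s) * v_s, without softmax."""
    scores = Q @ K.T                      # (T, T) query-key dot products
    mask = np.tril(np.ones_like(scores))  # causal mask: keep only s <= t
    return (scores * mask) @ V            # (T, d_v)


def gated_linear_recurrence(Q, K, V):
    """Recurrent form with multiplicative interactions at input and output.

    Assumed setup for illustration: the state S accumulates value-key
    products linearly, and the output multiplies the state by the query.
    """
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))              # recurrent state
    out = np.zeros((T, d_v))
    for t in range(T):
        # Input "gating": elementwise products v_i * k_j, added to the
        # state by a linear recurrence with identity transition.
        S = S + np.outer(V[t], K[t])
        # Output "gating": the state is multiplied by the current query.
        out[t] = S @ Q[t]
    return out


rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))

# Both forms agree up to floating-point error.
assert np.allclose(causal_linear_attention(Q, K, V), gated_linear_recurrence(Q, K, V))
```

The recurrent form keeps a state of fixed size `d_v * d_k` regardless of sequence length, which is why the products must be formed multiplicatively before they enter the linear recurrence.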
