Gated recurrent neural networks discover attention

Abstract

Recent architectural developments have enabled recurrent neural networks (RNNs) to match and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs share a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers. By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel, and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers. Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
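To make the central claim concrete, the following is a minimal sketch of why a linear recurrence combined with multiplicative interactions can reproduce causal, unnormalized linear self-attention. It is an illustration under stated assumptions, not the construction from the paper: the function names, the identity state transition, and the random weights are all hypothetical choices made here for demonstration.

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def linear_self_attention(xs, Wq, Wk, Wv):
    """Causal, unnormalized linear self-attention computed directly:
    y_t = sum over s <= t of v_s * (k_s . q_t)."""
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    ys = []
    for t, q in enumerate(qs):
        y = [0.0] * len(vs[0])
        for s in range(t + 1):
            score = sum(ki * qi for ki, qi in zip(ks[s], q))
            y = [yi + score * vi for yi, vi in zip(y, vs[s])]
        ys.append(y)
    return ys

def gated_linear_rnn(xs, Wq, Wk, Wv):
    """Same outputs from a recurrence that is linear in its state.
    Assumed (illustrative) form:
      S_t = S_{t-1} + v_t k_t^T   (input enters through a product: gating)
      y_t = S_t q_t               (state is read out through a product: gating)
    """
    d_v, d_k = len(Wv), len(Wk)
    S = [[0.0] * d_k for _ in range(d_v)]  # recurrent state, one entry per (i, j)
    ys = []
    for x in xs:
        q, k, v = matvec(Wq, x), matvec(Wk, x), matvec(Wv, x)
        for i in range(d_v):
            for j in range(d_k):
                S[i][j] += v[i] * k[j]  # linear state update, multiplicative input
        ys.append(matvec(S, q))
    return ys

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d = 3
Wq, Wk, Wv = (random_matrix(d, d, rng) for _ in range(3))
xs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]

direct = linear_self_attention(xs, Wq, Wk, Wv)
recurrent = gated_linear_rnn(xs, Wq, Wk, Wv)

# Largest elementwise difference between the two computations.
max_gap = max(abs(a - b) for ya, yb in zip(direct, recurrent) for a, b in zip(ya, yb))
print(max_gap < 1e-9)
```

The recurrent version never revisits past tokens: everything it needs is held in the state `S`, which grows only by products of the current key and value. This is the sense in which gating plus a linear recurrence is sufficient for linear attention in this simplified setting.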

