One Wide Feedforward is All You Need

Abstract

The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work, we explore the role of the FFN and find that, despite taking up a significant fraction of the model's parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN from the decoder layers and sharing a single FFN across the encoder. Finally, we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency with respect to the original Transformer Big.
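
The parameter trade-off described in the abstract can be illustrated with a rough count. The sketch below is only an illustration under stated assumptions: the abstract gives no layer sizes, so the values used here (model dimension 1024, FFN hidden dimension 4096, 6 encoder and 6 decoder layers) are assumed Transformer Big-style settings. The helper names `ffn_params` and `widest_d_ff` are hypothetical, biases are counted, and attention and embedding parameters are ignored.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of one two-layer FFN (d_model -> d_ff -> d_model), with biases."""
    return 2 * d_model * d_ff + d_ff + d_model


def widest_d_ff(budget: int, d_model: int) -> int:
    """Largest hidden dimension whose single FFN fits in `budget` parameters."""
    # Solve 2*d_model*d_ff + d_ff + d_model <= budget for d_ff.
    return (budget - d_model) // (2 * d_model + 1)


# Assumed sizes, not taken from the abstract.
D_MODEL, D_FF = 1024, 4096
N_ENC, N_DEC = 6, 6

# Baseline: every encoder and decoder layer owns its own FFN.
baseline = (N_ENC + N_DEC) * ffn_params(D_MODEL, D_FF)

# Decoder FFNs removed, one FFN shared by all encoder layers.
shared = ffn_params(D_MODEL, D_FF)

# Widen the single shared FFN until it uses the baseline FFN budget again.
wide_d_ff = widest_d_ff(baseline, D_MODEL)
wide = ffn_params(D_MODEL, wide_d_ff)

print(f"baseline FFN parameters: {baseline:,}")
print(f"shared FFN parameters:   {shared:,}")
print(f"widened hidden dim:      {wide_d_ff:,} ({wide:,} parameters)")
```

Under these assumptions, the single shared FFN holds one twelfth of the baseline FFN parameters, and widening it to the same budget gives a hidden dimension of roughly twelve times the original. The hidden dimension actually used in the paper may differ from this back-of-the-envelope figure.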

