One Wide Feedforward is All You Need

Abstract

The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work, we explore the role of the FFN and find that, despite taking up a significant fraction of the model's parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN from the decoder layers and sharing a single FFN across the encoder. Finally, we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency with respect to the original Transformer Big.
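The parameter trade-off described in the abstract can be illustrated with a rough count. The sketch below is not taken from the paper. It assumes the usual Transformer Big shape (model dimension 1024, FFN hidden dimension 4096, 6 encoder and 6 decoder layers) and a two-layer FFN with biases. The helper names `ffn_params` and `widest_hidden_dim` are hypothetical.

```python
# Rough FFN parameter accounting. All dimensions are assumptions, not
# values quoted from the abstract.

D_MODEL = 1024       # assumed model dimension (Transformer Big)
D_FF = 4096          # assumed FFN hidden dimension (Transformer Big)
ENCODER_LAYERS = 6   # assumed encoder depth
DECODER_LAYERS = 6   # assumed decoder depth


def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of one FFN: two linear maps plus their biases."""
    up_projection = d_model * d_ff + d_ff
    down_projection = d_ff * d_model + d_model
    return up_projection + down_projection


def widest_hidden_dim(d_model: int, budget: int) -> int:
    """Largest hidden dimension whose single FFN fits within `budget`."""
    # ffn_params(d_model, d_ff) = d_ff * (2 * d_model + 1) + d_model
    return (budget - d_model) // (2 * d_model + 1)


# Baseline: every encoder and decoder layer owns its own FFN.
baseline = (ENCODER_LAYERS + DECODER_LAYERS) * ffn_params(D_MODEL, D_FF)

# Reduced model: no decoder FFN, one FFN shared by all encoder layers.
shared = ffn_params(D_MODEL, D_FF)

# Scaled-back model: widen the single shared FFN up to the baseline budget.
wide_d_ff = widest_hidden_dim(D_MODEL, baseline)

print(f"baseline FFN parameters: {baseline:,}")
print(f"shared FFN parameters:   {shared:,}")
print(f"widest shared d_ff:      {wide_d_ff:,}")
```

Under these assumptions, sharing one FFN keeps one twelfth of the baseline FFN parameters. Widening its hidden dimension restores the budget without adding layers.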