One Wide Feedforward is All You Need
September 4, 2023
Authors: Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan
cs.AI
Abstract
The Transformer architecture has two main non-embedding components: Attention
and the Feed Forward Network (FFN). Attention captures interdependencies
between words regardless of their position, while the FFN non-linearly
transforms each input token independently. In this work we explore the role of
the FFN, and find that despite taking up a significant fraction of the model's
parameters, it is highly redundant. Concretely, we are able to substantially
reduce the number of parameters with only a modest drop in accuracy by removing
the FFN on the decoder layers and sharing a single FFN across the encoder.
Finally we scale this architecture back to its original size by increasing the
hidden dimension of the shared FFN, achieving substantial gains in both
accuracy and latency with respect to the original Transformer Big.
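The sketch below illustrates the architectural change the abstract describes, under my own assumptions rather than the authors' code: a single FFN module whose parameters are shared by every encoder layer, decoder layers that keep only self- and cross-attention, and a widened hidden dimension for the shared FFN to restore the parameter budget. All class names, dimensions, and the PyTorch framing are illustrative.

```python
# Minimal sketch (not the paper's implementation) of one shared, widened FFN
# across encoder layers and FFN-free decoder layers. Sizes are illustrative.
import torch
import torch.nn as nn


class SharedFFN(nn.Module):
    """Position-wise feed-forward block reused by every encoder layer."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class EncoderLayer(nn.Module):
    """Self-attention layer that borrows an externally shared FFN."""

    def __init__(self, d_model: int, n_heads: int, shared_ffn: SharedFFN):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = shared_ffn  # same module object in every layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))


class AttentionOnlyDecoderLayer(nn.Module):
    """Decoder layer with self- and cross-attention but no FFN."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, y: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        self_out, _ = self.self_attn(y, y, y)
        y = self.norm1(y + self_out)
        cross_out, _ = self.cross_attn(y, memory, memory)
        return self.norm2(y + cross_out)


# One wide shared FFN: the widened d_ff (here 6x a standard 4096, purely
# illustrative) compensates for the parameters removed by sharing/deletion.
d_model, n_heads, n_layers, wide_d_ff = 512, 8, 6, 4096 * 6
shared_ffn = SharedFFN(d_model, wide_d_ff)
encoder = nn.ModuleList([EncoderLayer(d_model, n_heads, shared_ffn) for _ in range(n_layers)])
decoder = nn.ModuleList([AttentionOnlyDecoderLayer(d_model, n_heads) for _ in range(n_layers)])

src = torch.randn(2, 10, d_model)
tgt = torch.randn(2, 7, d_model)
memory = src
for layer in encoder:
    memory = layer(memory)
out = tgt
for layer in decoder:
    out = layer(out, memory)
print(out.shape)  # torch.Size([2, 7, 512])
```

Because every `EncoderLayer` holds a reference to the same `SharedFFN` instance, its weights are counted once and receive gradients from all layers, which is what makes trading depth-wise FFN redundancy for a single wider FFN possible in this sketch.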