One Wide Feedforward is All You Need
September 4, 2023
Authors: Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan
cs.AI
Abstract
The Transformer architecture has two main non-embedding components: Attention
and the Feed Forward Network (FFN). Attention captures interdependencies
between words regardless of their position, while the FFN non-linearly
transforms each input token independently. In this work we explore the role of
the FFN, and find that despite taking up a significant fraction of the model's
parameters, it is highly redundant. Concretely, we are able to substantially
reduce the number of parameters with only a modest drop in accuracy by removing
the FFN on the decoder layers and sharing a single FFN across the encoder.
Finally, we scale this architecture back to its original size by increasing the
hidden dimension of the shared FFN, achieving substantial gains in both
accuracy and latency with respect to the original Transformer Big.
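The parameter arithmetic behind this scheme can be sketched as follows. The dimensions below assume the standard Transformer Big configuration (d_model=1024, d_ff=4096, 6 encoder and 6 decoder layers); they are illustrative assumptions, not figures quoted from the abstract.

```python
# Illustrative FFN parameter accounting for the shared-FFN idea.
# Assumed (not quoted) dimensions: Transformer Big with d_model=1024,
# d_ff=4096, 6 encoder + 6 decoder layers.

def ffn_params(d_model: int, d_ff: int) -> int:
    # An FFN is two linear maps with biases: d_model -> d_ff -> d_model.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

d_model, d_ff, enc_layers, dec_layers = 1024, 4096, 6, 6

# Baseline: a separate FFN in every encoder and every decoder layer.
baseline = (enc_layers + dec_layers) * ffn_params(d_model, d_ff)

# Proposed: drop the decoder FFNs and share one FFN across the encoder.
shared = ffn_params(d_model, d_ff)
print(f"FFN parameters kept: {shared / baseline:.1%}")  # roughly 1/12

# Scale back up: one shared FFN whose hidden dimension is widened by
# the number of FFNs removed, restoring roughly the original budget.
wide = ffn_params(d_model, (enc_layers + dec_layers) * d_ff)
print(f"wide shared FFN vs baseline: {wide / baseline:.3f}")
```

Under these assumptions, the shared FFN keeps only about a twelfth of the baseline FFN parameters, and widening its hidden dimension twelvefold brings the count back to essentially the original size, which is the trade the abstract describes.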