One Wide Feedforward is All You Need

Abstract

The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work, we explore the role of the FFN and find that, despite taking up a significant fraction of the model's parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN on the decoder layers and sharing a single FFN across the encoder. Finally, we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency with respect to the original Transformer Big.
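The parameter arithmetic behind this idea can be illustrated with a minimal sketch. It assumes the Transformer Big dimensions from the original Transformer paper (model dimension 1024, FFN hidden dimension 4096, 6 encoder and 6 decoder layers), which this abstract does not state, and it counts FFN parameters only. The helper names `ffn_params` and `widest_shared_ffn` are hypothetical, and the width computed here is not necessarily the one the authors used.

```python
# Assumed Transformer Big dimensions (not given in the abstract).
D_MODEL = 1024      # model (embedding) dimension
D_FF = 4096         # FFN hidden dimension
ENC_LAYERS = 6      # encoder layers, one FFN each in the baseline
DEC_LAYERS = 6      # decoder layers, one FFN each in the baseline


def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of one FFN: two linear maps (d_model -> d_ff -> d_model) with biases."""
    return 2 * d_model * d_ff + d_ff + d_model


def widest_shared_ffn(d_model: int, budget: int) -> int:
    """Largest hidden dimension whose single FFN still fits within `budget` parameters."""
    return (budget - d_model) // (2 * d_model + 1)


# Baseline: every encoder and decoder layer owns its own FFN.
baseline = (ENC_LAYERS + DEC_LAYERS) * ffn_params(D_MODEL, D_FF)

# Reduced model: decoder FFNs removed, one FFN shared by all encoder layers.
shared = ffn_params(D_MODEL, D_FF)

# Widened model: spend the whole baseline FFN budget on the single shared FFN.
wide_d_ff = widest_shared_ffn(D_MODEL, baseline)

print(f"baseline FFN parameters:      {baseline:,}")
print(f"shared-FFN parameters:        {shared:,}")
print(f"hidden dim at equal budget:   {wide_d_ff:,}")
```

Under these assumptions, the shared configuration keeps one twelfth of the baseline FFN parameters, and the widened configuration reuses that freed budget in a single, much wider layer.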

