Brainformers: Trading Simplicity for Efficiency

Abstract

Transformers are central to recent successes in natural language processing and computer vision. They have a mostly uniform backbone in which layers alternate between feed-forward and self-attention to build a deep network. Here we investigate this design choice and find that more complex blocks, built from different permutations of layer primitives, can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers: sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms state-of-the-art dense and sparse Transformers in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also achieves a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer significantly outperforms a Primer dense model, derived with NAS and using similar computation per token, on few-shot evaluations.
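To make the contrast between a uniform backbone and a more complex block concrete, the following minimal sketch treats a network as a sequence of named layer primitives. The primitive names, the `stack` helper, and in particular `hypothetical_block` are assumptions made for illustration; the abstract does not give the actual Brainformer layout, so this should not be read as the authors' block.

```python
from collections import Counter
from itertools import cycle, islice

# Layer primitives named in the abstract (string labels are illustrative).
ATTN = "attention"
DENSE_FFN = "dense_ffn"
SPARSE_FFN = "sparse_ffn"  # sparsely gated feed-forward layer


def stack(block, num_layers):
    """Repeat `block` cyclically until the network has `num_layers` layers."""
    return list(islice(cycle(block), num_layers))


# Uniform backbone: self-attention and feed-forward strictly alternate.
vanilla_block = [ATTN, DENSE_FFN]

# Hypothetical non-uniform block, invented for this example only.
# It is NOT the layout reported for Brainformer.
hypothetical_block = [SPARSE_FFN, ATTN, DENSE_FFN, SPARSE_FFN, DENSE_FFN, ATTN]


def primitive_counts(layers):
    """Count how often each primitive appears in a layer sequence."""
    return Counter(layers)


if __name__ == "__main__":
    for name, block in [("vanilla", vanilla_block),
                        ("hypothetical", hypothetical_block)]:
        layers = stack(block, 12)
        print(name, layers)
        print(name, dict(primitive_counts(layers)))
```

Representing a block as a plain list keeps the search space explicit: changing the order or mix of primitives is just a different list, which is the kind of permutation the abstract says can be more efficient than strict alternation.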
