MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections

Abstract

We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches, which use static and shared connection weights, MUDD generates connection weights dynamically, depending on the hidden state at each sequence position and separately for each decoupled input stream (the query, key, value, or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, matching the performance of Transformers trained with 1.8x to 2.4x the compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining perplexity and on downstream tasks, and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code in JAX and PyTorch and pre-trained models are available at https://github.com/Caiyun-AI/MUDDFormer.
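The idea of per-position, per-stream connection weights can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The abstract does not say how the weights are computed, so the single linear projection per stream used here is an assumption, and the names `init_weight_generators` and `dynamic_dense_aggregate` are hypothetical.

```python
import numpy as np

STREAMS = ("query", "key", "value", "residual")


def init_weight_generators(rng, d_model, n_sources):
    """Hypothetical: one projection per stream, mapping a hidden state
    to one mixing weight for each earlier layer output."""
    scale = d_model ** -0.5
    return {
        stream: rng.normal(scale=scale, size=(d_model, n_sources))
        for stream in STREAMS
    }


def dynamic_dense_aggregate(layer_outputs, generators):
    """Mix all earlier layer outputs into one input per stream.

    layer_outputs: list of (seq_len, d_model) arrays, from the embedding
    up to the current layer. The last entry is the current hidden state.
    """
    stacked = np.stack(layer_outputs, axis=0)   # (n_sources, seq_len, d_model)
    current = layer_outputs[-1]                 # (seq_len, d_model)
    mixed = {}
    for stream, proj in generators.items():
        # Assumption: weights are a linear function of the current hidden
        # state, so every sequence position gets its own set of weights.
        weights = current @ proj                # (seq_len, n_sources)
        # Weighted sum over source layers, independently at each position.
        mixed[stream] = np.einsum("tn,ntd->td", weights, stacked)
    return mixed


rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
outputs = [rng.normal(size=(seq_len, d_model)) for _ in range(3)]
gens = init_weight_generators(rng, d_model, n_sources=len(outputs))
streams = dynamic_dense_aggregate(outputs, gens)
```

Each entry of `streams` has the same shape as a single layer output, so in this sketch the next block would read its query, key, value, and residual inputs from four differently mixed views of the earlier layers. A static dense connection would instead use one fixed weight per source layer, shared across positions and streams.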
