Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Abstract

The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources than training text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples the non-embedding parameters of the model by modality (including feed-forward networks, attention matrices, and layer normalization), enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
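The decoupling described in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a single attention head, a ReLU feed-forward network, a plain layer norm with gain only, and causal masking, and every name in it (`mot_layer`, `init_modality_params`, the parameter keys) is hypothetical. It only shows the routing idea: each token uses the layer norm, attention projections, and feed-forward weights of its own modality, while the attention scores are computed over the full mixed-modality sequence.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def layer_norm(x, gain, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) for g, v in zip(gain, x)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_modality_params(d, d_ff, rng):
    """One full set of non-embedding parameters for a single modality."""
    return {
        "ln1": [1.0] * d,
        "ln2": [1.0] * d,
        "wq": rand_matrix(d, d, rng),
        "wk": rand_matrix(d, d, rng),
        "wv": rand_matrix(d, d, rng),
        "wo": rand_matrix(d, d, rng),
        "w1": rand_matrix(d_ff, d, rng),
        "w2": rand_matrix(d, d_ff, rng),
    }


def mot_layer(xs, modalities, params, causal=True):
    """Apply one simplified MoT-style layer to a mixed-modality sequence.

    xs: list of token vectors; modalities: one modality name per token;
    params: dict mapping modality name -> parameter set.
    """
    d = len(xs[0])
    # Step 1: modality-specific layer norm and Q/K/V projections.
    q, k, v = [], [], []
    for x, m in zip(xs, modalities):
        p = params[m]
        h = layer_norm(x, p["ln1"])
        q.append(matvec(p["wq"], h))
        k.append(matvec(p["wk"], h))
        v.append(matvec(p["wv"], h))

    out = []
    for i, (x, m) in enumerate(zip(xs, modalities)):
        p = params[m]
        # Step 2: global self-attention; token i attends across modalities.
        last = i + 1 if causal else len(xs)
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(last)
        ]
        weights = softmax(scores)
        ctx = [sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(d)]
        # Step 3: modality-specific output projection plus residual.
        h = [a + b for a, b in zip(x, matvec(p["wo"], ctx))]
        # Step 4: modality-specific layer norm and feed-forward network.
        hidden = [max(0.0, u) for u in matvec(p["w1"], layer_norm(h, p["ln2"]))]
        out.append([a + b for a, b in zip(h, matvec(p["w2"], hidden))])
    return out


# Usage: a five-token sequence that interleaves text and image tokens.
rng = random.Random(0)
params = {m: init_modality_params(4, 8, rng) for m in ("text", "image")}
xs = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
modalities = ["text", "text", "image", "image", "text"]
ys = mot_layer(xs, modalities, params)
```

In this sketch the per-token compute is the same as in a dense layer of equal width, since every token passes through exactly one parameter set; the sparsity lies in which parameters are activated, not in skipping the attention over the full sequence.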
