Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
November 7, 2024
Authors: Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin
cs.AI
Abstract
The development of large language models (LLMs) has expanded to multi-modal
systems capable of processing text, images, and speech within a unified
framework. Training these models demands significantly larger datasets and
computational resources compared to text-only LLMs. To address the scaling
challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal
transformer architecture that significantly reduces pretraining computational
costs. MoT decouples non-embedding parameters of the model by modality --
including feed-forward networks, attention matrices, and layer normalization --
enabling modality-specific processing with global self-attention over the full
input sequence. We evaluate MoT across multiple settings and model scales. In
the Chameleon 7B setting (autoregressive text-and-image generation), MoT
matches the dense baseline's performance using only 55.8% of the FLOPs. When
extended to include speech, MoT reaches speech performance comparable to the
dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where
text and image are trained with different objectives, a 7B MoT model matches
the image modality performance of the dense baseline with one third of the
FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image
generation metrics. System profiling further highlights MoT's practical
benefits, achieving dense baseline image quality in 47.2% of the wall-clock
time and text quality in 75.6% of the wall-clock time (measured on AWS
p4de.24xlarge instances with NVIDIA A100 GPUs).
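To make the core idea concrete, the sketch below illustrates (in PyTorch) a single MoT-style block as described in the abstract: the feed-forward network, attention projections, and layer norms are duplicated per modality and selected by a per-token modality index, while self-attention itself is computed globally over the full multi-modal sequence. This is a minimal illustration under assumptions made here, not the authors' implementation; names such as `MoTBlock`, `_route`, and `modality_ids` are placeholders.

```python
# Minimal sketch of a modality-decoupled transformer block (illustrative only;
# not the paper's code). Non-embedding parameters (layer norms, attention
# projections, FFN) exist once per modality; attention is global over the sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoTBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_modalities: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # Modality-specific (decoupled) non-embedding parameters.
        self.norm1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_modalities))
        self.norm2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_modalities))
        self.qkv = nn.ModuleList(nn.Linear(d_model, 3 * d_model) for _ in range(n_modalities))
        self.out = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_modalities))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_modalities)
        )

    def _route(self, modules: nn.ModuleList, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # Route each token through the module of its own modality (sparse activation).
        # x: (seq_len, d_in); modality_ids: (seq_len,) with values in [0, n_modalities).
        pieces, positions = [], []
        for m, module in enumerate(modules):
            mask = modality_ids == m
            if mask.any():
                pieces.append(module(x[mask]))
                positions.append(mask.nonzero(as_tuple=True)[0])
        out = torch.empty(x.shape[0], pieces[0].shape[-1], dtype=x.dtype, device=x.device)
        out[torch.cat(positions)] = torch.cat(pieces)
        return out

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # Modality-specific norm and Q/K/V projections.
        h = self._route(self.norm1, x, modality_ids)
        q, k, v = self._route(self.qkv, h, modality_ids).chunk(3, dim=-1)
        # Global self-attention over the full multi-modal sequence (shared across modalities).
        q = q.view(-1, self.n_heads, self.d_head).transpose(0, 1)
        k = k.view(-1, self.n_heads, self.d_head).transpose(0, 1)
        v = v.view(-1, self.n_heads, self.d_head).transpose(0, 1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(0, 1).reshape(-1, self.n_heads * self.d_head)
        # Modality-specific output projection, layer norm, and feed-forward network.
        x = x + self._route(self.out, attn, modality_ids)
        x = x + self._route(self.ffn, self._route(self.norm2, x, modality_ids), modality_ids)
        return x


# Usage: a 10-token sequence mixing three modalities (e.g. text=0, image=1, speech=2).
block = MoTBlock(d_model=64, n_heads=4, n_modalities=3)
tokens = torch.randn(10, 64)
modality_ids = torch.tensor([0, 0, 1, 1, 1, 2, 2, 0, 1, 2])
out = block(tokens, modality_ids)  # shape (10, 64)
```

Because each token only activates the parameters of its own modality, the per-token FLOPs of such a block match a dense block of the same width, while the parameter count grows with the number of modalities; the abstract's reported savings come from comparing against dense baselines at matched quality, not from this toy routing.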