Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
November 7, 2024
Authors: Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin
cs.AI
Abstract
The development of large language models (LLMs) has expanded to multi-modal
systems capable of processing text, images, and speech within a unified
framework. Training these models demands significantly larger datasets and
computational resources compared to text-only LLMs. To address the scaling
challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal
transformer architecture that significantly reduces pretraining computational
costs. MoT decouples non-embedding parameters of the model by modality --
including feed-forward networks, attention matrices, and layer normalization --
enabling modality-specific processing with global self-attention over the full
input sequence. We evaluate MoT across multiple settings and model scales. In
the Chameleon 7B setting (autoregressive text-and-image generation), MoT
matches the dense baseline's performance using only 55.8% of the FLOPs. When
extended to include speech, MoT reaches speech performance comparable to the
dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where
text and image are trained with different objectives, a 7B MoT model matches
the image modality performance of the dense baseline with one third of the
FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image
generation metrics. System profiling further highlights MoT's practical
benefits, achieving dense baseline image quality in 47.2% of the wall-clock
time and text quality in 75.6% of the wall-clock time (measured on AWS
p4de.24xlarge instances with NVIDIA A100 GPUs).
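To make the core idea concrete, the sketch below illustrates (in PyTorch) a single MoT-style block as described in the abstract: the feed-forward network, attention projections, and layer norms are duplicated per modality and selected by a per-token modality index, while self-attention itself is computed globally over the full multi-modal sequence. This is a minimal illustration under assumptions made here, not the authors' implementation; names such as `MoTBlock`, `_route`, and `modality_ids` are placeholders.

```python
# Minimal sketch of a modality-decoupled transformer block (illustrative only;
# not the paper's code). Non-embedding parameters (layer norms, attention
# projections, FFN) exist once per modality; attention is global over the sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoTBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_modalities: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # Modality-specific (decoupled) non-embedding parameters.
        self.norm1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_modalities))
        self.norm2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_modalities))
        self.qkv = nn.ModuleList(nn.Linear(d_model, 3 * d_model) for _ in range(n_modalities))
        self.out = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_modalities))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_modalities)
        )

    def _route(self, modules: nn.ModuleList, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # Route each token through the module of its own modality (sparse activation).
        # x: (seq_len, d_in); modality_ids: (seq_len,) with values in [0, n_modalities).
        pieces, positions = [], []
        for m, module in enumerate(modules):
            mask = modality_ids == m
            if mask.any():
                pieces.append(module(x[mask]))
                positions.append(mask.nonzero(as_tuple=True)[0])
        out = torch.empty(x.shape[0], pieces[0].shape[-1], dtype=x.dtype, device=x.device)
        out[torch.cat(positions)] = torch.cat(pieces)
        return out

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # Modality-specific norm and Q/K/V projections.
        h = self._route(self.norm1, x, modality_ids)
        q, k, v = self._route(self.qkv, h, modality_ids).chunk(3, dim=-1)
        # Global self-attention over the full multi-modal sequence (shared across modalities).
        q = q.view(-1, self.n_heads, self.d_head).transpose(0, 1)
        k = k.view(-1, self.n_heads, self.d_head).transpose(0, 1)
        v = v.view(-1, self.n_heads, self.d_head).transpose(0, 1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(0, 1).reshape(-1, self.n_heads * self.d_head)
        # Modality-specific output projection, layer norm, and feed-forward network.
        x = x + self._route(self.out, attn, modality_ids)
        x = x + self._route(self.ffn, self._route(self.norm2, x, modality_ids), modality_ids)
        return x


# Usage: a 10-token sequence mixing three modalities (e.g. text=0, image=1, speech=2).
block = MoTBlock(d_model=64, n_heads=4, n_modalities=3)
tokens = torch.randn(10, 64)
modality_ids = torch.tensor([0, 0, 1, 1, 1, 2, 2, 0, 1, 2])
out = block(tokens, modality_ids)  # shape (10, 64)
```

Because each token only activates the parameters of its own modality, the per-token FLOPs of such a block match a dense block of the same width, while the parameter count grows with the number of modalities; the abstract's reported savings come from comparing against dense baselines at matched quality, not from this toy routing.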