Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
May 10, 2023
Authors: Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, Hartwig Adam
cs.AI
Abstract
We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies of IMP and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse heterogeneous modalities, loss functions, and tasks, while also varying input resolutions, efficiently improves multimodal understanding. 2) Model sparsification with MoE on a single modality-agnostic encoder substantially improves performance, outperforming dense models that use modality-specific encoders or additional fusion layers, and greatly mitigating the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks including image classification, video classification, image-text retrieval, and video-text retrieval.
Most notably, we train a sparse IMP-MoE-L model focusing on video tasks that achieves a new state of the art in zero-shot video classification: 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 68.3% on Kinetics-700, improving the previous state of the art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of its total training computational cost.
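
To make the AGD recipe concrete, below is a minimal sketch, not the authors' implementation, of alternating single-objective gradient updates through one shared encoder. The two toy tasks, module names, and tensor shapes are all illustrative assumptions; IMP alternates over real image/video/text/audio objectives at varying input resolutions.

```python
# Sketch of Alternating Gradient Descent (AGD): each optimization step
# updates the shared encoder with the gradient of ONE task's loss,
# rather than summing all task losses into a joint objective.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Shared modality-agnostic encoder (toy stand-in for IMP's Transformer).
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))

# Hypothetical per-task heads and loss functions (illustrative only).
heads = {"image_cls": nn.Linear(8, 10), "text_match": nn.Linear(8, 1)}
losses = {"image_cls": nn.CrossEntropyLoss(), "text_match": nn.BCEWithLogitsLoss()}

params = list(encoder.parameters()) + [p for h in heads.values() for p in h.parameters()]
opt = torch.optim.SGD(params, lr=1e-2)

def toy_batch(task):
    # Random tensors standing in for image/video/text/audio batches.
    x = torch.randn(4, 16)
    if task == "image_cls":
        return x, torch.randint(0, 10, (4,))
    return x, torch.randint(0, 2, (4, 1)).float()

tasks = ["image_cls", "text_match"]
for step in range(100):
    task = tasks[step % len(tasks)]      # round-robin alternation over tasks
    x, y = toy_batch(task)
    loss = losses[task](heads[task](encoder(x)), y)
    opt.zero_grad()
    loss.backward()                      # gradient of this task's objective only
    opt.step()
```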
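
Similarly, the MoE side can be illustrated with a minimal top-1 token-routing feed-forward layer. This is a generic switch-style MoE sketch, not IMP's exact expert configuration; the expert count and dimensions are assumptions.

```python
# Minimal top-1 (switch-style) Mixture-of-Experts feed-forward layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, dim, num_experts=4, hidden=32):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)   # learned routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                           # x: (num_tokens, dim)
        gate = F.softmax(self.router(x), dim=-1)
        weight, idx = gate.max(dim=-1)              # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                         # tokens routed to expert e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

layer = Top1MoE(dim=8)
print(layer(torch.randn(10, 8)).shape)              # torch.Size([10, 8])
```

Because each token activates only one expert's parameters, capacity grows with the number of experts while per-token compute stays roughly constant, which is what makes MoE sparsification attractive for scaling a single shared encoder.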