Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
May 10, 2023
Authors: Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, Hartwig Adam
cs.AI
Abstract
We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies of IMP and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse heterogeneous modalities, loss functions, and tasks, while also varying input resolutions, efficiently improves multimodal understanding. 2) Model sparsification with MoE on a single modality-agnostic encoder substantially improves performance, outperforming dense models that use modality-specific encoders or additional fusion layers, and greatly mitigating the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks including image classification, video classification, image-text retrieval, and video-text retrieval.
Most notably, we train a sparse IMP-MoE-L model focusing on video tasks that achieves a new state of the art in zero-shot video classification: 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 68.3% on Kinetics-700, improving the previous state of the art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of its total training computational cost.
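
To make the AGD recipe concrete, below is a minimal sketch, not the authors' implementation, of alternating single-objective gradient updates through one shared encoder. The two toy tasks, module names, and tensor shapes are all illustrative assumptions; IMP alternates over real image/video/text/audio objectives at varying input resolutions.

```python
# Sketch of Alternating Gradient Descent (AGD): each optimization step
# updates the shared encoder with the gradient of ONE task's loss,
# rather than summing all task losses into a joint objective.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Shared modality-agnostic encoder (toy stand-in for IMP's Transformer).
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))

# Hypothetical per-task heads and loss functions (illustrative only).
heads = {"image_cls": nn.Linear(8, 10), "text_match": nn.Linear(8, 1)}
losses = {"image_cls": nn.CrossEntropyLoss(), "text_match": nn.BCEWithLogitsLoss()}

params = list(encoder.parameters()) + [p for h in heads.values() for p in h.parameters()]
opt = torch.optim.SGD(params, lr=1e-2)

def toy_batch(task):
    # Random tensors standing in for image/video/text/audio batches.
    x = torch.randn(4, 16)
    if task == "image_cls":
        return x, torch.randint(0, 10, (4,))
    return x, torch.randint(0, 2, (4, 1)).float()

tasks = ["image_cls", "text_match"]
for step in range(100):
    task = tasks[step % len(tasks)]      # round-robin alternation over tasks
    x, y = toy_batch(task)
    loss = losses[task](heads[task](encoder(x)), y)
    opt.zero_grad()
    loss.backward()                      # gradient of this task's objective only
    opt.step()
```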
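
Similarly, the MoE side can be illustrated with a minimal top-1 token-routing feed-forward layer. This is a generic switch-style MoE sketch, not IMP's exact expert configuration; the expert count and dimensions are assumptions.

```python
# Minimal top-1 (switch-style) Mixture-of-Experts feed-forward layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, dim, num_experts=4, hidden=32):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)   # learned routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                           # x: (num_tokens, dim)
        gate = F.softmax(self.router(x), dim=-1)
        weight, idx = gate.max(dim=-1)              # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                         # tokens routed to expert e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

layer = Top1MoE(dim=8)
print(layer(torch.randn(10, 8)).shape)              # torch.Size([10, 8])
```

Because each token activates only one expert's parameters, capacity grows with the number of experts while per-token compute stays roughly constant, which is what makes MoE sparsification attractive for scaling a single shared encoder.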