Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
May 10, 2023
Authors: Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, Hartwig Adam
cs.AI
Abstract
We present Integrated Multimodal Perception (IMP), a simple and scalable
multimodal multi-task training and modeling approach. IMP integrates multimodal
inputs including image, video, text, and audio into a single Transformer
encoder with minimal modality-specific components. IMP makes use of a novel
design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts
(MoE) for efficient model and task scaling. We conduct extensive empirical
studies of IMP and reveal the following key insights: 1) performing gradient
descent updates by alternating across diverse heterogeneous modalities, loss
functions, and tasks, while also varying input resolutions, efficiently
improves multimodal understanding; 2) model sparsification with MoE on a single
modality-agnostic encoder substantially improves performance, outperforming
dense models that use modality-specific encoders or additional fusion layers
while greatly mitigating conflicts between modalities. IMP achieves
competitive performance on a wide range of downstream tasks including image
classification, video classification, image-text, and video-text retrieval.
Most notably, we train a sparse IMP-MoE-L model focused on video tasks that
achieves a new state of the art in zero-shot video classification. Our model achieves
77.0% on Kinetics-400, 76.8% on Kinetics-600, and 76.8% on Kinetics-700
zero-shot classification accuracy, improving the previous state-of-the-art by
+5%, +6.7%, and +5.8%, respectively, while using only 15% of their total
training computational cost.
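
To make insight 1 concrete, the following is a minimal sketch of an AGD-style training loop: each optimization step computes and backpropagates the loss of a single task, alternating round-robin across heterogeneous tasks rather than summing all losses per step. The encoder, task names, toy losses, and `sample_batch` generator here are illustrative assumptions, not the paper's actual architecture or objectives.

```python
# Minimal AGD sketch: one task's loss per step, alternating across tasks.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))
heads = nn.ModuleDict({
    "image_cls": nn.Linear(128, 10),     # toy classification head
    "video_text": nn.Linear(128, 128),   # toy retrieval projection head
})
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(heads.parameters()), lr=1e-4
)

def sample_batch(task):
    # Hypothetical loader; real AGD would also vary input resolution here.
    x = torch.randn(8, 64)
    if task == "image_cls":
        return {"inputs": x, "labels": torch.randint(0, 10, (8,))}
    return {"inputs": x, "targets": torch.randn(8, 128)}

def task_loss(task, batch):
    feats = encoder(batch["inputs"])
    if task == "image_cls":
        return F.cross_entropy(heads[task](feats), batch["labels"])
    # Toy alignment loss standing in for a contrastive retrieval objective.
    z = F.normalize(heads[task](feats), dim=-1)
    t = F.normalize(batch["targets"], dim=-1)
    return (1.0 - (z * t).sum(-1)).mean()

tasks = ["image_cls", "video_text"]
for step in range(100):
    task = tasks[step % len(tasks)]  # alternate: one modality/task per step
    opt.zero_grad()
    task_loss(task, sample_batch(task)).backward()
    opt.step()
```

Because each step touches only one loss, heterogeneous objectives and input shapes never need to be batched together, which is what lets the schedule mix modalities and resolutions freely.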
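For insight 2, the sketch below shows a generic top-1 token-routing MoE feed-forward block in which tokens from all modalities share a single pool of experts inside one modality-agnostic encoder. The class name, routing rule, and sizes are assumptions; the paper's actual routing and load-balancing details may differ.

```python
# Generic top-1 MoE feed-forward block shared across modalities.
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, dim, num_experts, hidden):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                    # x: (num_tokens, dim)
        gates = self.router(x).softmax(-1)   # routing probabilities per token
        top1 = gates.argmax(-1)              # index of the chosen expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                # Scale by the gate value so the router stays differentiable.
                out[mask] = gates[mask, e].unsqueeze(-1) * expert(x[mask])
        return out

moe = Top1MoE(dim=128, num_experts=4, hidden=256)
tokens = torch.randn(16, 128)  # tokens from any modality share the experts
print(moe(tokens).shape)       # torch.Size([16, 128])
```

Sparsifying the shared encoder this way lets different experts specialize per token without hard-wiring modality-specific encoders or fusion layers, which is the mechanism the abstract credits for mitigating cross-modal conflicts.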