Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
May 10, 2023
Authors: Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, Hartwig Adam
cs.AI
Abstract
We present Integrated Multimodal Perception (IMP), a simple and scalable
multimodal multi-task training and modeling approach. IMP integrates multimodal
inputs including image, video, text, and audio into a single Transformer
encoder with minimal modality-specific components. IMP makes use of a novel
design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts
(MoE) for efficient model and task scaling. We conduct extensive empirical
studies of IMP and reveal the following key insights: 1) performing gradient
descent updates by alternating across diverse heterogeneous modalities, loss
functions, and tasks, while also varying input resolutions, efficiently
improves multimodal understanding; 2) model sparsification with MoE on a single
modality-agnostic encoder substantially improves performance, outperforming
dense models that use modality-specific encoders or additional fusion layers
while greatly mitigating conflicts between modalities. IMP achieves
competitive performance on a wide range of downstream tasks including image
classification, video classification, image-text, and video-text retrieval.
Most notably, we train a sparse IMP-MoE-L model focused on video tasks that
achieves a new state of the art in zero-shot video classification. Our model achieves
77.0% on Kinetics-400, 76.8% on Kinetics-600, and 76.8% on Kinetics-700
zero-shot classification accuracy, improving the previous state-of-the-art by
+5%, +6.7%, and +5.8%, respectively, while using only 15% of their total
training computational cost.
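
To make insight 1 concrete, the following is a minimal sketch of an AGD-style training loop: each optimization step computes and backpropagates the loss of a single task, alternating round-robin across heterogeneous tasks rather than summing all losses per step. The encoder, task names, toy losses, and `sample_batch` generator here are illustrative assumptions, not the paper's actual architecture or objectives.

```python
# Minimal AGD sketch: one task's loss per step, alternating across tasks.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))
heads = nn.ModuleDict({
    "image_cls": nn.Linear(128, 10),     # toy classification head
    "video_text": nn.Linear(128, 128),   # toy retrieval projection head
})
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(heads.parameters()), lr=1e-4
)

def sample_batch(task):
    # Hypothetical loader; real AGD would also vary input resolution here.
    x = torch.randn(8, 64)
    if task == "image_cls":
        return {"inputs": x, "labels": torch.randint(0, 10, (8,))}
    return {"inputs": x, "targets": torch.randn(8, 128)}

def task_loss(task, batch):
    feats = encoder(batch["inputs"])
    if task == "image_cls":
        return F.cross_entropy(heads[task](feats), batch["labels"])
    # Toy alignment loss standing in for a contrastive retrieval objective.
    z = F.normalize(heads[task](feats), dim=-1)
    t = F.normalize(batch["targets"], dim=-1)
    return (1.0 - (z * t).sum(-1)).mean()

tasks = ["image_cls", "video_text"]
for step in range(100):
    task = tasks[step % len(tasks)]  # alternate: one modality/task per step
    opt.zero_grad()
    task_loss(task, sample_batch(task)).backward()
    opt.step()
```

Because each step touches only one loss, heterogeneous objectives and input shapes never need to be batched together, which is what lets the schedule mix modalities and resolutions freely.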
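For insight 2, the sketch below shows a generic top-1 token-routing MoE feed-forward block in which tokens from all modalities share a single pool of experts inside one modality-agnostic encoder. The class name, routing rule, and sizes are assumptions; the paper's actual routing and load-balancing details may differ.

```python
# Generic top-1 MoE feed-forward block shared across modalities.
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, dim, num_experts, hidden):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                    # x: (num_tokens, dim)
        gates = self.router(x).softmax(-1)   # routing probabilities per token
        top1 = gates.argmax(-1)              # index of the chosen expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                # Scale by the gate value so the router stays differentiable.
                out[mask] = gates[mask, e].unsqueeze(-1) * expert(x[mask])
        return out

moe = Top1MoE(dim=128, num_experts=4, hidden=256)
tokens = torch.randn(16, 128)  # tokens from any modality share the experts
print(moe(tokens).shape)       # torch.Size([16, 128])
```

Sparsifying the shared encoder this way lets different experts specialize per token without hard-wiring modality-specific encoders or fusion layers, which is the mechanism the abstract credits for mitigating cross-modal conflicts.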