Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

Abstract

We present Integrated Multimodal Perception (IMP), a simple and scalable approach to multimodal multi-task training and modeling. IMP integrates multimodal inputs, including image, video, text, and audio, into a single Transformer encoder with minimal modality-specific components. Its design combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies of IMP and report two key insights: 1) performing gradient descent updates by alternating across diverse heterogeneous modalities, loss functions, and tasks, while also varying input resolutions, efficiently improves multimodal understanding; and 2) sparsifying the model with MoE on a single modality-agnostic encoder substantially improves performance, outperforming dense models that use modality-specific encoders or additional fusion layers and greatly mitigating conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks, including image classification, video classification, image-text retrieval, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L focused on video tasks that sets a new state of the art in zero-shot video classification. Our model achieves zero-shot classification accuracy of 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 76.8% on Kinetics-700, improving on the previous state of the art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of its total training computational cost.
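The alternating update scheme named in the first insight can be illustrated with a toy sketch. This is not the IMP implementation: the quadratic losses, the round-robin schedule, and every name below (`make_quadratic_grad`, `alternating_gradient_descent`, `task_grads`) are assumptions made purely for illustration. It only shows the general idea of applying one task's gradient per step to a shared set of parameters, rather than summing all task losses in every step.

```python
from typing import Callable, List, Sequence

Params = List[float]
GradFn = Callable[[Params], Params]


def make_quadratic_grad(target: Sequence[float]) -> GradFn:
    """Return the gradient of the toy loss 0.5 * ||params - target||^2.

    Hypothetical stand-in for a per-task loss (e.g. one modality or objective).
    """
    def grad(params: Params) -> Params:
        return [p - t for p, t in zip(params, target)]
    return grad


def alternating_gradient_descent(
    params: Sequence[float],
    task_grads: Sequence[GradFn],
    num_steps: int,
    lr: float,
) -> Params:
    """Update shared parameters using exactly one task's gradient per step."""
    params = list(params)
    for step in range(num_steps):
        # Assumed schedule: simple round-robin over the tasks.
        grad_fn = task_grads[step % len(task_grads)]
        grads = grad_fn(params)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params


# Two toy "tasks" that pull the shared parameters toward different targets.
task_grads = [
    make_quadratic_grad([1.0, 0.0]),
    make_quadratic_grad([0.0, 1.0]),
]
final_params = alternating_gradient_descent(
    [0.0, 0.0], task_grads, num_steps=200, lr=0.1
)
print(final_params)
```

In this toy setting the shared parameters settle between the two task targets, which is the compromise one would expect when tasks take turns updating the same weights. In a real multimodal setup each step would instead draw a batch from a different dataset, modality, or objective.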
