Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

Abstract

We present Integrated Multimodal Perception (IMP), a simple and scalable approach to multimodal multi-task training and modeling. IMP integrates multimodal inputs, including image, video, text, and audio, into a single Transformer encoder with minimal modality-specific components. Its design combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies of IMP and report two key findings: 1) performing gradient descent updates by alternating over diverse heterogeneous modalities, loss functions, and tasks, while also varying input resolutions, efficiently improves multimodal understanding; 2) model sparsification with MoE on a single modality-agnostic encoder substantially improves performance, outperforming dense models that use modality-specific encoders or additional fusion layers, and greatly mitigates conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks, including image classification, video classification, image-text retrieval, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L model focused on video tasks that achieves a new state of the art in zero-shot video classification. Our model reaches zero-shot classification accuracies of 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 76.8% on Kinetics-700, improving on the previous state of the art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of its total training computational cost.
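The alternating update pattern described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the scalar parameter `w`, the two regression tasks, the round-robin schedule, and all function names are assumptions made purely for illustration. The only point it shows is that each step computes the gradient of a single task's loss and applies it to parameters shared by all tasks, and that the tasks may have differently shaped batches.

```python
import itertools


def make_task(xs, ys):
    """Build a gradient function for the mean squared error of y ~ w * x.

    Each task owns its own data and loss; only the parameter w is shared.
    (Hypothetical toy task, standing in for a modality/loss pair.)
    """
    def grad(w):
        return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return grad


def alternating_gradient_descent(task_grads, w0=0.0, lr=0.05, steps=200):
    """Update one shared parameter using one task per step, in round-robin order.

    The round-robin schedule is an assumption of this sketch; the abstract
    only states that updates alternate across tasks.
    """
    w = w0
    schedule = itertools.cycle(task_grads.items())
    history = []
    for _ in range(steps):
        name, grad_fn = next(schedule)   # pick the task for this step
        w -= lr * grad_fn(w)             # gradient step on that task only
        history.append((name, w))
    return w, history


# Two toy tasks with different batch sizes that both agree on w = 2.
tasks = {
    "task_a": make_task([1.0, 2.0], [2.0, 4.0]),
    "task_b": make_task([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
}
w_final, history = alternating_gradient_descent(tasks)
print(w_final)
```

Because no step ever combines losses from different tasks, each task can keep its own input shape and objective, which is the property the abstract attributes to alternating updates.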

