Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

Abstract

We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs, including image, video, text, and audio, into a single Transformer encoder with minimal modality-specific components. IMP uses a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies of IMP and report the following key insights: 1) performing gradient descent updates by alternating over diverse heterogeneous modalities, loss functions, and tasks, while also varying input resolutions, efficiently improves multimodal understanding; 2) sparsifying a single modality-agnostic encoder with MoE substantially improves performance, outperforming dense models that use modality-specific encoders or additional fusion layers and greatly mitigating conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks, including image classification, video classification, image-text retrieval, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L focused on video tasks that achieves a new state of the art in zero-shot video classification. Our model achieves zero-shot classification accuracy of 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 76.8% on Kinetics-700, improving on the previous state of the art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of their total training computational cost.
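
As a rough illustration of the alternating-update idea described in the abstract, the following toy sketch applies each optimization step to one task only, cycling through the tasks in round-robin order while all tasks share the same parameters. The task names, the quadratic losses, the learning rate, and the round-robin schedule are all assumptions made for illustration; the abstract does not specify IMP's actual schedule, and the sketch leaves out the varying input resolutions and the MoE encoder.

```python
# Toy sketch of alternating gradient descent (AGD). Hypothetical setup:
# two tasks share one parameter vector, and every step uses the gradient
# of exactly one task instead of a sum over all tasks.

def make_quadratic_task(target):
    """Return (loss_fn, grad_fn) for the loss 0.5 * ||w - target||^2."""
    def loss_fn(w):
        return 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))

    def grad_fn(w):
        return [wi - ti for wi, ti in zip(w, target)]

    return loss_fn, grad_fn


def alternating_gradient_descent(tasks, w, lr=0.1, steps=200):
    """Cycle through the tasks; each step updates w with one task's gradient."""
    names = list(tasks)
    for step in range(steps):
        name = names[step % len(names)]  # task selected for this step
        _, grad_fn = tasks[name]
        grad = grad_fn(w)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Hypothetical task names standing in for different modalities and losses.
tasks = {
    "image_text": make_quadratic_task([1.0, 0.0]),
    "video_text": make_quadratic_task([0.0, 1.0]),
}

w0 = [5.0, -5.0]
w_final = alternating_gradient_descent(tasks, w0)

total_before = sum(loss_fn(w0) for loss_fn, _ in tasks.values())
total_after = sum(loss_fn(w_final) for loss_fn, _ in tasks.values())
print(f"summed loss before: {total_before:.3f}, after: {total_after:.3f}")
```

In this toy setting the two tasks pull the shared parameters toward different targets, so the iterate settles into a small cycle between them rather than at a single point. It is only meant to show the mechanics of one task per step, not to reproduce any behavior reported for IMP.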
