Unified Model for Image, Video, Audio and Language Tasks

Abstract

Large Language Models (LLMs) have made the ambitious quest for generalist agents far more than a fantasy. A key hurdle in building such general models is the diversity and heterogeneity of tasks and modalities. A promising solution is unification: supporting a myriad of tasks and modalities within one unified framework. While a few large models trained on massive datasets, such as Flamingo (Alayrac et al., 2022), can support more than two modalities, current small to mid-scale unified models are still limited to two modalities, usually image-text or video-text. The question we ask is: is it possible to efficiently build a unified model that can support all modalities? To answer this, we propose UnIVAL, a step further towards this ambitious goal. Without relying on massive dataset sizes or models with billions of parameters, the ~0.25B-parameter UnIVAL model goes beyond two modalities and unifies text, images, video, and audio in a single model. Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning. UnIVAL shows performance competitive with existing state-of-the-art approaches across image-text and video-text tasks. The feature representations learned from the image-text and video-text modalities allow the model to achieve competitive performance when finetuned on audio-text tasks, despite not being pretrained on audio. Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation of models trained on different multimodal tasks, showing its benefits in particular for out-of-distribution generalization. Finally, we motivate unification by showing the synergy between tasks. The model weights and code are released here: https://github.com/mshukor/UnIVAL.
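
As a rough illustration of what merging by weight interpolation means, the sketch below linearly combines two sets of parameters with a coefficient lam. It is a minimal toy under stated assumptions, not the released UnIVAL code: the function name interpolate_weights, the parameter names, the example values, and the choice lam = 0.5 are all hypothetical, and it assumes both models share the same architecture and parameter names (for example, two finetuned copies of one pretrained model).

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(theta_a: Weights, theta_b: Weights, lam: float) -> Weights:
    """Return lam * theta_a + (1 - lam) * theta_b, parameter by parameter."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if theta_a.keys() != theta_b.keys():
        raise ValueError("both models must share the same parameter names")
    merged: Weights = {}
    for name, values_a in theta_a.items():
        values_b = theta_b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [lam * a + (1.0 - lam) * b for a, b in zip(values_a, values_b)]
    return merged


# Hypothetical weights of two models finetuned on different tasks.
captioning = {"encoder.w": [1.0, 2.0], "decoder.w": [0.0]}
vqa = {"encoder.w": [3.0, 4.0], "decoder.w": [2.0]}

merged = interpolate_weights(captioning, vqa, lam=0.5)
print(merged)
```

With lam = 1.0 the result is the first model unchanged and with lam = 0.0 it is the second, so sweeping lam between the two traces a straight line in weight space between the two finetuned models.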
