Unified Model for Image, Video, Audio and Language Tasks
July 30, 2023
Authors: Mustafa Shukor, Corentin Dancette, Alexandre Rame, Matthieu Cord
cs.AI
Abstract
Large Language Models (LLMs) have made the ambitious quest for generalist
agents far more than a fantasy. A key hurdle for building such
general models is the diversity and heterogeneity of tasks and modalities. A
promising solution is unification, allowing the support of a myriad of tasks
and modalities within one unified framework. While a few large models (e.g.,
Flamingo (Alayrac et al., 2022)), trained on massive datasets, can support more
than two modalities, current small- to mid-scale unified models are still
limited to two modalities, usually image-text or video-text. The question we
ask is: is it possible to efficiently build a unified model that can support
all modalities? To answer this, we propose UnIVAL, a step further towards this
ambitious goal. Without relying on fancy dataset sizes or models with billions
of parameters, the ~0.25B-parameter UnIVAL model goes beyond two modalities
and unifies text, images, video, and audio into a single model. Our model is
efficiently pretrained on many tasks, based on task balancing and multimodal
curriculum learning. UnIVAL shows performance competitive with existing
state-of-the-art approaches across image-text and video-text tasks. The feature
representations learned from image-text and video-text modalities allow the model
to achieve competitive performance when finetuned on audio-text tasks, despite
not being pretrained on audio. Thanks to the unified model, we propose a novel
study on multimodal model merging via weight interpolation of models trained on
different multimodal tasks, showing its benefits, in particular for
out-of-distribution generalization. Finally, we motivate unification by showing
the synergy between tasks. The model weights and code are released here:
https://github.com/mshukor/UnIVAL.
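
To illustrate the weight-interpolation idea behind the multimodal model merging study, the sketch below linearly interpolates two PyTorch state dicts from checkpoints that share the same architecture. This is a minimal sketch, not the authors' implementation: the function name, the `alpha` coefficient, and the checkpoint paths in the usage comments are illustrative and do not come from the UnIVAL codebase.

```python
# Minimal sketch of weight interpolation between two finetuned checkpoints
# with identical architectures (assumed API; not from the UnIVAL repository).
import torch


def interpolate_weights(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Return a merged state dict: (1 - alpha) * state_a + alpha * state_b."""
    merged = {}
    for name, param_a in state_a.items():
        param_b = state_b[name]
        if torch.is_floating_point(param_a):
            merged[name] = (1.0 - alpha) * param_a + alpha * param_b
        else:
            # Copy non-float buffers (e.g., integer step counters) unchanged.
            merged[name] = param_a
    return merged


# Hypothetical usage with two checkpoints finetuned on different multimodal tasks:
# state_a = torch.load("captioning_checkpoint.pt")   # illustrative path
# state_b = torch.load("vqa_checkpoint.pt")          # illustrative path
# model.load_state_dict(interpolate_weights(state_a, state_b, alpha=0.5))
```

Sweeping `alpha` between 0 and 1 traces the interpolation path between the two finetuned models; the abstract reports that such merged weights are particularly helpful for out-of-distribution generalization.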