Unified Model for Image, Video, Audio and Language Tasks
July 30, 2023
Authors: Mustafa Shukor, Corentin Dancette, Alexandre Rame, Matthieu Cord
cs.AI
Abstract
Large Language Models (LLMs) have brought the ambitious quest for generalist
agents significantly closer to reality. A key hurdle for building such
general models is the diversity and heterogeneity of tasks and modalities. A
promising solution is unification, allowing the support of a myriad of tasks
and modalities within one unified framework. While a few large models (e.g.,
Flamingo (Alayrac et al., 2022)), trained on massive datasets, can support more
than two modalities, current small- to mid-scale unified models are still
limited to two modalities, usually image-text or video-text. The question we
ask is: is it possible to efficiently build a unified model that can support
all modalities? To answer this, we propose UnIVAL, a step further towards this
ambitious goal. Without relying on fancy dataset sizes or models with billions
of parameters, the ~0.25B-parameter UnIVAL model goes beyond two modalities
and unifies text, images, video, and audio into a single model. Our model is
efficiently pretrained on many tasks, based on task balancing and multimodal
curriculum learning (sketched below). UnIVAL shows performance competitive with
existing state-of-the-art approaches across image- and video-text tasks. The
feature representations learned from image- and video-text modalities allow the
model to achieve competitive performance when finetuned on audio-text tasks, despite
not being pretrained on audio. Thanks to the unified model, we propose a novel
study on multimodal model merging via weight interpolation of models trained on
different multimodal tasks (sketched below), showing its benefits in particular
for out-of-distribution generalization. Finally, we motivate unification by showing
the synergy between tasks. The model weights and code are released here:
https://github.com/mshukor/UnIVAL.
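
The abstract names task balancing and multimodal curriculum learning as the
pretraining recipe. The snippet below is a minimal, hypothetical sketch of what
such a schedule could look like: tasks are sampled in proportion to fixed
weights, and modalities are introduced in stages. All names here
(`balanced_batches`, the loaders, the weights) are illustrative assumptions,
not the actual UnIVAL training code, which lives in the released repository.

```python
import random

def balanced_batches(tasks, steps):
    """Yield (task_name, batch) pairs, sampling each task in proportion
    to a user-chosen weight (task balancing)."""
    names = list(tasks)
    weights = [tasks[n]["weight"] for n in names]
    for _ in range(steps):
        name = random.choices(names, weights=weights, k=1)[0]
        yield name, next(tasks[name]["loader"])

# Hypothetical curriculum: stage 1 uses image-text tasks only; stage 2
# reuses the stage-1 weights and adds video-text tasks.
# stage1 = {"caption": {"loader": img_loader, "weight": 1.0},
#           "vqa":     {"loader": vqa_loader, "weight": 1.0}}
# stage2 = {**stage1, "video_caption": {"loader": vid_loader, "weight": 0.5}}
# for name, batch in balanced_batches(stage2, steps=10_000):
#     loss = model(batch); loss.backward(); optimizer.step()
```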
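
Model merging via weight interpolation is simple enough to sketch directly.
The following is a minimal sketch, assuming standard PyTorch state dicts from
two checkpoints of the same architecture; the checkpoint file names and the
coefficient `lam` are illustrative, and the paper's actual study uses the
released UnIVAL checkpoints.

```python
import torch

def interpolate_state_dicts(sd_a, sd_b, lam=0.5):
    """Parameter-wise linear interpolation of two checkpoints that share
    an architecture: w = (1 - lam) * w_a + lam * w_b."""
    assert sd_a.keys() == sd_b.keys(), "architectures must match"
    return {k: (1.0 - lam) * sd_a[k] + lam * sd_b[k] for k in sd_a}

# Usage (hypothetical file names): merge a captioning checkpoint with a
# VQA checkpoint, then evaluate the merged model, e.g. out of distribution.
# sd = interpolate_state_dicts(torch.load("caption.pt"), torch.load("vqa.pt"))
# model.load_state_dict(sd)
```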