Unified Model for Image, Video, Audio and Language Tasks

Abstract

Large Language Models (LLMs) have made the ambitious quest for generalist agents far less of a fantasy. A key hurdle in building such general models is the diversity and heterogeneity of tasks and modalities. A promising solution is unification: supporting a myriad of tasks and modalities within one unified framework. While a few large models trained on massive datasets (e.g., Flamingo (Alayrac et al., 2022)) can support more than two modalities, current small to mid-scale unified models are still limited to two modalities, usually image-text or video-text. The question we ask is: is it possible to efficiently build a unified model that can support all modalities? To answer this, we propose UnIVAL, a step further towards this ambitious goal. Without relying on massive dataset sizes or models with billions of parameters, the ~0.25B-parameter UnIVAL model goes beyond two modalities and unifies text, images, video, and audio in a single model. Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning. UnIVAL shows performance competitive with existing state-of-the-art approaches across image-text and video-text tasks. The feature representations learned from the image-text and video-text modalities allow the model to achieve competitive performance when finetuned on audio-text tasks, despite not being pretrained on audio. Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation of models trained on different multimodal tasks, showing its benefits, in particular for out-of-distribution generalization. Finally, we motivate unification by showing the synergy between tasks. The model weights and code are released here: https://github.com/mshukor/UnIVAL.
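As a rough illustration of the weight interpolation mentioned in the abstract, the sketch below linearly interpolates the parameters of two models that share the same architecture. It is not taken from the UnIVAL code: the function name interpolate_weights, the dictionary-of-lists representation of the weights, the coefficient lam, and the toy parameter names and values are all assumptions made for illustration.

```python
from typing import Dict, List

# Assumption: each model is represented as a mapping from parameter name
# to a flat list of floats. Real checkpoints would hold tensors instead.
Weights = Dict[str, List[float]]


def interpolate_weights(theta_a: Weights, theta_b: Weights, lam: float) -> Weights:
    """Return lam * theta_a + (1 - lam) * theta_b, parameter by parameter."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if theta_a.keys() != theta_b.keys():
        raise ValueError("both models must share the same parameter names")

    merged: Weights = {}
    for name, a in theta_a.items():
        b = theta_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise linear interpolation between the two weight vectors.
        merged[name] = [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]
    return merged


# Hypothetical toy weights standing in for two models finetuned on
# different multimodal tasks from the same pretrained initialization.
captioning = {"encoder.w": [1.0, 2.0], "decoder.w": [0.0, 4.0]}
vqa = {"encoder.w": [3.0, 0.0], "decoder.w": [2.0, 0.0]}

merged = interpolate_weights(captioning, vqa, lam=0.5)
print(merged)
```

Setting lam to 1.0 or 0.0 recovers one of the two original models, so sweeping lam between these extremes traces the path between the two sets of weights. This only makes sense when both models have identical parameter names and shapes, which is what a single unified architecture provides.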
