Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Abstract

We present Unified-IO 2, the first autoregressive multimodal model capable of understanding and generating images, text, audio, and actions. To unify these modalities, we tokenize inputs and outputs (images, text, audio, actions, bounding boxes, etc.) into a shared semantic space and process them with a single encoder-decoder transformer model. Because training on such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus drawn from diverse sources, using a multimodal mixture-of-denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct an ensemble of 120 datasets with prompts and augmentations and fine-tune on it. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results on more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.
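As a rough illustration of what tokenizing a non-text output such as a bounding box into a shared token space can look like, the sketch below quantizes box coordinates into discrete "location" ids placed after the text vocabulary. The abstract does not specify the actual scheme: the vocabulary size, the number of bins, and the function names here are hypothetical and chosen only for illustration.

```python
# Hypothetical vocabulary layout (an assumption, not taken from the paper):
# ids [0, TEXT_VOCAB_SIZE) are text tokens, followed by NUM_LOCATION_BINS location tokens.
TEXT_VOCAB_SIZE = 32000
NUM_LOCATION_BINS = 1000


def box_to_tokens(box, image_width, image_height):
    """Quantize a pixel box (x_min, y_min, x_max, y_max) into four location-token ids."""
    x_min, y_min, x_max, y_max = box
    normalized = (
        x_min / image_width,
        y_min / image_height,
        x_max / image_width,
        y_max / image_height,
    )
    tokens = []
    for value in normalized:
        clamped = min(max(value, 0.0), 1.0)  # keep coordinates inside the image
        bin_index = min(int(clamped * NUM_LOCATION_BINS), NUM_LOCATION_BINS - 1)
        tokens.append(TEXT_VOCAB_SIZE + bin_index)  # offset into the shared id space
    return tokens


def tokens_to_box(tokens, image_width, image_height):
    """Map four location-token ids back to approximate pixel coordinates (bin centers)."""
    centers = [((t - TEXT_VOCAB_SIZE) + 0.5) / NUM_LOCATION_BINS for t in tokens]
    return (
        centers[0] * image_width,
        centers[1] * image_height,
        centers[2] * image_width,
        centers[3] * image_height,
    )


if __name__ == "__main__":
    ids = box_to_tokens((100, 50, 300, 200), image_width=640, image_height=480)
    print(ids)
    print(tokens_to_box(ids, image_width=640, image_height=480))
```

Because the location ids share one id space with text ids, a single sequence model can in principle read and emit boxes alongside words; the round trip is lossy only up to one quantization bin per coordinate.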
