Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

Abstract

We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon uses the CM3 multi-modal architecture but additionally shows the extreme benefits of scaling up and tuning on more diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multi-task supervised fine-tuning (SFT) stage. It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs. Extensive experiments demonstrate that this recipe is highly effective for multi-modal models. CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods (zero-shot MS-COCO FID of 4.88). After SFT, CM3Leon can also demonstrate unprecedented levels of controllability in tasks ranging from language-guided image editing to image-controlled generation and segmentation.
