Chameleon: Mixed-Modal Early-Fusion Foundation Models

Abstract

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities: it achieves state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or the outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in unified modeling of full multimodal documents.
