BLIP3-o: A Family of Fully Open Unified Multimodal Models - Architecture, Training and Dataset

Abstract

Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework that also performs image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models (training first on image understanding and then on image generation) offers practical advantages: it preserves image understanding capability while developing strong image generation ability. Finally, we carefully curate BLIP3o-60k, a high-quality instruction-tuning dataset for image generation, by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance on most popular benchmarks spanning both image understanding and image generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and the pretraining and instruction-tuning datasets.
