BLIP3o-NEXT: Next Frontier of Native Image Generation

Abstract

We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture and demonstrates strong capabilities in both. In developing this state-of-the-art native image generation model, we identify four key insights: (1) most architectural choices yield comparable performance, so an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) the successful application of reinforcement learning can further push the frontier of native image generation; (3) image editing remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and a data engine; (4) data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building on these insights, BLIP3o-NEXT adopts an Autoregressive + Diffusion architecture: an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, and its hidden states are then used as conditioning signals for a diffusion model that generates high-fidelity images. This architecture combines the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations on various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.
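The two-stage data flow described in the abstract can be illustrated with a toy sketch. This is not the BLIP3o-NEXT implementation: every function name, dimension, and update rule below is a hypothetical stand-in, chosen only to show that the autoregressive stage emits discrete tokens plus one hidden state per token, and that the second stage consumes those hidden states (not the tokens) as its conditioning signal. The "denoising" loop is a placeholder for a real diffusion sampler.

```python
import random


def toy_autoregressive(prompt_tokens, num_image_tokens, vocab_size=16, hidden_dim=4):
    """Hypothetical stand-in for the autoregressive model.

    Emits discrete image tokens one at a time, each conditioned on the
    prompt and on all previously emitted tokens, and records one hidden
    state per emitted token.
    """
    context = list(prompt_tokens)
    image_tokens, hidden_states = [], []
    for _ in range(num_image_tokens):
        # Toy integer features that depend on the full context so far.
        feats = [(sum(context) + (d + 1) * len(context)) % 7 for d in range(hidden_dim)]
        hidden = [f / 7.0 for f in feats]      # toy hidden state in [0, 1)
        token = sum(feats) % vocab_size        # toy "prediction" of the next token
        image_tokens.append(token)
        hidden_states.append(hidden)
        context.append(token)                  # autoregressive feedback
    return image_tokens, hidden_states


def pool_condition(hidden_states):
    """Average the per-token hidden states into one conditioning vector (an assumption)."""
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / len(hidden_states) for d in range(dim)]


def toy_diffusion(hidden_states, num_steps=10, seed=0):
    """Placeholder for the diffusion model.

    Starts from Gaussian noise and iteratively moves the sample toward the
    conditioning vector. A real diffusion model would instead run a learned
    denoiser conditioned on the hidden states.
    """
    rng = random.Random(seed)
    cond = pool_condition(hidden_states)
    x = [rng.gauss(0.0, 1.0) for _ in cond]    # initial noise
    for _ in range(num_steps):
        x = [xi + 0.5 * (ci - xi) for xi, ci in zip(x, cond)]
    return x


def generate(prompt_tokens, num_image_tokens=8):
    """Wire the two stages together: AR tokens and hidden states, then conditioned refinement."""
    tokens, hidden = toy_autoregressive(prompt_tokens, num_image_tokens)
    image = toy_diffusion(hidden)
    return tokens, hidden, image


if __name__ == "__main__":
    tokens, hidden, image = generate([3, 1, 4])
    print("discrete image tokens:", tokens)
    print("toy 'image' vector:", image)
```

In this sketch the discrete tokens serve only as the autoregressive model's own feedback, while the continuous hidden states carry the information passed to the second stage, which mirrors the division of labor the abstract describes.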
