Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Abstract

We introduce Lumina-Image 2.0, an advanced text-to-image (T2I) generation framework that achieves significant progress over its predecessor, Lumina-Next. Lumina-Image 2.0 is built on two key principles. (1) Unification: it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and seamless task expansion. In addition, because high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), designed specifically for T2I generation tasks. UniCap generates comprehensive and accurate captions, which accelerates convergence and improves prompt adherence. (2) Efficiency: to improve the efficiency of the proposed model, we develop multi-stage progressive training strategies and introduce inference acceleration techniques that do not compromise image quality. Extensive evaluations on academic benchmarks and public text-to-image arenas show that Lumina-Image 2.0 delivers strong performance even with only 2.6B parameters, highlighting its scalability and design efficiency. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-Image-2.0.
