Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Abstract

We introduce Lumina-Image 2.0, an advanced text-to-image (T2I) generation framework that improves significantly on its predecessor, Lumina-Next. Lumina-Image 2.0 is built on two key principles. (1) Unification: it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. In addition, because high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), designed specifically for T2I generation tasks. UniCap excels at generating comprehensive and accurate captions, which accelerates convergence and enhances prompt adherence. (2) Efficiency: to improve the efficiency of the proposed model, we develop multi-stage progressive training strategies and introduce inference acceleration techniques that do not compromise image quality. Extensive evaluations on academic benchmarks and public text-to-image arenas show that Lumina-Image 2.0 delivers strong performance even with only 2.6B parameters, highlighting its scalability and design efficiency. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-Image-2.0.
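
To make the joint-sequence idea concrete, the following is a minimal sketch of single-head self-attention over concatenated text and image tokens. It is an illustration under stated assumptions, not the Unified Next-DiT implementation: the function names, the toy token vectors, and the absence of learned projections, positional encodings, and multiple heads are all simplifications introduced here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(text_tokens, image_tokens):
    """Single-head self-attention over one concatenated token sequence.

    Hypothetical helper for illustration: queries, keys, and values are the
    raw token vectors themselves (no learned projections).
    """
    seq = text_tokens + image_tokens  # text and image form one joint sequence
    dim = len(seq[0])
    scale = 1.0 / math.sqrt(dim)

    # Every token scores every other token, regardless of modality.
    weights = []
    for q in seq:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in seq]
        weights.append(softmax(scores))

    # Each output is an attention-weighted mix of all tokens in the sequence.
    outputs = [
        [sum(w * v[d] for w, v in zip(row, seq)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


# Toy data (assumed): 2 text tokens and 3 image tokens of width 4.
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_tokens = [
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]
outputs, weights = joint_self_attention(text_tokens, image_tokens)
```

Because both modalities share one sequence, each row of `weights` spans text and image positions alike, so cross-modal interaction needs no separate cross-attention module in this sketch.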