Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
March 27, 2025
Authors: Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, Xiangyang Zhu, Manyuan Zhang, Will Beddow, Erwann Millon, Victor Perez, Wenhai Wang, Conghui He, Bo Zhang, Xiaohong Liu, Hongsheng Li, Yu Qiao, Chang Xu, Peng Gao
cs.AI
Abstract
We introduce Lumina-Image 2.0, an advanced text-to-image generation framework
that achieves significant progress compared to previous work, Lumina-Next.
Lumina-Image 2.0 is built upon two key principles: (1) Unification - it adopts
a unified architecture (Unified Next-DiT) that treats text and image tokens as
a joint sequence, enabling natural cross-modal interactions and allowing
seamless task expansion. In addition, because high-quality captioners can provide
semantically well-aligned text-image training pairs, we introduce a unified
captioning system, Unified Captioner (UniCap), specifically designed for T2I
generation tasks. UniCap excels at generating comprehensive and accurate
captions, accelerating convergence and enhancing prompt adherence. (2)
Efficiency - to improve the efficiency of our proposed model, we develop
multi-stage progressive training strategies and introduce inference
acceleration techniques without compromising image quality. Extensive
evaluations on academic benchmarks and public text-to-image arenas show that
Lumina-Image 2.0 delivers strong performance even with only 2.6B parameters,
highlighting its scalability and design efficiency. We have released our
training details, code, and models at
https://github.com/Alpha-VLLM/Lumina-Image-2.0.
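
To make the "joint sequence" idea in the abstract concrete, the following is a minimal Python sketch of self-attention applied to text and image tokens concatenated into one sequence. It is not the Unified Next-DiT implementation, whose details are not given in this abstract. The function name `joint_self_attention`, the toy dimensions, and the use of raw token vectors as queries, keys, and values (no learned projections, single head) are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(text_tokens, image_tokens):
    """Single-head self-attention over the concatenated [text; image] sequence.

    Hypothetical helper: queries, keys, and values are the raw token
    vectors themselves, an assumption made to keep the sketch short.
    """
    seq = text_tokens + image_tokens  # one joint sequence, text first
    dim = len(seq[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in seq:
        # Every token scores against every token, regardless of modality.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in seq]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * seq[j][d] for j in range(len(seq))) for d in range(dim)]
        )
    return outputs, weights


random.seed(0)
dim = 4  # toy embedding size (assumption)
text = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
image = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
out, weights = joint_self_attention(text, image)
```

Because both modalities share one sequence, each image token's attention row includes weights on the text tokens and vice versa, which is the kind of cross-modal interaction the abstract attributes to the unified design.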