
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

August 5, 2024
Authors: Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, Peng Gao
cs.AI

Abstract

We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. Unlike existing autoregressive image generation approaches, Lumina-mGPT employs a pretrained decoder-only transformer as a unified framework for modeling multimodal token sequences. Our key insight is that a simple decoder-only transformer with multimodal Generative PreTraining (mGPT), utilizing the next-token prediction objective on massive interleaved text-image sequences, can learn broad and general multimodal capabilities, thereby illuminating photorealistic text-to-image generation. Building on these pretrained models, we propose Flexible Progressive Supervised Finetuning (FP-SFT) on high-quality image-text pairs to fully unlock their potential for high-aesthetic image synthesis at any resolution while maintaining their general multimodal capabilities. Furthermore, we introduce Omnipotent Supervised Finetuning (Omni-SFT), transforming Lumina-mGPT into a foundation model that seamlessly achieves omnipotent task unification. The resulting model demonstrates versatile multimodal capabilities, including visual generation tasks like flexible text-to-image generation and controllable generation, visual recognition tasks like segmentation and depth estimation, and vision-language tasks like multiturn visual question answering. Additionally, we analyze the differences and similarities between diffusion-based and autoregressive methods in a direct comparison.
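
To make the core training objective concrete, the sketch below (not the authors' code) illustrates what next-token prediction over an interleaved text-image sequence looks like: text tokens and image tokens from a VQ-style image tokenizer share one vocabulary, and a small causal decoder-only transformer is trained to predict every next token. All model sizes, vocabulary sizes, and the toy data are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of multimodal next-token prediction with a
# decoder-only transformer over interleaved text and image tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_TEXT = 32000       # assumed text vocabulary size
VOCAB_IMAGE = 8192       # assumed image-codebook size
VOCAB = VOCAB_TEXT + VOCAB_IMAGE  # shared vocabulary; image codes are offset past text ids
D_MODEL, N_HEAD, N_LAYER, SEQ_LEN = 512, 8, 6, 256

class TinyDecoderOnly(nn.Module):
    """Toy causal (decoder-only) transformer over the joint text+image vocabulary."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(SEQ_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEAD, dim_feedforward=4 * D_MODEL,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYER)
        self.head = nn.Linear(D_MODEL, VOCAB, bias=False)

    def forward(self, tokens):  # tokens: (batch, seq_len)
        B, T = tokens.shape
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        # Causal mask so each position only attends to earlier tokens.
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        x = self.blocks(x, mask=causal)
        return self.head(x)  # (batch, seq_len, VOCAB) next-token logits

# Fake interleaved batch: a text prefix followed by image-code tokens,
# with image ids offset by VOCAB_TEXT so both modalities share one head.
text = torch.randint(0, VOCAB_TEXT, (2, 64))
image = torch.randint(0, VOCAB_IMAGE, (2, SEQ_LEN - 64)) + VOCAB_TEXT
seq = torch.cat([text, image], dim=1)  # (2, SEQ_LEN)

model = TinyDecoderOnly()
logits = model(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()  # one next-token-prediction training step (optimizer omitted)
```

Because the same objective covers every token regardless of modality, the same model can, in principle, continue a text prompt with image tokens (text-to-image) or continue image tokens with text (captioning, VQA), which is the unification the abstract describes.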
