Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
August 5, 2024
Authors: Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, Peng Gao
cs.AI
Abstract
We present Lumina-mGPT, a family of multimodal autoregressive models capable
of various vision and language tasks, particularly excelling in generating
flexible photorealistic images from text descriptions. Unlike existing
autoregressive image generation approaches, Lumina-mGPT employs a pretrained
decoder-only transformer as a unified framework for modeling multimodal token
sequences. Our key insight is that a simple decoder-only transformer with
multimodal Generative PreTraining (mGPT), utilizing the next-token prediction
objective on massive interleaved text-image sequences, can learn broad and
general multimodal capabilities, thereby illuminating photorealistic
text-to-image generation. Building on these pretrained models, we propose
Flexible Progressive Supervised Finetuning (FP-SFT) on high-quality image-text
pairs to fully unlock their potential for high-aesthetic image synthesis at any
resolution while maintaining their general multimodal capabilities.
Furthermore, we introduce Omnipotent Supervised Finetuning (Omni-SFT),
transforming Lumina-mGPT into a foundation model that seamlessly achieves
omnipotent task unification. The resulting model demonstrates versatile
multimodal capabilities, including visual generation tasks like flexible
text-to-image generation and controllable generation, visual recognition tasks
like segmentation and depth estimation, and vision-language tasks like
multiturn visual question answering. Additionally, we analyze the differences
and similarities between diffusion-based and autoregressive methods in a direct
comparison.
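The abstract's central idea, modeling interleaved text-image data as a single token stream for a decoder-only transformer trained with next-token prediction, can be illustrated with a minimal sketch. This is not the authors' code: the vocabulary split, ID offsets, and BOI/EOI markers below are hypothetical choices standing in for whatever text tokenizer and visual tokenizer the actual model uses.

```python
# Illustrative sketch (not Lumina-mGPT's implementation): flattening
# interleaved text-image segments into one token sequence so that a single
# decoder-only transformer can train on it with next-token prediction.
# All IDs and offsets below are hypothetical.

TEXT_VOCAB = 1000          # hypothetical: text token IDs occupy [0, 1000)
IMAGE_VOCAB_OFFSET = 1000  # hypothetical: image codes are shifted here
BOI, EOI = 2000, 2001      # hypothetical begin/end-of-image markers

def interleave(segments):
    """Flatten alternating text/image segments into one token sequence.

    segments: list of ("text", [ids]) or ("image", [codes]) pairs.
    Image codes (e.g. from a VQ codebook) are offset into their own ID
    range and wrapped with BOI/EOI markers, so both modalities live in
    one shared vocabulary for the transformer.
    """
    tokens = []
    for kind, ids in segments:
        if kind == "text":
            tokens.extend(ids)
        else:
            tokens.append(BOI)
            tokens.extend(i + IMAGE_VOCAB_OFFSET for i in ids)
            tokens.append(EOI)
    return tokens

def next_token_pairs(tokens):
    """Next-token prediction targets: position t predicts token t + 1."""
    return list(zip(tokens[:-1], tokens[1:]))

seq = interleave([("text", [5, 7]), ("image", [3, 9])])
# seq -> [5, 7, 2000, 1003, 1009, 2001]
pairs = next_token_pairs(seq)
```

Because image tokens are just additional vocabulary entries, the same cross-entropy loss covers text generation and image generation alike, which is what lets one pretrained model be finetuned toward both, as the FP-SFT and Omni-SFT stages describe.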