

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

June 10, 2024
作者: Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan
cs.AI

Abstract

We introduce LlamaGen, a new family of image generation models that applies the original "next-token prediction" paradigm of large language models to the visual generation domain. It is an affirmative answer to the question of whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals, can achieve state-of-the-art image generation performance when scaled properly. We reexamine the design space of image tokenizers, the scalability properties of image generation models, and the quality of their training data. The outcomes of this exploration are: (1) An image tokenizer with a downsample ratio of 16, a reconstruction quality of 0.94 rFID, and 97% codebook usage on the ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on the ImageNet 256x256 benchmark, outperforming popular diffusion models such as LDM and DiT. (3) A text-conditional image generation model with 775M parameters, trained in two stages on LAION-COCO and high-aesthetic-quality images, demonstrating competitive visual quality and text alignment. (4) We verify the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models, achieving a 326%-414% speedup. We release all models and code to support the open-source community around visual generation and multimodal foundation models.
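To make the abstract's numbers concrete, here is a minimal, hypothetical sketch (not the authors' code) of the next-token generation loop the paper describes: a tokenizer with downsample ratio 16 maps a 256x256 image to a 16x16 grid of discrete codes, so generating one image means autoregressively sampling 256 tokens. The codebook size and the `dummy_next_token` stand-in for the Llama-style transformer are assumptions for illustration.

```python
import random

IMAGE_SIZE = 256
DOWNSAMPLE = 16          # downsample ratio reported in the paper
CODEBOOK_SIZE = 16384    # hypothetical codebook size for illustration

GRID = IMAGE_SIZE // DOWNSAMPLE   # 16 codes per side
NUM_TOKENS = GRID * GRID          # 256 codes per image

def dummy_next_token(prefix):
    """Stand-in for the Llama-style transformer's next-token sampler."""
    return random.randrange(CODEBOOK_SIZE)

def generate_image_tokens(class_label):
    """Autoregressively sample the discrete code sequence for one image."""
    tokens = [class_label]            # class-conditional prefix token
    for _ in range(NUM_TOKENS):
        tokens.append(dummy_next_token(tokens))
    return tokens[1:]                 # drop the conditioning token

tokens = generate_image_tokens(class_label=207)
print(len(tokens))  # 256 codes, to be decoded back to pixels by the tokenizer
```

In the real system the sampled codes are fed to the tokenizer's decoder to reconstruct pixels; the sketch only shows why a 256x256 image costs 256 sequential sampling steps, which is where the cited LLM serving-framework speedups apply.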

