Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

June 10, 2024
Authors: Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan
cs.AI

Abstract

We introduce LlamaGen, a new family of image generation models that applies the original "next-token prediction" paradigm of large language models to the visual generation domain. It is an affirmative answer to the question of whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals, can achieve state-of-the-art image generation performance when scaled properly. We reexamine the design space of image tokenizers, the scalability properties of image generation models, and the quality of their training data. The outcome of this exploration consists of: (1) An image tokenizer with a downsample ratio of 16, a reconstruction quality of 0.94 rFID, and a codebook usage of 97% on the ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on the ImageNet 256x256 benchmark and outperforming popular diffusion models such as LDM and DiT. (3) A text-conditional image generation model with 775M parameters, trained in two stages on LAION-COCO and high-aesthetic-quality images, demonstrating competitive visual quality and text alignment. (4) We verify the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models, achieving a 326%-414% speedup. We release all models and code to support the open-source community around visual generation and multimodal foundation models.
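The core idea in the abstract can be sketched as follows: an image tokenizer with a downsample ratio of 16 turns a 256x256 image into a 16x16 grid of discrete codebook indices, which is flattened into a 1-D sequence and generated one token at a time, exactly like text. The sketch below is a minimal toy illustration of that autoregressive loop; the model stand-in and all names are hypothetical, not the actual LlamaGen implementation (which uses a VQ tokenizer and a Llama-style transformer).

```python
import random

# Hypothetical constants mirroring the setup described in the abstract:
CODEBOOK_SIZE = 1024    # size of the image tokenizer's codebook (toy value)
GRID = 16               # 256 / 16 downsample ratio -> a 16x16 token grid

def toy_next_token_scores(prefix, class_label):
    """Stand-in for the transformer: returns one score per codebook entry,
    conditioned on the class label and the tokens generated so far."""
    rng = random.Random(hash((len(prefix), class_label, tuple(prefix[-2:]))))
    return [rng.random() for _ in range(CODEBOOK_SIZE)]

def generate_image_tokens(class_label):
    """Sample a flattened 16x16 grid of image tokens autoregressively,
    one "next token" at a time (greedy decoding for simplicity)."""
    tokens = []
    for _ in range(GRID * GRID):
        scores = toy_next_token_scores(tokens, class_label)
        tokens.append(max(range(CODEBOOK_SIZE), key=scores.__getitem__))
    return tokens

tokens = generate_image_tokens(class_label=207)
# One discrete token per patch; a decoder would map these back to pixels.
assert len(tokens) == GRID * GRID
```

In the real system the greedy step is replaced by temperature/top-k sampling, and the resulting token grid is decoded back to pixels by the tokenizer's decoder.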
