

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

April 3, 2024
作者: Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang
cs.AI

Abstract

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions quickly and generalize well: VAR, for the first time, makes AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves on the AR baseline, raising the Fréchet Inception Distance (FID) from 18.65 to 1.80 and the Inception Score (IS) from 80.4 to 356.4, with around 20x faster inference. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions, including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization in downstream tasks including image in-painting, out-painting, and editing. These results suggest that VAR has begun to emulate two important properties of LLMs: scaling laws and zero-shot task generalization. We have released all models and code to promote the exploration of AR/VAR models for visual generation and unified learning.
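The coarse-to-fine generation loop described above can be sketched as follows. This is a minimal illustration of the next-scale idea only, with a random stand-in for the AR transformer; the scale schedule, `upsample`, and `dummy_predictor` are hypothetical names, not the paper's implementation:

```python
import numpy as np

def upsample(token_map, size):
    """Nearest-neighbor upsample of a square 2D token map to (size, size)."""
    h = token_map.shape[0]
    idx = np.arange(size) * h // size
    return token_map[np.ix_(idx, idx)]

def dummy_predictor(context, size, rng):
    """Stand-in for the AR transformer: emits a full token map at one scale.

    In VAR, this step would condition on the upsampled coarser maps in
    `context`; here it just draws random codebook indices.
    """
    return rng.integers(0, 4096, size=(size, size))

def generate(scales=(1, 2, 4, 8, 16), seed=0):
    """Each autoregressive step predicts the entire next-scale token map,
    conditioned on all coarser maps, instead of one raster-scan token."""
    rng = np.random.default_rng(seed)
    maps = []
    for s in scales:
        context = [upsample(m, s) for m in maps]  # all coarser scales
        maps.append(dummy_predictor(context, s, rng))
    return maps

maps = generate()
print([m.shape for m in maps])  # one token map per resolution
```

The key contrast with raster-scan AR models is the unit of prediction: a whole resolution level per step rather than a single token, which is what enables the reported inference speedup.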
