Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction
March 20, 2025
Authors: Ziyao Guo, Kaipeng Zhang, Michael Qizhe Shieh
cs.AI
Abstract
Autoregressive models have shown remarkable success in image generation by
adapting sequential prediction techniques from language modeling. However,
applying these approaches to images requires discretizing continuous pixel data
through vector quantization methods like VQ-VAE. To alleviate the quantization
errors inherent in VQ-VAE, recent works tend to use larger codebooks.
However, this accordingly expands the vocabulary size, complicating the
autoregressive modeling task. This paper aims to find a way to enjoy the
benefits of large codebooks without making autoregressive modeling more
difficult. Through empirical investigation, we discover that tokens with
similar codeword representations produce similar effects on the final generated
image, revealing significant redundancy in large codebooks. Based on this
insight, we propose to predict tokens from coarse to fine (CTF), realized by
assigning the same coarse label for similar tokens. Our framework consists of
two stages: (1) an autoregressive model that sequentially predicts coarse
labels for each token in the sequence, and (2) an auxiliary model that
simultaneously predicts fine-grained labels for all tokens conditioned on their
coarse labels. Experiments on ImageNet demonstrate our method's superior
performance, achieving an average improvement of 59 points in Inception Score
compared to baselines. Notably, despite adding an inference step, our approach
achieves faster sampling speeds.
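The core idea of assigning the same coarse label to tokens with similar codeword representations can be illustrated by clustering the VQ-VAE codebook embeddings. The sketch below uses a minimal k-means routine; the function name, the choice of k-means, and the toy codebook sizes are illustrative assumptions, not details specified by the paper.

```python
import numpy as np

def assign_coarse_labels(codebook, n_coarse, n_iters=20, seed=0):
    """Cluster codeword embeddings so similar tokens share one coarse label
    (a minimal k-means sketch; the paper does not prescribe this exact routine)."""
    rng = np.random.default_rng(seed)
    # Initialize cluster centers from randomly chosen codewords.
    centers = codebook[rng.choice(len(codebook), n_coarse, replace=False)]
    for _ in range(n_iters):
        # Assign each codeword to its nearest center (Euclidean distance).
        dists = np.linalg.norm(codebook[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned codewords.
        for k in range(n_coarse):
            if (labels == k).any():
                centers[k] = codebook[labels == k].mean(axis=0)
    return labels

# Toy codebook: 4096 codewords of dimension 8, mapped to 64 coarse labels.
codebook = np.random.default_rng(1).normal(size=(4096, 8))
coarse = assign_coarse_labels(codebook, n_coarse=64)
```

With such a mapping, the first-stage autoregressive model would predict over only 64 coarse labels instead of 4096 tokens, while the second-stage model recovers the fine-grained token identity within each cluster.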