Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction
March 20, 2025
Authors: Ziyao Guo, Kaipeng Zhang, Michael Qizhe Shieh
cs.AI
Abstract
Autoregressive models have shown remarkable success in image generation by
adapting sequential prediction techniques from language modeling. However,
applying these approaches to images requires discretizing continuous pixel data
through vector quantization methods such as VQ-VAE. To reduce the quantization
error in VQ-VAE, recent work has tended to use larger codebooks. However, a
larger codebook expands the vocabulary, complicating the autoregressive
modeling task. This paper aims to find a way to enjoy the
benefits of large codebooks without making autoregressive modeling more
difficult. Through empirical investigation, we discover that tokens with
similar codeword representations produce similar effects on the final generated
image, revealing significant redundancy in large codebooks. Based on this
insight, we propose to predict tokens from coarse to fine (CTF), realized by
assigning the same coarse label to similar tokens. Our framework consists of
two stages: (1) an autoregressive model that sequentially predicts coarse
labels for each token in the sequence, and (2) an auxiliary model that
simultaneously predicts fine-grained labels for all tokens conditioned on their
coarse labels. Experiments on ImageNet demonstrate our method's superior
performance, achieving an average improvement of 59 points in Inception Score
compared to baselines. Notably, despite adding an inference step, our approach
achieves faster sampling speeds.
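The abstract does not specify how coarse labels are derived from the codebook. One natural reading of "assigning the same coarse label for similar tokens" is to cluster the VQ-VAE codeword embeddings; the following is a minimal NumPy sketch under that assumption (the k-means approach, the toy codebook, and all names here are illustrative, not the paper's implementation):

```python
import numpy as np

def assign_coarse_labels(codebook, num_coarse, num_iters=10, seed=0):
    """Cluster codeword embeddings so similar codewords share a coarse label.

    codebook: (V, d) array of codeword embeddings (e.g., from a VQ-VAE).
    num_coarse: number of coarse labels (clusters), much smaller than V.
    Returns a (V,) array mapping each fine token id to a coarse label.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling distinct codewords.
    centroids = codebook[rng.choice(len(codebook), num_coarse, replace=False)]
    for _ in range(num_iters):
        # Assign each codeword to its nearest centroid (squared Euclidean distance).
        dists = ((codebook[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned codewords.
        for k in range(num_coarse):
            members = codebook[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return labels

# Toy example: 4096 fine codewords collapsed to 64 coarse labels.
codebook = np.random.default_rng(1).normal(size=(4096, 8))
coarse = assign_coarse_labels(codebook, num_coarse=64)
```

The autoregressive model then predicts over 64 coarse classes instead of 4096 fine ones, which is the sense in which a large codebook stops complicating the sequence-modeling task.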
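The two-stage inference described in the abstract (coarse labels predicted sequentially, fine labels for all positions in one conditioned pass) can be sketched as follows. The toy stand-in models, their signatures, and the uniform distributions are assumptions for illustration only:

```python
import numpy as np

def sample_image_tokens(coarse_model, fine_model, seq_len, rng):
    """Two-stage sampling sketch.

    coarse_model(prefix) -> probability vector over coarse labels (hypothetical)
    fine_model(coarse_seq) -> (seq_len, num_fine) probabilities (hypothetical)
    """
    # Stage 1: autoregressively sample one coarse label per token position.
    coarse = []
    for _ in range(seq_len):
        probs = coarse_model(coarse)
        coarse.append(int(rng.choice(len(probs), p=probs)))
    # Stage 2: sample all fine-grained labels simultaneously,
    # conditioned on the full coarse-label sequence.
    fine_probs = fine_model(coarse)
    fine = np.array([rng.choice(fine_probs.shape[1], p=p) for p in fine_probs])
    return np.array(coarse), fine

# Toy stand-ins so the sketch is runnable end to end.
num_coarse, num_fine, seq_len = 8, 4, 16
rng = np.random.default_rng(0)

def coarse_model(prefix):
    return np.full(num_coarse, 1.0 / num_coarse)  # uniform over coarse labels

def fine_model(coarse_seq):
    return np.full((len(coarse_seq), num_fine), 1.0 / num_fine)

c, f = sample_image_tokens(coarse_model, fine_model, seq_len, rng)
```

Because stage 2 is a single parallel pass rather than another token-by-token loop, the extra step adds little latency, which is consistent with the abstract's claim of faster overall sampling.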