Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation
October 10, 2025
Authors: Yao Teng, Fuyun Wang, Xian Liu, Zhekai Chen, Han Shi, Yu Wang, Zhenguo Li, Weiyang Liu, Difan Zou, Xihui Liu
cs.AI
Abstract
As a new paradigm of visual content generation, autoregressive text-to-image
models suffer from slow inference due to their sequential token-by-token
decoding process, often requiring thousands of model forward passes to generate
a single image. To address this inefficiency, we propose Speculative
Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising
process into Jacobi iterations to enable parallel token generation in
autoregressive models. Our method introduces a next-clean-token prediction
paradigm that, through low-cost fine-tuning, enables a pre-trained
autoregressive model to accept noise-perturbed token embeddings and predict
the next clean tokens. This denoising paradigm guides the model toward more
stable Jacobi trajectories. During inference, our method initializes token
sequences with Gaussian noise and performs iterative next-clean-token
prediction in the embedding space. We employ a probabilistic criterion to
verify and accept multiple tokens in parallel, and refine the unaccepted
tokens along the denoising trajectory for the next iteration.
Experiments show that our method can accelerate generation by reducing model
forward passes while maintaining the visual quality of generated images.
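The inference procedure described above can be illustrated with a toy sketch. Everything here is an illustrative assumption rather than the authors' implementation: `toy_model` stands in for one parallel forward pass of a fine-tuned autoregressive model, the vocabulary and window sizes are arbitrary, and the accept/reject rule is the standard speculative-sampling criterion (accept token x with probability min(1, p_new(x) / p_draft(x))).

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8      # toy vocabulary of image tokens (assumption)
SEQ_LEN = 16   # number of tokens to generate
WINDOW = 4     # Jacobi window: tokens predicted in parallel per pass

def toy_model(prefix, draft):
    """Stand-in for one parallel forward pass: returns a probability
    distribution over the next clean token at every draft position,
    conditioned on the accepted prefix. A real system would run a
    fine-tuned transformer over noise-perturbed token embeddings."""
    probs = []
    for i in range(len(draft)):
        ctx = len(prefix) + i
        p = np.ones(VOCAB)
        p[ctx % VOCAB] += 4.0  # deterministic toy preference per position
        probs.append(p / p.sum())
    return np.stack(probs)

def sjd2_decode():
    accepted, forward_passes = [], 0
    # Initialize the draft window from noise: uniform token proposals
    # play the role of the Gaussian-noise initialization in embedding space.
    draft = rng.integers(0, VOCAB, size=WINDOW)
    draft_probs = np.full((WINDOW, VOCAB), 1.0 / VOCAB)
    while len(accepted) < SEQ_LEN:
        new_probs = toy_model(accepted, draft)
        forward_passes += 1
        # Verify draft tokens left to right with the speculative criterion.
        n_accept = 0
        for i, x in enumerate(draft):
            if rng.random() < min(1.0, new_probs[i, x] / draft_probs[i, x]):
                accepted.append(int(x))
                n_accept += 1
                if len(accepted) == SEQ_LEN:
                    break
            else:
                break
        # Refine: unaccepted positions take the model's new prediction
        # (one denoising step along the Jacobi trajectory); the window
        # slides forward and is refilled with fresh noise.
        refined = new_probs[n_accept:].argmax(axis=1)
        fresh = rng.integers(0, VOCAB, size=n_accept)
        draft = np.concatenate([refined, fresh])[:WINDOW]
        draft_probs = np.concatenate(
            [new_probs[n_accept:], np.full((n_accept, VOCAB), 1.0 / VOCAB)]
        )[:WINDOW]
    return accepted, forward_passes

tokens, passes = sjd2_decode()
```

Because several draft tokens can be accepted per forward pass, `passes` typically ends up well below the `SEQ_LEN` passes that sequential token-by-token decoding would require, which is the source of the speed-up.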