Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation
October 23, 2025
Authors: Enshu Liu, Qian Chen, Xuefei Ning, Shengen Yan, Guohao Dai, Zinan Lin, Yu Wang
cs.AI
Abstract
Image Auto-regressive (AR) models have emerged as a powerful paradigm of
visual generative models. Despite their promising performance, they suffer from
slow generation speed due to the large number of sampling steps required.
Although Distilled Decoding 1 (DD1) was recently proposed to enable few-step
sampling for image AR models, it still incurs significant performance
degradation in the one-step setting, and relies on a pre-defined mapping that
limits its flexibility. In this work, we propose a new method, Distilled
Decoding 2 (DD2), to further advance the feasibility of one-step sampling for
image AR models. Unlike DD1, DD2 does not rely on a pre-defined
mapping. We view the original AR model as a teacher model which provides the
ground truth conditional score in the latent embedding space at each token
position. Based on this, we propose a novel conditional score
distillation loss to train a one-step generator. Specifically, we train a
separate network to predict the conditional score of the generated distribution
and apply score distillation at every token position conditioned on previous
tokens. Experimental results show that DD2 enables one-step sampling for image
AR models with a minimal FID increase from 3.40 to 5.43 on ImageNet-256.
Compared to the strongest baseline DD1, DD2 reduces the gap between the
one-step sampling and the original AR model by 67%, with up to 12.3×
training speed-up simultaneously. DD2 takes a significant step toward the goal
of one-step AR generation, opening up new possibilities for fast and
high-quality AR modeling. Code is available at
https://github.com/imagination-research/Distilled-Decoding-2.
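
The abstract's core idea can be illustrated with a toy sketch. The following is a minimal, hypothetical NumPy illustration of the conditional score distillation gradient, not the paper's implementation: for each token position, the gradient on a generated token is the difference between the "fake" score of the generator's distribution and the teacher's conditional score given the prefix. All names (`teacher_mean`, `teacher_score`, `fake_score`) and the unit-variance Gaussian conditionals are simplifying assumptions standing in for the AR model and the separately trained score network.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4   # token embedding dimension (toy)
T = 3   # sequence length (toy)

# Hypothetical "teacher": the conditional distribution of token t given the
# prefix is taken to be N(mu_t, I), with mu_t a simple function of the prefix.
def teacher_mean(prefix):
    return 0.5 * prefix.sum(axis=0) if len(prefix) else np.zeros(D)

def teacher_score(x, prefix):
    # Score of N(mu, I): grad_x log p(x) = -(x - mu)
    return -(x - teacher_mean(prefix))

# Hypothetical "fake" score of the generator's distribution, here a Gaussian
# fit to the generated tokens (stand-in for the separately trained network).
def fake_score(x, mu_fake):
    return -(x - mu_fake)

# One-step generator: map noise directly to a token sequence (toy linear map).
W = 0.1 * rng.normal(size=(D, D))
z = rng.normal(size=(T, D))
tokens = z @ W

# Conditional score distillation: at each position t, the gradient signal on
# token x_t is s_fake(x_t) - s_teacher(x_t | x_<t).
mu_fake = tokens.mean(axis=0)
grads = np.stack([
    fake_score(tokens[t], mu_fake) - teacher_score(tokens[t], tokens[:t])
    for t in range(T)
])
print(grads.shape)  # one gradient vector per token position: (3, 4)
```

With unit-variance Gaussians the per-token gradient reduces to the gap between the fake mean and the teacher's conditional mean, which makes the distillation direction easy to inspect in this toy setting; the real method replaces both analytic scores with learned networks in the AR model's latent embedding space.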