Scale-Wise VAR is Secretly Discrete Diffusion
September 26, 2025
Authors: Amandeep Kumar, Nithin Gopalakrishnan Nair, Vishal M. Patel
cs.AI
Abstract
Autoregressive (AR) transformers have emerged as a powerful paradigm for visual generation, largely due to their scalability, computational efficiency, and unified architecture across language and vision. Among them, next-scale-prediction Visual Autoregressive Generation (VAR) has recently demonstrated remarkable performance, even surpassing diffusion-based models. In this work, we revisit VAR and uncover a theoretical insight: when equipped with a Markovian attention mask, VAR is mathematically equivalent to a discrete diffusion process. We term this reinterpretation Scalable Visual Refinement with Discrete Diffusion (SRDD), establishing a principled bridge between AR transformers and diffusion models. Leveraging this new perspective, we show how one can directly import the advantages of diffusion, such as iterative refinement, into VAR while reducing architectural inefficiencies, yielding faster convergence, lower inference cost, and improved zero-shot reconstruction. Across multiple datasets, we show that the diffusion-based perspective on VAR leads to consistent gains in efficiency and generation quality.
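The "Markovian attention mask" mentioned above can be illustrated with a minimal sketch. The function below is our own illustration, not code from the paper: it builds a boolean mask over a token sequence partitioned into scales, where a token at scale k may attend only within its own scale and to the immediately preceding scale k-1 (the Markov property), rather than to all earlier scales as in standard VAR.

```python
def markovian_scale_mask(scale_sizes):
    """Sketch of a Markovian attention mask over token scales.

    scale_sizes: number of tokens at each scale, coarse to fine,
    e.g. [1, 2, 4]. Returns a T x T boolean matrix where entry
    [i][j] = True means query token i may attend to key token j.
    """
    # Scale index for every token, e.g. [1, 2, 4] -> [0, 1, 1, 2, 2, 2, 2]
    scale_of = [k for k, n in enumerate(scale_sizes) for _ in range(n)]
    T = len(scale_of)
    # Markov property: attend within the same scale or to the previous one.
    return [
        [scale_of[i] == scale_of[j] or scale_of[i] == scale_of[j] + 1
         for j in range(T)]
        for i in range(T)
    ]

mask = markovian_scale_mask([1, 2, 4])  # 7 tokens across 3 scales
```

With this mask, a token at the finest scale sees the middle scale but not the coarsest one, which is what makes each refinement step depend only on the previous step, mirroring a discrete diffusion chain.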