Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling
October 19, 2025
Authors: Erik Riise, Mehmet Onurcan Kaya, Dim P. Papadopoulos
cs.AI
Abstract
While inference-time scaling through search has revolutionized Large Language
Models, translating these gains to image generation has proven difficult.
Recent attempts to apply search strategies to continuous diffusion models show
limited benefits, with simple random sampling often performing best. We
demonstrate that the discrete, sequential nature of visual autoregressive
models enables effective search for image generation. We show that beam search
substantially improves text-to-image generation, enabling a 2B parameter
autoregressive model to outperform a 12B parameter diffusion model across
benchmarks. Systematic ablations show that this advantage comes from the
discrete token space, which allows early pruning and computational reuse, and
our verifier analysis highlights trade-offs between speed and reasoning
capability. These findings suggest that model architecture, not just scale, is
critical for inference-time optimization in visual generation.
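To make concrete how the discrete, sequential token space enables the search the abstract describes, here is a minimal sketch of verifier-guided beam search over image tokens. This is not the authors' implementation; the callables `propose_next_tokens` and `verifier_score`, and all parameter defaults, are hypothetical stand-ins for the autoregressive model's next-token distribution and a verifier that scores partial sequences against the prompt.

```python
import heapq
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
#   propose_next_tokens(prompt, prefix, k) -> top-k (token_id, logprob) pairs
#     from the autoregressive model conditioned on the prefix.
#   verifier_score(prompt, prefix) -> scalar score for a *partial* sequence.
Sequence = Tuple[int, ...]

def beam_search_image_tokens(
    prompt: str,
    propose_next_tokens: Callable[[str, Sequence, int], List[Tuple[int, float]]],
    verifier_score: Callable[[str, Sequence], float],
    seq_len: int = 256,      # number of discrete image tokens to generate
    beam_width: int = 4,     # beams kept after pruning at each step
    branch_factor: int = 8,  # candidate tokens expanded per beam
) -> Sequence:
    """Sketch of beam search over a discrete image-token space.

    Because tokens are emitted sequentially, partial sequences can be scored
    and pruned early, and the model's KV cache for surviving prefixes can be
    reused across steps (cache reuse is elided here for brevity).
    """
    beams: List[Tuple[float, Sequence]] = [(0.0, ())]
    for _ in range(seq_len):
        candidates: List[Tuple[float, Sequence]] = []
        for _, prefix in beams:
            # Expand each surviving beam with the model's top candidate tokens.
            for token, _logprob in propose_next_tokens(prompt, prefix, branch_factor):
                new_prefix = prefix + (token,)
                # Early pruning: the verifier scores the partial sequence,
                # so weak prefixes are discarded before full generation.
                candidates.append((verifier_score(prompt, new_prefix), new_prefix))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams[0][1]
```

This structure also makes the abstract's verifier trade-off visible: a cheap verifier keeps the per-step pruning fast, while a stronger reasoning verifier scores prefixes more reliably at higher cost per call.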