Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling
October 19, 2025
Authors: Erik Riise, Mehmet Onurcan Kaya, Dim P. Papadopoulos
cs.AI
Abstract
While inference-time scaling through search has revolutionized Large Language
Models, translating these gains to image generation has proven difficult.
Recent attempts to apply search strategies to continuous diffusion models show
limited benefits, with simple random sampling often performing best. We
demonstrate that the discrete, sequential nature of visual autoregressive
models enables effective search for image generation. We show that beam search
substantially improves text-to-image generation, enabling a 2B parameter
autoregressive model to outperform a 12B parameter diffusion model across
benchmarks. Systematic ablations show that this advantage comes from the
discrete token space, which allows early pruning and computational reuse, and
our verifier analysis highlights trade-offs between speed and reasoning
capability. These findings suggest that model architecture, not just scale, is
critical for inference-time optimization in visual generation.
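
To make the abstract's claim concrete, below is a minimal sketch of beam search over a discrete visual-token sequence, illustrating the early pruning and computational reuse the paper attributes to the discrete token space. The functions `next_token_logits` and `verifier_score`, and constants such as `BEAM_WIDTH`, are hypothetical stand-ins, not the authors' actual model or verifier.

```python
# Toy beam search over discrete image tokens with verifier re-ranking.
# All components are placeholders; real VAR models use much larger
# codebooks and learned verifiers (e.g. text-image alignment scorers).
import numpy as np

VOCAB_SIZE = 16      # toy codebook size
SEQ_LEN = 8          # toy number of image tokens to generate
BEAM_WIDTH = 4       # number of partial sequences kept after each step

def next_token_logits(prefix):
    """Stand-in for the autoregressive model: returns logits over the
    codebook given the tokens generated so far (deterministic noise here)."""
    seed = hash(tuple(prefix)) % (2**32)
    return np.random.default_rng(seed).normal(size=VOCAB_SIZE)

def verifier_score(prefix):
    """Stand-in for a verifier used to re-rank partial sequences;
    here it simply rewards low token ids."""
    return -float(np.mean(prefix)) if prefix else 0.0

def beam_search():
    # Each beam entry is (cumulative log-prob, token sequence).
    beams = [(0.0, [])]
    for _ in range(SEQ_LEN):
        candidates = []
        for logp, seq in beams:
            logits = next_token_logits(seq)
            log_probs = logits - np.logaddexp.reduce(logits)  # log-softmax
            # Expand only the top tokens per beam: pruning happens early,
            # on partial sequences, which continuous diffusion sampling
            # cannot do as directly.
            for tok in np.argsort(log_probs)[-BEAM_WIDTH:]:
                candidates.append((logp + log_probs[tok], seq + [int(tok)]))
        # Re-rank by model log-prob plus verifier score, keep the top
        # BEAM_WIDTH prefixes; surviving prefixes are reused (extended)
        # on the next step rather than regenerated from scratch.
        candidates.sort(key=lambda c: c[0] + verifier_score(c[1]), reverse=True)
        beams = candidates[:BEAM_WIDTH]
    return beams[0]

if __name__ == "__main__":
    score, tokens = beam_search()
    print("best sequence:", tokens, "log-prob:", round(score, 3))
```

Calling the verifier on every step, as in this sketch, is the expensive end of the speed-versus-reasoning trade-off the abstract mentions; cheaper variants score candidates only at a few checkpoints.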