Go with Your Gut: Scaling Confidence for Autoregressive Image Generation
September 30, 2025
Authors: Harold Haodong Chen, Xianfeng Wu, Wen-Jie Shu, Rongjin Guo, Disen Lan, Harry Yang, Ying-Cong Chen
cs.AI
Abstract
Test-time scaling (TTS) has demonstrated remarkable success in enhancing
large language models, yet its application to next-token-prediction (NTP)-based
autoregressive (AR) image generation remains largely uncharted. Existing TTS
approaches for visual AR (VAR), which rely on frequent partial decoding and
external reward models, are ill-suited for NTP-based image generation due to
the inherent incompleteness of intermediate decoding results. To bridge this
gap, we introduce ScalingAR, the first TTS framework specifically designed for
NTP-based AR image generation that eliminates the need for early decoding or
auxiliary rewards. ScalingAR leverages token entropy as a novel signal in
visual token generation and operates at two complementary scaling levels: (i)
Profile Level, which streams a calibrated confidence state by fusing intrinsic
and conditional signals; and (ii) Policy Level, which utilizes this state to
adaptively terminate low-confidence trajectories and dynamically schedule
guidance for phase-appropriate conditioning strength. Experiments on both
general and compositional benchmarks show that ScalingAR (1) improves base
models by 12.5% on GenEval and 15.2% on TIIF-Bench, (2) efficiently reduces
visual token consumption by 62.0% while outperforming baselines, and (3)
successfully enhances robustness, mitigating performance drops by 26.0% in
challenging scenarios.
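
The abstract does not give formulas, so the following is only a minimal sketch of how token entropy could drive early termination of a low-confidence trajectory. It assumes Shannon entropy over the softmax of the next-token logits, an exponential moving average as the streamed confidence state, and a fixed threshold with a warm-up period. The function names, the `decay`, `threshold`, and `warmup` parameters, and the averaging rule are all hypothetical illustrations, not the ScalingAR implementation.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalized_confidence(logits):
    """Map entropy to [0, 1]: 1 for a one-hot distribution, 0 for uniform."""
    vocab = len(logits)
    if vocab < 2:
        return 1.0
    return 1.0 - token_entropy(logits) / math.log(vocab)


def update_confidence(state, logits, decay=0.9):
    """Hypothetical streaming rule: exponential moving average of confidence."""
    return decay * state + (1.0 - decay) * normalized_confidence(logits)


def should_terminate(state, step, threshold=0.2, warmup=4):
    """Hypothetical policy: stop a trajectory once its confidence stays low."""
    return step >= warmup and state < threshold


def run_trajectory(logits_per_step, init_state=1.0):
    """Return the step at which the trajectory is cut, or None if it finishes."""
    state = init_state
    for step, logits in enumerate(logits_per_step):
        state = update_confidence(state, logits)
        if should_terminate(state, step):
            return step
    return None
```

Under these assumptions, a trajectory whose per-token distributions stay close to uniform sees its confidence state decay and is cut before it generates all of its tokens, while a trajectory with sharply peaked distributions runs to completion. A real system would read the logits from the AR model at each decoding step.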