

Go with Your Gut: Scaling Confidence for Autoregressive Image Generation

September 30, 2025
Authors: Harold Haodong Chen, Xianfeng Wu, Wen-Jie Shu, Rongjin Guo, Disen Lan, Harry Yang, Ying-Cong Chen
cs.AI

Abstract

Test-time scaling (TTS) has demonstrated remarkable success in enhancing large language models, yet its application to next-token prediction (NTP) autoregressive (AR) image generation remains largely uncharted. Existing TTS approaches for visual AR (VAR), which rely on frequent partial decoding and external reward models, are ill-suited for NTP-based image generation due to the inherent incompleteness of intermediate decoding results. To bridge this gap, we introduce ScalingAR, the first TTS framework specifically designed for NTP-based AR image generation that eliminates the need for early decoding or auxiliary rewards. ScalingAR leverages token entropy as a novel signal in visual token generation and operates at two complementary scaling levels: (i) Profile Level, which streams a calibrated confidence state by fusing intrinsic and conditional signals; and (ii) Policy Level, which utilizes this state to adaptively terminate low-confidence trajectories and dynamically schedule guidance for phase-appropriate conditioning strength. Experiments on both general and compositional benchmarks show that ScalingAR (1) improves base models by 12.5% on GenEval and 15.2% on TIIF-Bench, (2) efficiently reduces visual token consumption by 62.0% while outperforming baselines, and (3) successfully enhances robustness, mitigating performance drops by 26.0% in challenging scenarios.
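
As a rough illustration of the two scaling levels described in the abstract, the sketch below shows how token entropy could drive a streamed confidence state (Profile Level) and a termination/guidance policy (Policy Level). All names (token_entropy, ConfidenceProfile, policy_step), the EMA smoothing, the way the intrinsic and conditional signals are fused, and the thresholds and guidance schedule are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch of entropy-driven confidence scaling for NTP-based AR image
# generation. Hypothetical names and heuristics; not the ScalingAR implementation.
import math
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution, per trajectory."""
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)  # shape: (batch,)

class ConfidenceProfile:
    """Profile Level (assumed form): stream a calibrated confidence state by
    fusing an intrinsic signal (unconditional entropy) with a conditional one
    (entropy under the text prompt), smoothed with an EMA."""
    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.state = None  # running confidence in [0, 1]

    def update(self, cond_logits, uncond_logits, max_entropy):
        h_cond = token_entropy(cond_logits) / max_entropy    # conditional signal
        h_intr = token_entropy(uncond_logits) / max_entropy  # intrinsic signal
        conf = 1.0 - 0.5 * (h_cond + h_intr)                 # fuse into one score
        if self.state is None:
            self.state = conf
        else:
            self.state = self.momentum * self.state + (1 - self.momentum) * conf
        return self.state

def policy_step(conf_state, step, total_steps,
                term_threshold=0.35, g_min=2.0, g_max=7.5):
    """Policy Level (assumed form): terminate low-confidence trajectories and
    schedule guidance strength by generation phase (placeholder ramp)."""
    keep_mask = conf_state > term_threshold        # prune weak trajectories early
    phase = step / total_steps
    guidance = g_min + (g_max - g_min) * phase     # phase-appropriate conditioning
    return keep_mask, guidance

# Example usage with a hypothetical 16384-token visual vocabulary:
# profile = ConfidenceProfile()
# state = profile.update(cond_logits, uncond_logits, math.log(16384))
# keep, cfg_scale = policy_step(state, step=t, total_steps=T)
```

This is only meant to make the abstract's mechanism concrete: entropy stands in for per-token confidence, low-confidence candidates are dropped instead of being scored by an external reward model, and the guidance weight varies over the generation trajectory.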