
TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation

July 24, 2025
作者: Zhekai Chen, Ruihang Chu, Yukang Chen, Shiwei Zhang, Yujie Wei, Yingya Zhang, Xihui Liu
cs.AI

Abstract

Scaling visual generation models is essential for real-world content creation, yet requires substantial training and computational expense. Alternatively, test-time scaling has garnered growing attention due to its resource efficiency and promising performance. In this work, we present TTS-VAR, the first general test-time scaling framework for visual auto-regressive (VAR) models, which models the generation process as a path-searching problem. To dynamically balance computational efficiency with exploration capacity, we first introduce an adaptive descending batch size schedule throughout the causal generation process. In addition, inspired by VAR's hierarchical coarse-to-fine multi-scale generation, our framework integrates two key components: (i) At coarse scales, we observe that generated tokens are hard to evaluate, possibly leading to erroneous acceptance of inferior samples or rejection of superior samples. Noticing that the coarse scales contain sufficient structural information, we propose clustering-based diversity search. It preserves structural variety through semantic feature clustering, enabling later selection of samples with higher potential. (ii) At fine scales, resampling-based potential selection prioritizes promising candidates using potential scores, which are defined as reward functions incorporating the multi-scale generation history. Experiments on the powerful VAR model Infinity show a notable 8.7% GenEval score improvement (from 0.69 to 0.75). Key insights reveal that early-stage structural features effectively influence final quality, and that resampling efficacy varies across generation scales. Code is available at https://github.com/ali-vilab/TTS-VAR.
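
To make the search procedure concrete, below is a minimal Python sketch of the path-search loop the abstract describes: a descending batch-size schedule over scales, k-means clustering of semantic features at the coarse scales, and reward-weighted resampling at the fine scales. The interface (`model.init_state`, `model.step`, `feature_fn`, `reward_fn`) and the schedule values are illustrative assumptions, not the authors' implementation; see the repository linked above for the actual code.

```python
# Hedged sketch of a TTS-VAR-style test-time search loop.
# The VAR interface (init_state, step) and the helper functions are
# hypothetical placeholders, not the released implementation.
import numpy as np
from sklearn.cluster import KMeans

def tts_var_search(model, prompt, reward_fn, feature_fn,
                   batch_schedule=(8, 8, 4, 4, 2, 2, 1, 1),
                   coarse_scales=2, seed=0):
    """Path search over VAR scales with a descending batch-size schedule.

    - Coarse scales: rewards are unreliable, so keep structural diversity by
      clustering semantic features and retaining one candidate per cluster.
    - Fine scales: resample candidates in proportion to a potential score
      (a reward computed over the multi-scale generation history).
    """
    rng = np.random.default_rng(seed)
    candidates = [model.init_state(prompt) for _ in range(batch_schedule[0])]

    for scale, batch_size in enumerate(batch_schedule):
        # Autoregressively extend every surviving candidate by one scale.
        candidates = [model.step(c, scale) for c in candidates]

        if scale < coarse_scales:
            # Clustering-based diversity search at coarse scales.
            feats = np.stack([feature_fn(c) for c in candidates])
            k = min(batch_size, len(candidates))
            labels = KMeans(n_clusters=k, n_init="auto").fit_predict(feats)
            candidates = [candidates[np.where(labels == i)[0][0]]
                          for i in range(k)]
        else:
            # Resampling-based potential selection at fine scales
            # (softmax over potential scores).
            scores = np.array([reward_fn(c) for c in candidates])
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()
            idx = rng.choice(len(candidates), size=batch_size, p=probs)
            candidates = [candidates[i] for i in idx]

    # Return the highest-scoring completed sample.
    return max(candidates, key=reward_fn)
```

The descending schedule spends most of the candidate budget on the early, structure-defining scales and tapers off later, which reflects the efficiency-versus-exploration trade-off the abstract refers to.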