
TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation

July 24, 2025
Authors: Zhekai Chen, Ruihang Chu, Yukang Chen, Shiwei Zhang, Yujie Wei, Yingya Zhang, Xihui Liu
cs.AI

Abstract

Scaling visual generation models is essential for real-world content creation, yet requires substantial training and computational expenses. Alternatively, test-time scaling has garnered growing attention due to its resource efficiency and promising performance. In this work, we present TTS-VAR, the first general test-time scaling framework for visual auto-regressive (VAR) models, modeling the generation process as a path-searching problem. To dynamically balance computational efficiency with exploration capacity, we first introduce an adaptive descending batch-size schedule throughout the causal generation process. In addition, inspired by VAR's hierarchical coarse-to-fine multi-scale generation, our framework integrates two key components: (i) At coarse scales, we observe that generated tokens are hard to evaluate, possibly leading to erroneous acceptance of inferior samples or rejection of superior samples. Noticing that the coarse scales contain sufficient structural information, we propose clustering-based diversity search. It preserves structural variety through semantic feature clustering, enabling later selection of samples with higher potential. (ii) At fine scales, resampling-based potential selection prioritizes promising candidates using potential scores, which are defined as reward functions incorporating the multi-scale generation history. Experiments on the powerful VAR model Infinity show a notable 8.7% relative GenEval score improvement (from 0.69 to 0.75). Key insights reveal that early-stage structural features effectively influence final quality, and that resampling efficacy varies across generation scales. Code is available at https://github.com/ali-vilab/TTS-VAR.
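As a rough illustration of the pipeline the abstract describes — descending batch sizes, clustering-based diversity search at coarse scales, and resampling-based potential selection at fine scales — here is a minimal Python sketch. All function names, the feature extractor, the reward/potential score, and the halving schedule are hypothetical stand-ins for illustration only, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_step(candidates, scale):
    # Hypothetical stand-in for one VAR scale step: each candidate's
    # latent is refined; here we simply perturb a vector.
    return [c + 0.1 * rng.standard_normal(c.shape) for c in candidates]

def semantic_feature(c):
    # Stand-in for a semantic feature extractor used for clustering.
    return c[:4]

def potential_score(c):
    # Stand-in for the reward-based potential score (higher is better).
    return -float(np.linalg.norm(c))

def cluster_diversity_select(candidates, k, iters=10):
    # Coarse scales: k-means on semantic features, keeping one
    # representative per cluster to preserve structural diversity.
    feats = np.stack([semantic_feature(c) for c in candidates])
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(feats[:, None] - centers[None], axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = feats[mask].mean(axis=0)
    reps = []
    for j in range(k):
        idx = np.where(assign == j)[0]
        if len(idx):
            reps.append(candidates[idx[0]])
    return reps

def resample_select(candidates, k):
    # Fine scales: importance-resample candidates in proportion to a
    # softmax over their potential scores.
    scores = np.array([potential_score(c) for c in candidates])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    idx = rng.choice(len(candidates), size=k, replace=True, p=probs)
    return [candidates[i] for i in idx]

def tts_var_sketch(num_scales=6, init_batch=16):
    candidates = [rng.standard_normal(8) for _ in range(init_batch)]
    for s in range(num_scales):
        # Adaptive descending batch size: halve the budget every two scales.
        budget = max(2, init_batch >> (s // 2))
        candidates = generate_step(candidates, s)
        if s < num_scales // 2:  # coarse scales: diversity via clustering
            candidates = cluster_diversity_select(
                candidates, min(budget, len(candidates)))
        else:                    # fine scales: resampling by potential
            candidates = resample_select(candidates, budget)
    # Return the surviving candidate with the best potential score.
    return max(candidates, key=potential_score)
```

The sketch only captures the control flow: a shrinking candidate set walked scale by scale, with the selection rule switching from diversity preservation to reward-guided resampling partway through.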