TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation

Abstract

Scaling visual generation models is essential for real-world content creation, yet it requires substantial training and computational expense. As an alternative, test-time scaling has attracted growing attention for its resource efficiency and promising performance. In this work, we present TTS-VAR, the first general test-time scaling framework for visual auto-regressive (VAR) models, which models the generation process as a path searching problem. To dynamically balance computational efficiency against exploration capacity, we first introduce an adaptive descending batch size schedule throughout the causal generation process. In addition, inspired by VAR's hierarchical coarse-to-fine multi-scale generation, our framework integrates two key components: (i) At coarse scales, we observe that generated tokens are hard to evaluate, which can lead to erroneous acceptance of inferior samples or rejection of superior ones. Noting that the coarse scales already contain sufficient structural information, we propose clustering-based diversity search. It preserves structural variety through semantic feature clustering, so that samples with higher potential can still be selected later. (ii) At fine scales, resampling-based potential selection prioritizes promising candidates using potential scores, which are defined as reward functions that incorporate the multi-scale generation history. Experiments on the powerful VAR model Infinity show a notable 8.7% relative improvement in GenEval score (from 0.69 to 0.75). Key insights reveal that early-stage structural features effectively influence final quality, and that resampling efficacy varies across generation scales. Code is available at https://github.com/ali-vilab/TTS-VAR.
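
The abstract names three mechanisms without giving implementation details, so the following is only a minimal sketch of how they might fit together, not the authors' code. Every name and design choice here is an assumption made for illustration: the linear shape of the batch size schedule, the softmax weighting of potential scores, the `temperature` parameter, and greedy farthest-point selection, which is swapped in as a simple stand-in for the semantic feature clustering described above.

```python
import math
import random


def descending_batch_schedule(num_scales, start, end):
    """Hypothetical linear schedule: batch size shrinks from `start` to `end`."""
    if num_scales == 1:
        return [start]
    step = (start - end) / (num_scales - 1)
    return [max(end, round(start - i * step)) for i in range(num_scales)]


def diverse_subset(features, k):
    """Pick k mutually distant feature vectors (coarse scales).

    Greedy farthest-point selection, used here as a stand-in for clustering.
    Returns the indices of the selected samples.
    """
    chosen = [0]
    # Distance from every sample to its nearest already-chosen sample.
    dist = [math.dist(f, features[0]) for f in features]
    while len(chosen) < min(k, len(features)):
        nxt = max(range(len(features)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(d, math.dist(f, features[nxt]))
                for d, f in zip(dist, features)]
    return chosen


def resample_by_potential(candidates, potentials, k, temperature=1.0, rng=random):
    """Draw k candidates with replacement (fine scales).

    Weights are softmax(potential / temperature); the maximum is subtracted
    before exponentiating for numerical stability.
    """
    top = max(potentials)
    weights = [math.exp((p - top) / temperature) for p in potentials]
    return rng.choices(candidates, weights=weights, k=k)
```

In a full search loop, the schedule would give the number of candidates kept at each scale, with `diverse_subset` applied at the early, coarse scales and `resample_by_potential` at the later, fine ones. Where to switch between the two is a further choice this sketch leaves open.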
