TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation

Abstract

Scaling visual generation models is essential for real-world content creation, yet it requires substantial training and computational expense. Test-time scaling, by contrast, has attracted growing attention for its resource efficiency and promising performance. In this work, we present TTS-VAR, the first general test-time scaling framework for visual auto-regressive (VAR) models, which models the generation process as a path-searching problem. To dynamically balance computational efficiency against exploration capacity, we first introduce an adaptive descending batch size schedule throughout the causal generation process. In addition, inspired by VAR's hierarchical coarse-to-fine multi-scale generation, our framework integrates two key components: (i) At coarse scales, we observe that generated tokens are hard to evaluate, which can lead to erroneous acceptance of inferior samples or rejection of superior ones. Noting that the coarse scales nonetheless contain sufficient structural information, we propose clustering-based diversity search, which preserves structural variety through semantic feature clustering and enables later selection of samples with higher potential. (ii) At fine scales, resampling-based potential selection prioritizes promising candidates using potential scores, defined as reward functions that incorporate the multi-scale generation history. Experiments on the powerful VAR model Infinity show a notable 8.7% relative improvement in GenEval score (from 0.69 to 0.75). Key insights reveal that early-stage structural features effectively influence final quality and that resampling efficacy varies across generation scales. Code is available at https://github.com/ali-vilab/TTS-VAR.
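To make the path-searching view concrete, the following is a minimal sketch of how a descending batch size schedule, clustering at coarse scales, and potential-based resampling at fine scales might fit together. It is not the authors' implementation. Every name here (`cluster_select`, `resample_by_potential`, `search`, `toy_extend`) is hypothetical, and the callables `extend`, `feature_of`, and `potential_of` are stand-ins for the VAR model, a semantic feature extractor, and a reward model. The toy k-means and the softmax weighting of potential scores are assumptions chosen for illustration only.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_select(features, k, rng, iters=10):
    """Toy k-means (assumed): return k distinct indices, one nearest each centroid."""
    centroids = [list(features[i]) for i in rng.sample(range(len(features)), k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in features:
            nearest = min(range(k), key=lambda c: sq_dist(f, centroids[c]))
            groups[nearest].append(f)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    picked = []
    for c in centroids:
        order = sorted(range(len(features)), key=lambda i: sq_dist(features[i], c))
        picked.append(next(i for i in order if i not in picked))
    return picked


def resample_by_potential(scores, n, rng, temperature=1.0):
    """Draw n indices with replacement, weighted by softmax(score / temperature)."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=n)


def search(batch_sizes, num_coarse, extend, feature_of, potential_of, seed=0):
    """Toy path search; batch_sizes has one non-increasing entry per scale."""
    rng = random.Random(seed)
    paths = [[] for _ in range(batch_sizes[0])]
    for scale in range(len(batch_sizes)):
        # Every surviving path is extended by one scale (hypothetical model call).
        paths = [p + [extend(p, scale, rng)] for p in paths]
        if scale + 1 == len(batch_sizes):
            break
        keep = batch_sizes[scale + 1]
        if scale < num_coarse:
            # Coarse scales: keep structurally diverse samples via clustering.
            idx = cluster_select([feature_of(p) for p in paths], keep, rng)
        else:
            # Fine scales: favor high-potential samples via resampling.
            idx = resample_by_potential([potential_of(p) for p in paths], keep, rng)
        paths = [list(paths[i]) for i in idx]
    return max(paths, key=potential_of)


def toy_extend(path, scale, rng):
    """Stand-in for generating one scale: a random number instead of tokens."""
    return rng.gauss(0.0, 1.0)


best = search(
    batch_sizes=[8, 6, 4, 2],
    num_coarse=2,
    extend=toy_extend,
    feature_of=lambda p: list(p),
    potential_of=lambda p: sum(p),
)
print(len(best), best)
```

In this sketch the schedule `[8, 6, 4, 2]` spends most of the sampling budget on the cheap early scales and narrows the batch as scales become more expensive, which is one plausible reading of the "adaptive descending batch size schedule" described above. The split point `num_coarse` is likewise an illustrative parameter, not a value taken from the paper.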
