Scale-Wise VAR is Secretly Discrete Diffusion

Abstract

Autoregressive (AR) transformers have emerged as a powerful paradigm for visual generation, largely due to their scalability, computational efficiency, and an architecture shared across language and vision. Among them, Visual Autoregressive Generation (VAR), which is based on next-scale prediction, has recently demonstrated remarkable performance, even surpassing diffusion-based models. In this work, we revisit VAR and uncover a theoretical insight: when equipped with a Markovian attention mask, VAR is mathematically equivalent to discrete diffusion. We term this reinterpretation Scalable Visual Refinement with Discrete Diffusion (SRDD), establishing a principled bridge between AR transformers and diffusion models. Leveraging this new perspective, we show how one can directly import the advantages of diffusion, such as iterative refinement, into VAR and reduce its architectural inefficiencies, yielding faster convergence, lower inference cost, and improved zero-shot reconstruction. Across multiple datasets, we show that the diffusion-based perspective of VAR leads to consistent gains in efficiency and generation quality.
