Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Abstract

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions quickly and generalize well: VAR, for the first time, makes AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves on the AR baseline, improving the Fréchet inception distance (FID) from 18.65 to 1.80 and the inception score (IS) from 80.4 to 356.4, with around 20x faster inference. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions, including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks, including image in-painting, out-painting, and editing. These results suggest that VAR has begun to emulate two important properties of LLMs: scaling laws and zero-shot task generalization. We have released all models and code to promote the exploration of AR/VAR models for visual generation and unified learning.
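
To make the scaling-law claim concrete: a power law y = c * N^alpha becomes a straight line in log-log space, so a "linear correlation coefficient" near -1 between log N and log y indicates a clean power-law fit with a negative exponent. The sketch below is only an illustration of that check, not the authors' evaluation code. The abstract does not say which quantities were fit, so the pairing of model size against an error metric, the function names `pearson` and `fit_power_law`, and all data values are assumptions made for this example.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_power_law(sizes, errors):
    """Fit errors ~ c * sizes**alpha by least squares in log-log space.

    Returns (alpha, c, r), where r is the Pearson correlation between
    log(sizes) and log(errors).
    """
    log_n = [math.log(n) for n in sizes]
    log_e = [math.log(e) for e in errors]
    k = len(log_n)
    mean_x = sum(log_n) / k
    mean_y = sum(log_e) / k
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(log_n, log_e))
        / sum((x - mean_x) ** 2 for x in log_n)
    )
    intercept = mean_y - slope * mean_x
    return slope, math.exp(intercept), pearson(log_n, log_e)


# Synthetic, made-up data (NOT results from the paper): an exact power law
# with exponent -0.2 and coefficient 5.0 over four hypothetical model sizes.
sizes = [1e7, 1e8, 1e9, 1e10]
errors = [5.0 * n ** -0.2 for n in sizes]

alpha, c, r = fit_power_law(sizes, errors)
print(f"alpha={alpha:.4f}, c={c:.4f}, r={r:.6f}")
```

Because the synthetic data here is noiseless, the correlation comes out at -1 up to floating-point rounding. Real measurements would scatter around the fitted line, which is why a reported value such as -0.998 is read as strong but not perfect agreement with a power law.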
