Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Abstract

We introduce LlamaGen, a new family of image generation models that applies the original "next-token prediction" paradigm of large language models to the visual generation domain. It is an affirmative answer to the question of whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals, can achieve state-of-the-art image generation performance if scaled properly. We reexamine the design space of image tokenizers, the scalability properties of image generation models, and the quality of their training data. The outcome of this exploration consists of: (1) An image tokenizer with a downsample ratio of 16, a reconstruction quality of 0.94 rFID, and a codebook usage of 97% on the ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on the ImageNet 256x256 benchmark and outperforming popular diffusion models such as LDM and DiT. (3) A text-conditional image generation model with 775M parameters, obtained by two-stage training on LAION-COCO and on images of high aesthetic quality, demonstrating competitive performance in visual quality and text alignment. (4) We verify the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models and achieve a 326% - 414% speedup. We release all models and code to support the open-source community of visual generation and multimodal foundation models.
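To make the "next-token prediction" setup concrete, the following is a minimal sketch, not the released LlamaGen code. It shows how a downsample ratio of 16 turns a 256x256 image into a 16 x 16 grid of discrete tokens, and how such tokens could be sampled one at a time. The function names, the uniform stub model, and the codebook size of 8 are illustrative assumptions; a real system would use a trained Transformer and a much larger codebook.

```python
import random


def tokens_per_image(image_size: int, downsample_ratio: int) -> int:
    """Number of discrete tokens for a square image after spatial downsampling."""
    side = image_size // downsample_ratio
    return side * side


def sample_image_tokens(next_token_probs, num_tokens, class_id, rng):
    """Sample tokens one at a time, each conditioned on the class and the prefix."""
    tokens = []
    for _ in range(num_tokens):
        # Distribution over codebook indices for the next position.
        probs = next_token_probs(class_id, tokens)
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


def uniform_stub(class_id, prefix, codebook_size=8):
    """Hypothetical stand-in for a trained model: uniform over a toy codebook."""
    return [1.0 / codebook_size] * codebook_size


# A 256x256 image with downsample ratio 16 gives a 16 x 16 token grid.
n = tokens_per_image(256, 16)
tokens = sample_image_tokens(uniform_stub, n, class_id=0, rng=random.Random(0))
```

In this sketch the sequence length grows quadratically with image side length, which is one reason the downsample ratio of the tokenizer matters for generation cost.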
