End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Abstract

Autoregressive image modeling relies on visual tokenizers to compress images into compact latent representations. We design an end-to-end training pipeline that jointly optimizes reconstruction and generation, enabling direct supervision from generation results to the tokenizer. This contrasts with prior two-stage approaches that train tokenizers and generative models separately. We further investigate leveraging vision foundation models to improve 1D tokenizers for autoregressive modeling. Our autoregressive generative model achieves strong empirical results, including a state-of-the-art FID score of 1.48 without guidance on ImageNet 256x256 generation.
