Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction

Abstract

Autoregressive models have shown remarkable success in image generation by adapting sequential prediction techniques from language modeling. However, applying these approaches to images requires discretizing continuous pixel data through vector quantization methods such as VQ-VAE. To alleviate the quantization errors inherent in VQ-VAE, recent works tend to use larger codebooks. A larger codebook, however, expands the vocabulary size and complicates the autoregressive modeling task. This paper aims to retain the benefits of large codebooks without making autoregressive modeling more difficult. Through empirical investigation, we find that tokens with similar codeword representations produce similar effects on the final generated image, which reveals significant redundancy in large codebooks. Based on this insight, we propose to predict tokens from coarse to fine (CTF) by assigning the same coarse label to similar tokens. Our framework consists of two stages: (1) an autoregressive model that sequentially predicts a coarse label for each token in the sequence, and (2) an auxiliary model that simultaneously predicts fine-grained labels for all tokens, conditioned on their coarse labels. Experiments on ImageNet demonstrate our method's superior performance, with an average improvement of 59 points in Inception Score over baselines. Notably, despite adding an inference step, our approach achieves faster sampling speeds.
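As an illustration of the idea of assigning one coarse label to similar tokens, the following is a minimal sketch, not the authors' implementation. It assumes that similarity is measured by Euclidean distance between codewords and that the grouping is done with plain k-means; the abstract does not specify either choice. The toy codebook and the names `assign_coarse_labels`, `farthest_point_init`, and `fine_to_coarse` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two codeword vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(codebook, num_coarse):
    """Deterministic init: start at codeword 0, then repeatedly add the
    codeword farthest from the centroids chosen so far."""
    centroids = [list(codebook[0])]
    while len(centroids) < num_coarse:
        farthest = max(
            codebook,
            key=lambda c: min(squared_distance(c, m) for m in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def assign_coarse_labels(codebook, num_coarse, iters=10):
    """Assumed grouping rule (k-means): returns, for every fine token id,
    the coarse label of the cluster its codeword falls into."""
    centroids = farthest_point_init(codebook, num_coarse)
    labels = [0] * len(codebook)
    for _ in range(iters):
        # Assignment step: each codeword goes to its nearest centroid.
        labels = [
            min(range(num_coarse), key=lambda k: squared_distance(c, centroids[k]))
            for c in codebook
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(num_coarse):
            members = [c for c, label in zip(codebook, labels) if label == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Toy codebook (hypothetical): six 2-D codewords forming two tight groups.
codebook = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
]
fine_to_coarse = assign_coarse_labels(codebook, num_coarse=2)

# A fine token sequence from the tokenizer maps to a shorter-vocabulary
# coarse sequence, which is what the stage-1 autoregressive model would predict.
fine_tokens = [0, 4, 2, 5]
coarse_tokens = [fine_to_coarse[t] for t in fine_tokens]
print(fine_to_coarse, coarse_tokens)
```

In this sketch the stage-1 vocabulary shrinks from six fine tokens to two coarse labels, and a stage-2 model would only need to choose among the fine tokens that share each predicted coarse label.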
