Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models

Abstract

In this work, we propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation. Unlike training a variational autoencoder (VAE) from scratch, which primarily emphasizes low-level details, our approach leverages the rich semantic structure of foundation encoders. We introduce a three-stage alignment strategy: (1) freeze the encoder and train an adapter and a decoder to establish a semantic latent space; (2) jointly optimize all components with an additional semantic preservation loss, enabling the encoder to capture perceptual details while retaining high-level semantics; and (3) refine the decoder for improved reconstruction quality. This alignment yields semantically rich image tokenizers that benefit diffusion models. On ImageNet 256×256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs, and improves generation both with and without classifier-free guidance. Scaling to LAION, a 2B-parameter text-to-image model trained with our tokenizer consistently outperforms one trained with the FLUX VAE at the same number of training steps. Overall, our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.
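
The three-stage schedule described above can be sketched as a small table of which components are updated in each stage. This is only an illustrative sketch: the component names, the loss names ("reconstruction", "semantic_preservation"), and the choice to freeze both the encoder and the adapter in stage 3 are assumptions for the example, not details confirmed by the abstract.

```python
from dataclasses import dataclass

# Hypothetical component names for the tokenizer pipeline.
COMPONENTS = ("encoder", "adapter", "decoder")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # components whose weights are updated in this stage
    losses: tuple         # active loss terms (names are illustrative)


# Stage contents follow the abstract; loss names and the stage-3
# freezing of the adapter are assumptions made for this sketch.
STAGES = (
    Stage("stage1", frozenset({"adapter", "decoder"}), ("reconstruction",)),
    Stage("stage2", frozenset(COMPONENTS), ("reconstruction", "semantic_preservation")),
    Stage("stage3", frozenset({"decoder"}), ("reconstruction",)),
)


def frozen_components(stage):
    """Return the components kept fixed during a stage, in pipeline order."""
    return [c for c in COMPONENTS if c not in stage.trainable]


for stage in STAGES:
    print(stage.name, "train:", sorted(stage.trainable),
          "frozen:", frozen_components(stage))
```

In a real training loop, a table like this would typically drive which parameters receive gradient updates and which loss terms are summed at each stage.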
