Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models
September 29, 2025
Authors: Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, Kai Zhang
cs.AI
Abstract
In this work, we propose aligning pretrained visual encoders to serve as
tokenizers for latent diffusion models in image generation. Unlike training a
variational autoencoder (VAE) from scratch, which primarily emphasizes
low-level details, our approach leverages the rich semantic structure of
foundation encoders. We introduce a three-stage alignment strategy: (1) freeze
the encoder and train an adapter and a decoder to establish a semantic latent
space; (2) jointly optimize all components with an additional semantic
preservation loss, enabling the encoder to capture perceptual details while
retaining high-level semantics; and (3) refine the decoder for improved
reconstruction quality. This alignment yields semantically rich image
tokenizers that benefit diffusion models. On ImageNet 256×256, our
tokenizer accelerates the convergence of diffusion models, reaching a gFID of
1.90 within just 64 epochs, and improves generation both with and without
classifier-free guidance. When scaled to LAION, a 2B-parameter text-to-image
model trained with our tokenizer consistently outperforms the same model
trained with the FLUX VAE for the same number of training steps. Overall, our
method is simple, scalable, and establishes a
semantically grounded paradigm for continuous tokenizer design.
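To make the three-stage alignment strategy concrete, here is a minimal PyTorch sketch of the training schedule. Everything below is an illustrative reading of the abstract, not the authors' implementation: the module names (Tokenizer, adapter, decoder), the plain MSE term standing in for the full reconstruction objective, the cosine-similarity form of the semantic preservation loss (computed against a frozen copy of the pretrained encoder), and the loss weight lam are all assumptions.

```python
# Hypothetical sketch of the three-stage alignment schedule.
# Stage 1: encoder frozen, adapter + decoder trained.
# Stage 2: all components trained, plus a semantic preservation loss.
# Stage 3: encoder + adapter frozen, decoder refined alone.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Tokenizer(nn.Module):
    def __init__(self, encoder: nn.Module, adapter: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # pretrained visual foundation encoder
        self.adapter = adapter  # maps encoder features to the latent space
        self.decoder = decoder  # reconstructs the image from latents

    def forward(self, x):
        z = self.adapter(self.encoder(x))
        return self.decoder(z), z


def semantic_preservation_loss(encoder: nn.Module, teacher: nn.Module, x):
    """One plausible instantiation: keep the fine-tuned encoder's features
    close (in cosine similarity) to those of a frozen copy of itself."""
    with torch.no_grad():
        ref = teacher(x)
    return 1.0 - F.cosine_similarity(encoder(x).flatten(1), ref.flatten(1)).mean()


def train_stage(tok: Tokenizer, teacher: nn.Module, loader, stage: int,
                steps: int, lam: float = 0.5):
    # Freeze/unfreeze components according to the stage.
    for p in tok.encoder.parameters():
        p.requires_grad = stage == 2
    for p in tok.adapter.parameters():
        p.requires_grad = stage in (1, 2)
    # The decoder is trained in every stage.
    opt = torch.optim.AdamW(
        [p for p in tok.parameters() if p.requires_grad], lr=1e-4)
    for _, x in zip(range(steps), loader):
        recon, _ = tok(x)
        loss = F.mse_loss(recon, x)  # stand-in for the reconstruction objective
        if stage == 2:
            loss = loss + lam * semantic_preservation_loss(tok.encoder, teacher, x)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Under this reading, the freezing schedule mirrors the three stages directly: stage 1 builds the semantic latent space around the untouched encoder, stage 2 lets the encoder absorb perceptual detail while the preservation term anchors it to its pretrained semantics, and stage 3 polishes reconstruction without drifting the latent space.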