Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models
September 29, 2025
Authors: Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, Kai Zhang
cs.AI
Abstract
In this work, we propose aligning pretrained visual encoders to serve as
tokenizers for latent diffusion models in image generation. Unlike training a
variational autoencoder (VAE) from scratch, which primarily emphasizes
low-level details, our approach leverages the rich semantic structure of
foundation encoders. We introduce a three-stage alignment strategy: (1) freeze
the encoder and train an adapter and a decoder to establish a semantic latent
space; (2) jointly optimize all components with an additional semantic
preservation loss, enabling the encoder to capture perceptual details while
retaining high-level semantics; and (3) refine the decoder for improved
reconstruction quality. This alignment yields semantically rich image
tokenizers that benefit diffusion models. On ImageNet 256×256, our
tokenizer accelerates the convergence of diffusion models, reaching a gFID of
1.90 within just 64 epochs, and improves generation both with and without
classifier-free guidance. Scaling to LAION, a 2B-parameter text-to-image model
trained with our tokenizer consistently outperforms its counterpart trained with
the FLUX VAE under the same number of training steps. Overall, our method is
simple, scalable, and establishes a
semantically grounded paradigm for continuous tokenizer design.
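
To make the three-stage schedule concrete, below is a minimal PyTorch sketch of how the phases could be wired together: stage 1 freezes the encoder and trains the adapter and decoder, stage 2 jointly trains everything with an added semantic-preservation term, and stage 3 refines only the decoder. The toy encoder/adapter/decoder modules, the loss weighting, and the use of a frozen copy of the original encoder as the semantic-preservation target are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the three-stage alignment schedule described in the abstract.
# Module shapes, loss weights, and the frozen-reference semantic loss are
# illustrative assumptions, not the authors' exact recipe.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class FoundationEncoder(nn.Module):
    """Stand-in for a pretrained visual foundation encoder."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, dim, 16, 16), nn.GELU(),
                                 nn.Conv2d(dim, dim, 1))
    def forward(self, x):
        return self.net(x)  # (B, dim, H/16, W/16) feature map

class Adapter(nn.Module):
    """Maps encoder features to the latent space used by the diffusion model."""
    def __init__(self, dim=256, latent=16):
        super().__init__()
        self.proj = nn.Conv2d(dim, latent, 1)
    def forward(self, f):
        return self.proj(f)

class Decoder(nn.Module):
    """Reconstructs pixels from the latent tokens."""
    def __init__(self, latent=16):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(latent, 256, 1), nn.GELU(),
                                 nn.ConvTranspose2d(256, 3, 16, 16))
    def forward(self, z):
        return self.net(z)

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def train_stage(stage, encoder, adapter, decoder, frozen_ref, loader, steps=4):
    """Run one alignment stage; which parts train depends on `stage`."""
    set_trainable(encoder, stage == 2)      # encoder is only updated in stage 2
    set_trainable(adapter, stage in (1, 2)) # adapter is frozen in stage 3
    set_trainable(decoder, True)            # decoder trains in every stage
    params = [p for m in (encoder, adapter, decoder) for p in m.parameters()
              if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=1e-4)

    for _, images in zip(range(steps), loader):
        feats = encoder(images)
        recon = decoder(adapter(feats))
        loss = F.mse_loss(recon, images)    # reconstruction objective
        if stage == 2:
            # Assumed form of the semantic-preservation loss: keep the finetuned
            # encoder's features close to those of the frozen original encoder.
            with torch.no_grad():
                ref = frozen_ref(images)
            loss = loss + 0.5 * (1 - F.cosine_similarity(
                feats.flatten(1), ref.flatten(1)).mean())
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

if __name__ == "__main__":
    encoder, adapter, decoder = FoundationEncoder(), Adapter(), Decoder()
    frozen_ref = copy.deepcopy(encoder).eval()   # semantic-preservation target
    loader = [torch.randn(2, 3, 64, 64) for _ in range(4)]  # toy data
    for stage in (1, 2, 3):
        final = train_stage(stage, encoder, adapter, decoder, frozen_ref, loader)
        print(f"stage {stage}: final loss {final:.4f}")
```

The key design point the sketch illustrates is that the decoder and adapter first adapt to the frozen semantic features, so that when the encoder is later unfrozen it only needs to absorb perceptual detail while the semantic-preservation term anchors it to its original representation.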