Towards Scalable Pre-training of Visual Tokenizers for Generation
December 15, 2025
Authors: Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang
cs.AI
Abstract
The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not translate into higher-quality generation. This means that pouring extensive compute into visual tokenizer pre-training yields little improvement in generative performance. We identify this as the "pre-training scaling problem" and argue for a necessary shift: to serve generation effectively, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework that, for the first time, jointly optimizes image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) scaling properties improve substantially, with generative performance scaling effectively with the compute, parameters, and data allocated to visual tokenizer pre-training. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2% zero-shot accuracy and 0.36 rFID on ImageNet) and converges 4.1 times faster on generation than advanced distillation methods. More importantly, it scales effectively: without modifying the standard DiT training recipe, simply investing more FLOPS in VTP pre-training yields a 65.8% FID improvement in downstream generation, whereas a conventional autoencoder stagnates very early, at only 1/10 of the FLOPS. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.
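To make the training objective concrete, below is a minimal PyTorch sketch of jointly optimizing the three loss terms the abstract names (image-text contrastive, self-supervised, and reconstruction). The function name vtp_joint_loss, the loss weights, the negative-cosine self-supervised term, and the MSE reconstruction term are illustrative assumptions, not the paper's actual VTP formulation.

import torch
import torch.nn.functional as F

def vtp_joint_loss(img_emb, txt_emb, view_a, view_b, recon, pixels,
                   temperature=0.07, w_clip=1.0, w_ssl=1.0, w_recon=1.0):
    # Image-text contrastive term (CLIP-style symmetric InfoNCE).
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_clip = 0.5 * (F.cross_entropy(logits, labels) +
                       F.cross_entropy(logits.t(), labels))

    # Self-supervised term: negative cosine similarity between embeddings
    # of two augmented views of the same image (an illustrative choice).
    loss_ssl = 1.0 - F.cosine_similarity(view_a, view_b, dim=-1).mean()

    # Reconstruction term: pixel-level MSE between decoder output and input.
    loss_recon = F.mse_loss(recon, pixels)

    return w_clip * loss_clip + w_ssl * loss_ssl + w_recon * loss_recon

# Toy usage with random tensors standing in for encoder/decoder outputs.
B, D = 8, 512
loss = vtp_joint_loss(torch.randn(B, D), torch.randn(B, D),
                      torch.randn(B, D), torch.randn(B, D),
                      torch.randn(B, 3, 64, 64), torch.randn(B, 3, 64, 64))
print(loss.item())

The point of the sketch is only to show how a single scalar objective can weight semantic (contrastive, self-supervised) and pixel-level (reconstruction) signals during tokenizer pre-training; the actual architecture and weighting are described in the paper and released code.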