

Towards Scalable Pre-training of Visual Tokenizers for Generation

December 15, 2025
Authors: Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang
cs.AI

Abstract

The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space biased towards low-level information, leading to a fundamental flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly into improved generative performance. We identify this as the "pre-training scaling problem" and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework that pioneers the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) scaling properties improve markedly, with generative performance scaling effectively with the compute, parameters, and data allocated to pre-training the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2% zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying the standard DiT training setup, solely investing more FLOPS in pre-training VTP achieves a 65.8% FID improvement in downstream generation, while a conventional autoencoder stagnates early, at one-tenth of the FLOPS. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.
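
To make the joint objective concrete, below is a minimal sketch of how the three losses described in the abstract could be combined in a single training objective. The specific loss choices (MSE reconstruction, symmetric InfoNCE for the image-text contrastive term, EMA-teacher feature matching for the self-supervised term), the weights, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a VTP-style joint objective. Loss forms, weights, and
# names are assumptions for illustration; see the paper/repo for the real recipe.
import torch
import torch.nn.functional as F


def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def vtp_joint_loss(pixels, recon, img_emb, txt_emb, student_feat, teacher_feat,
                   w_rec=1.0, w_clip=1.0, w_ssl=1.0):
    """Weighted sum of reconstruction, image-text contrastive, and
    self-supervised terms, optimized jointly as in the VTP idea.
    The exact terms and weights here are illustrative assumptions."""
    rec = F.mse_loss(recon, pixels)                        # pixel-level reconstruction
    clip = image_text_contrastive_loss(img_emb, txt_emb)   # high-level semantic alignment
    ssl = F.mse_loss(student_feat, teacher_feat.detach())  # e.g. EMA-teacher feature target
    return w_rec * rec + w_clip * clip + w_ssl * ssl


if __name__ == "__main__":
    B, D = 8, 512
    pixels = torch.randn(B, 3, 256, 256)
    recon = torch.randn(B, 3, 256, 256)
    img_emb, txt_emb = torch.randn(B, D), torch.randn(B, D)
    student, teacher = torch.randn(B, D), torch.randn(B, D)
    print(vtp_joint_loss(pixels, recon, img_emb, txt_emb, student, teacher))
```

The point of the sketch is only the structure of the objective: the reconstruction term preserves pixel fidelity, while the contrastive and self-supervised terms push the latent space toward the concise high-level semantics the paper argues are needed for generation.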