Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation
July 11, 2025
作者: Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, Xiaojuan Qi
cs.AI
Abstract
Leveraging the powerful representations of pre-trained vision foundation models, traditionally used for visual comprehension, we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation, achieving a gFID of 2.07 on the ImageNet benchmark, accelerating model convergence threefold, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code will be released publicly to benefit the community.
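
The abstract describes the recipe only at a high level; since the code is not yet released, the following is a minimal, self-contained PyTorch sketch of that general idea: a frozen foundation-model encoder, vector quantization of its features, and a training objective that combines pixel reconstruction with a semantic term aligning the tokenizer's outputs to the frozen features. All class names, shapes, and loss weights are illustrative assumptions, not VFMTok's actual architecture, and the region-adaptive quantization is simplified here to a plain grid-level VQ.

```python
# Illustrative sketch only -- not the released VFMTok code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenVFMEncoder(nn.Module):
    """Stand-in for a pre-trained vision foundation model; kept frozen."""
    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify
        for p in self.parameters():
            p.requires_grad = False  # encoder stays frozen, as in the paper

    def forward(self, x):                       # x: (B, 3, H, W)
        f = self.proj(x)                        # (B, D, H/p, W/p) grid features
        return f.flatten(2).transpose(1, 2)     # (B, N, D) token features

class VectorQuantizer(nn.Module):
    """Plain nearest-neighbour VQ over grid tokens; the paper's region-adaptive
    scheme would instead merge redundant grid positions into region tokens."""
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                       # z: (B, N, D)
        w = self.codebook.weight                # (K, D)
        d = (z.pow(2).sum(-1, keepdim=True)     # squared L2 distances to codes
             - 2 * z @ w.t()
             + w.pow(2).sum(-1))
        idx = d.argmin(-1)                      # discrete token ids
        zq = self.codebook(idx)
        vq_loss = F.mse_loss(zq, z.detach())    # codebook update term
        zq_st = z + (zq - z).detach()           # straight-through estimator
        return zq_st, idx, vq_loss

class PixelDecoder(nn.Module):
    """Toy decoder mapping quantized tokens back to pixels."""
    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.deproj = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)

    def forward(self, z):                       # z: (B, N, D)
        B, N, D = z.shape
        h = int(N ** 0.5)
        return self.deproj(z.transpose(1, 2).reshape(B, D, h, h))

def tokenizer_loss(image, encoder, quantizer, pixel_dec, semantic_dec, w_sem=1.0):
    feats = encoder(image)                      # frozen VFM features (also targets)
    zq, _, vq_loss = quantizer(feats)
    pixel_loss = F.mse_loss(pixel_dec(zq), image)            # reconstruct pixels
    semantic_loss = F.mse_loss(semantic_dec(zq), feats.detach())  # align with VFM features
    return pixel_loss + w_sem * semantic_loss + vq_loss

# Toy usage with random data.
img = torch.randn(2, 3, 256, 256)
enc, vq, pix_dec = FrozenVFMEncoder(), VectorQuantizer(), PixelDecoder()
sem_dec = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
loss = tokenizer_loss(img, enc, vq, pix_dec, sem_dec)
loss.backward()
```

The key design point the sketch tries to convey is that the semantic term supervises the quantized tokens against the frozen foundation-model features themselves, so the discrete codes remain semantically faithful rather than being optimized for pixel fidelity alone.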