Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation
July 11, 2025
作者: Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, Xiaojuan Qi
cs.AI
Abstract
Leveraging the powerful representations of pre-trained vision foundation models, traditionally used for visual comprehension, we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation, achieving a gFID of 2.07 on the ImageNet benchmark, accelerating model convergence threefold, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code will be released publicly to benefit the community.
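
The abstract describes the recipe only at a high level; since the code is not yet released, the following is a minimal, self-contained PyTorch sketch of that general idea: a frozen foundation-model encoder, vector quantization of its features, and a training objective that combines pixel reconstruction with a semantic term aligning the tokenizer's outputs to the frozen features. All class names, shapes, and loss weights are illustrative assumptions, not VFMTok's actual architecture, and the region-adaptive quantization is simplified here to a plain grid-level VQ.

```python
# Illustrative sketch only -- not the released VFMTok code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenVFMEncoder(nn.Module):
    """Stand-in for a pre-trained vision foundation model; kept frozen."""
    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify
        for p in self.parameters():
            p.requires_grad = False  # encoder stays frozen, as in the paper

    def forward(self, x):                       # x: (B, 3, H, W)
        f = self.proj(x)                        # (B, D, H/p, W/p) grid features
        return f.flatten(2).transpose(1, 2)     # (B, N, D) token features

class VectorQuantizer(nn.Module):
    """Plain nearest-neighbour VQ over grid tokens; the paper's region-adaptive
    scheme would instead merge redundant grid positions into region tokens."""
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                       # z: (B, N, D)
        w = self.codebook.weight                # (K, D)
        d = (z.pow(2).sum(-1, keepdim=True)     # squared L2 distances to codes
             - 2 * z @ w.t()
             + w.pow(2).sum(-1))
        idx = d.argmin(-1)                      # discrete token ids
        zq = self.codebook(idx)
        vq_loss = F.mse_loss(zq, z.detach())    # codebook update term
        zq_st = z + (zq - z).detach()           # straight-through estimator
        return zq_st, idx, vq_loss

class PixelDecoder(nn.Module):
    """Toy decoder mapping quantized tokens back to pixels."""
    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.deproj = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)

    def forward(self, z):                       # z: (B, N, D)
        B, N, D = z.shape
        h = int(N ** 0.5)
        return self.deproj(z.transpose(1, 2).reshape(B, D, h, h))

def tokenizer_loss(image, encoder, quantizer, pixel_dec, semantic_dec, w_sem=1.0):
    feats = encoder(image)                      # frozen VFM features (also targets)
    zq, _, vq_loss = quantizer(feats)
    pixel_loss = F.mse_loss(pixel_dec(zq), image)            # reconstruct pixels
    semantic_loss = F.mse_loss(semantic_dec(zq), feats.detach())  # align with VFM features
    return pixel_loss + w_sem * semantic_loss + vq_loss

# Toy usage with random data.
img = torch.randn(2, 3, 256, 256)
enc, vq, pix_dec = FrozenVFMEncoder(), VectorQuantizer(), PixelDecoder()
sem_dec = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
loss = tokenizer_loss(img, enc, vq, pix_dec, sem_dec)
loss.backward()
```

The key design point the sketch tries to convey is that the semantic term supervises the quantized tokens against the frozen foundation-model features themselves, so the discrete codes remain semantically faithful rather than being optimized for pixel fidelity alone.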