Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation
July 11, 2025
Authors: Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, Xiaojuan Qi
cs.AI
Abstract
Leveraging the powerful representations of pre-trained vision foundation
models, traditionally used for visual comprehension, we explore a novel
direction: building an image tokenizer directly atop such models, a largely
underexplored area. Specifically, we employ a frozen vision foundation model as
the encoder of our tokenizer. To enhance its effectiveness, we introduce two
key components: (1) a region-adaptive quantization framework that reduces
redundancy in the pre-trained features on regular 2D grids, and (2) a semantic
reconstruction objective that aligns the tokenizer's outputs with the
foundation model's representations to preserve semantic fidelity. Based on
these designs, our proposed image tokenizer, VFMTok, achieves substantial
improvements in image reconstruction and generation quality, while also
enhancing token efficiency. It further boosts autoregressive (AR) generation,
achieving a gFID of 2.07 on ImageNet benchmarks, accelerating model
convergence by a factor of three, and enabling high-fidelity class-conditional
synthesis without the need for classifier-free guidance (CFG). The code will be
released publicly to benefit the community.
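
As a rough illustration of the two ingredients named in the abstract, the sketch below quantizes features from a (pretend) frozen encoder against a codebook and then scores how well the quantized features match the originals. It is a minimal sketch under stated assumptions, not the authors' implementation. The quantizer here is plain nearest-neighbor vector quantization rather than the region-adaptive scheme the abstract describes, and the cosine-based loss is an assumed form of the semantic reconstruction objective. All function names and the toy data are hypothetical.

```python
import math

def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry.

    Assumption: plain nearest-neighbor lookup under squared L2 distance,
    not the region-adaptive quantization described in the abstract.
    """
    indices = []
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices

def cosine_similarity(u, v):
    """Cosine similarity of two vectors (both assumed to be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_reconstruction_loss(predicted, target):
    """Mean (1 - cosine similarity) between predicted and target features.

    Assumption: the abstract does not state the loss form; cosine distance
    is used here only as a common choice for feature alignment.
    """
    sims = [cosine_similarity(p, t) for p, t in zip(predicted, target)]
    return 1.0 - sum(sims) / len(sims)

# Toy stand-ins for frozen-encoder features and a learned codebook (hypothetical values).
frozen_features = [[1.0, 0.0], [0.0, 2.0], [0.9, 0.1]]
codebook = [[1.0, 0.0], [0.0, 1.0]]

token_ids = quantize(frozen_features, codebook)       # discrete tokens for the AR model
reconstructed = [codebook[i] for i in token_ids]      # features recovered from the tokens
loss = semantic_reconstruction_loss(reconstructed, frozen_features)

print(token_ids)
print(round(loss, 4))
```

In a real tokenizer, the token indices would be the sequence an autoregressive model learns to predict, and a loss of this kind would be minimized jointly with a pixel-level reconstruction loss.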