VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
November 28, 2025
Authors: Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan
cs.AI
Abstract
Unifying representations for multimodal understanding, generation, and reconstruction in a single tokenizer remains a key challenge in building unified models. Previous research has predominantly attempted to address this with a dual-encoder paradigm, e.g., using separate encoders for understanding and generation, or balancing semantic representations against low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers a unified tokenizer that produces continuous semantic features for image understanding and discrete tokens for visual generation. Specifically, we build on pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: the first stage freezes the encoder and learns a high-dimensional semantic VQ codebook with a pixel-reconstruction objective; the second stage jointly optimizes the encoder under a self-distillation constraint. This design preserves semantic information with negligible loss, maintaining multimodal understanding ability, while yielding discrete tokens compatible with generation and fine-grained reconstruction. Moreover, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the common practice of low-dimensional codebooks in image reconstruction; the semantic VQ codebook achieves a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several visual understanding, generation, and reconstruction benchmarks, and its discrete tokens give it promising scaling properties in the autoregressive paradigm.
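The core mechanism the abstract describes, mapping continuous semantic features to their nearest entries in a high-dimensional codebook so that both a continuous view (for understanding) and discrete token ids (for generation) are available, can be sketched as a standard vector-quantization lookup. This is a minimal NumPy illustration, not the authors' implementation; the codebook size, token count, and function names are assumptions, with only the 1536 codebook dimension taken from the abstract.

```python
import numpy as np

def quantize(z, codebook):
    """Map each continuous feature in z (tokens, dim) to its nearest codebook row.

    Returns the quantized vectors (the discrete representation) and the
    integer token ids usable by an autoregressive generator.
    """
    # Squared Euclidean distance between every feature and every code entry.
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (tokens, codes)
    idx = d.argmin(axis=1)   # discrete token ids
    z_q = codebook[idx]      # quantized vectors, one codebook row per token
    return z_q, idx

rng = np.random.default_rng(0)
# Toy codebook: 16 entries of dimension 1536 (the dimension at which the
# paper reports 100% codebook utilization; 16 entries is just for the demo).
codebook = rng.standard_normal((16, 1536))
z = rng.standard_normal((4, 1536))  # 4 continuous semantic tokens
z_q, idx = quantize(z, codebook)
print(z_q.shape, idx.shape)  # (4, 1536) (4,)
```

In an actual two-stage setup like the one described, a straight-through estimator would typically carry gradients through this non-differentiable `argmin` so the encoder can be jointly optimized in stage two.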