

VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

November 28, 2025
作者: Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan
cs.AI

Abstract

Unifying multimodal understanding, generation, and reconstruction representations in a single tokenizer remains a key challenge in building unified models. Previous research predominantly addresses this with a dual-encoder paradigm, e.g., using separate encoders for understanding and generation, or balancing semantic representations against low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers a unified representation: continuous semantic features for image understanding and discrete tokens for visual generation within a single tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: the first stage freezes the encoder and learns a high-dimensional semantic VQ codebook with a pixel-reconstruction objective; the second stage jointly optimizes the encoder under a self-distillation constraint. This design preserves semantic information with negligible loss, maintaining multimodal understanding ability, while producing discrete tokens compatible with generation and fine-grained reconstruction. Moreover, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the common practice of low-dimensional codebooks in image reconstruction. The semantic VQ codebook achieves a 100% utilization ratio at a dimension of 1536. VQRAE delivers competitive performance on several visual understanding, generation, and reconstruction benchmarks, and its discrete nature gives it promising scaling properties in the autoregressive paradigm.
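To make the quantization step concrete, below is a minimal PyTorch sketch of a VQ layer with a high-dimensional (1536-d) codebook, standard VQ-VAE losses, and a straight-through estimator, in the spirit of what the abstract describes. The codebook size, loss weights, and the `utilization` helper are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticVQ(nn.Module):
    """Minimal sketch of a vector-quantization layer with a high-dimensional
    codebook. Hypothetical illustration of the quantizer the abstract
    describes; VQRAE's real codebook size, losses, and update rule may differ."""

    def __init__(self, num_codes: int = 8192, dim: int = 1536, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.normal_(self.codebook.weight, std=0.02)
        self.beta = beta  # commitment-loss weight (assumed value)

    def forward(self, z: torch.Tensor):
        # z: (batch, tokens, dim) continuous semantic features from the encoder
        flat = z.reshape(-1, z.shape[-1])
        # Nearest codebook entry by Euclidean distance
        dist = torch.cdist(flat, self.codebook.weight)
        idx = dist.argmin(dim=-1)
        z_q = self.codebook(idx).view_as(z)
        # Standard VQ-VAE objective: codebook loss + commitment loss
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through estimator: gradients bypass the discrete lookup
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss

def utilization(indices: torch.Tensor, num_codes: int) -> float:
    """Fraction of codebook entries selected at least once over a set of tokens."""
    return indices.unique().numel() / num_codes
```

The `utilization` helper corresponds to the metric behind the abstract's claim that the semantic VQ codebook reaches a 100% utilization ratio at dimension 1536; tracking it over an epoch is one common way to measure codebook collapse.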