VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
November 28, 2025
作者: Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan
cs.AI
Abstract
Unifying understanding, generation, and reconstruction representations in a single tokenizer remains a key challenge in building unified multimodal models. Previous research predominantly addresses this with a dual-encoder paradigm, e.g., using separate encoders for understanding and generation, or balancing semantic representations against low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, the first exploration of a unified tokenizer that produces continuous semantic features for image understanding and discrete tokens for visual generation. Specifically, we build on a pretrained vision foundation model with a symmetric ViT decoder and adopt a two-stage training strategy: the first stage freezes the encoder and learns a high-dimensional semantic VQ codebook with a pixel-reconstruction objective; the second stage jointly optimizes the encoder under a self-distillation constraint. This design preserves multimodal understanding with negligible loss of semantic information, while yielding discrete tokens suitable for generation and fine-grained reconstruction. In addition, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the common practice of low-dimensional codebooks in image reconstruction. Our semantic VQ codebook achieves a 100% utilization ratio at a dimension of 1536. VQRAE delivers competitive performance on several benchmarks of visual understanding, generation, and reconstruction, and its discrete tokens show promising scaling behavior in the autoregressive paradigm.
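To make the quantization step concrete, below is a minimal PyTorch sketch of a nearest-neighbor vector quantizer with a high-dimensional codebook, in the spirit of the abstract. Only the 1536-dimensional codebook comes from the paper; the class name `SemanticVQ`, the codebook size, the VQ-VAE-style commitment loss, and the straight-through estimator are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of semantic vector quantization with a high-dimensional
# codebook (1536-dim per the abstract). Everything except that dimension is
# an assumption for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticVQ(nn.Module):
    """Nearest-neighbor vector quantizer over semantic encoder features."""

    def __init__(self, num_codes: int = 8192, code_dim: int = 1536, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        nn.init.normal_(self.codebook.weight, std=0.02)
        self.beta = beta  # weight of the commitment term (assumed)

    def forward(self, z: torch.Tensor):
        # z: (batch, num_tokens, code_dim) continuous semantic features.
        flat = z.reshape(-1, z.shape[-1])
        # Squared L2 distance from each feature to every codebook entry.
        dist = (
            flat.pow(2).sum(1, keepdim=True)
            - 2 * flat @ self.codebook.weight.t()
            + self.codebook.weight.pow(2).sum(1)
        )
        indices = dist.argmin(dim=1)             # discrete tokens for generation
        z_q = self.codebook(indices).view_as(z)  # quantized continuous features
        # Standard VQ-VAE codebook + commitment losses (assumed objective).
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through estimator: gradients bypass the argmin so the
        # encoder can still be optimized in stage two.
        z_q = z + (z_q - z).detach()
        return z_q, indices, vq_loss


# Two-stage recipe described in the abstract, as pseudocode:
#   Stage 1: freeze the pretrained encoder; train SemanticVQ and the
#            symmetric ViT decoder with a pixel-reconstruction loss.
#   Stage 2: unfreeze the encoder and optimize it jointly, adding a
#            self-distillation constraint against the frozen features.
```

The high-dimensional codebook is the notable departure here: typical image-reconstruction tokenizers project to low-dimensional codes before the nearest-neighbor lookup, whereas the abstract reports that quantizing semantic features at 1536 dimensions still reaches 100% codebook utilization.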