

VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

November 28, 2025
Authors: Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan
cs.AI

Abstract

Unifying the representations for multimodal understanding, generation, and reconstruction in a single tokenizer remains a key challenge in building unified models. Previous research has predominantly attempted to address this within a dual-encoder paradigm, e.g., using separate encoders for understanding and generation, or balancing semantic representations against low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector-Quantized version of Representation AutoEncoders, the first exploration of a unified representation that produces continuous semantic features for image understanding and discrete tokens for visual generation within a single tokenizer. Specifically, we build on pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: the first stage freezes the encoder and learns a high-dimensional semantic VQ codebook with a pixel-reconstruction objective; the second stage jointly optimizes the encoder under a self-distillation constraint. This design maintains multimodal understanding ability with negligible loss of semantic information, while yielding discrete tokens that are compatible with generation and support fine-grained reconstruction. Moreover, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the common practice of low-dimensional codebooks in image reconstruction; our semantic VQ codebook achieves a 100% utilization ratio at a dimension of 1536. VQRAE achieves competitive performance on several benchmarks for visual understanding, generation, and reconstruction, and its discrete tokens show promising scaling behavior in the autoregressive paradigm.
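The abstract describes the pipeline concretely enough to sketch. Below is a minimal PyTorch sketch of the two ingredients it names: a high-dimensional semantic VQ codebook with a straight-through estimator, and the two-stage objective (stage 1: frozen encoder, pixel reconstruction; stage 2: joint encoder training with a self-distillation constraint). This is an illustration under assumptions, not the authors' implementation: the codebook size (16384), the commitment weight, the MSE form of the distillation term, and all names here are our guesses; only the 1536-dimensional codebook and the two-stage recipe come from the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticVQ(nn.Module):
    """Nearest-neighbour quantizer over a high-dimensional semantic codebook.

    Hypothetical sketch: the paper only states a 1536-d codebook reaching
    100% utilization; codebook size, init, and loss weights are assumptions.
    """
    def __init__(self, num_codes: int = 16384, dim: int = 1536, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.normal_(self.codebook.weight, std=dim ** -0.5)
        self.beta = beta  # commitment-loss weight

    def forward(self, z: torch.Tensor):
        # z: (B, N, D) continuous semantic features from the vision encoder
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1))
        idx = dists.argmin(-1)                 # discrete token ids per patch
        z_q = self.codebook(idx)
        # standard VQ-VAE codebook + commitment losses
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()           # straight-through estimator
        return z_q, idx, vq_loss

def train_step(encoder, decoder, quantizer, teacher, images, stage: int):
    """One loss computation for either training stage (illustrative only)."""
    if stage == 1:
        with torch.no_grad():                  # stage 1: encoder stays frozen
            z = encoder(images)
    else:
        z = encoder(images)                    # stage 2: encoder is trainable
    z_q, _, vq_loss = quantizer(z)
    recon = decoder(z_q)                       # symmetric ViT decoder
    loss = F.mse_loss(recon, images) + vq_loss # pixel-reconstruction objective
    if stage == 2:
        with torch.no_grad():
            z_t = teacher(images)              # frozen copy of the pretrained encoder
        loss = loss + F.mse_loss(z, z_t)       # self-distillation constraint
    return loss
```

Note the design choice the abstract highlights: the codebook lives at the encoder's full feature width (1536) rather than being projected down, which departs from the low-dimensional codebooks typical of reconstruction-oriented image tokenizers.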