VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
November 28, 2025
作者: Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan
cs.AI
Abstract
Unifying understanding, generation, and reconstruction representations in a single tokenizer remains a key challenge in building unified multimodal models. Previous research predominantly addresses this with a dual-encoder paradigm, e.g., using separate encoders for understanding and generation, or balancing semantic representations against low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, the first exploration of a unified tokenizer that produces continuous semantic features for image understanding and discrete tokens for visual generation. Specifically, we build on a pretrained vision foundation model with a symmetric ViT decoder and adopt a two-stage training strategy: the first stage freezes the encoder and learns a high-dimensional semantic VQ codebook with a pixel-reconstruction objective; the second stage jointly optimizes the encoder under a self-distillation constraint. This design preserves multimodal understanding with negligible loss of semantic information, while yielding discrete tokens suitable for generation and fine-grained reconstruction. In addition, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the common practice of low-dimensional codebooks in image reconstruction. Our semantic VQ codebook achieves a 100% utilization ratio at a dimension of 1536. VQRAE delivers competitive performance on several benchmarks of visual understanding, generation, and reconstruction, and its discrete tokens show promising scaling behavior in the autoregressive paradigm.
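To make the quantization step concrete, below is a minimal PyTorch sketch of a nearest-neighbor vector quantizer with a high-dimensional codebook, in the spirit of the abstract. Only the 1536-dimensional codebook comes from the paper; the class name `SemanticVQ`, the codebook size, the VQ-VAE-style commitment loss, and the straight-through estimator are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of semantic vector quantization with a high-dimensional
# codebook (1536-dim per the abstract). Everything except that dimension is
# an assumption for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticVQ(nn.Module):
    """Nearest-neighbor vector quantizer over semantic encoder features."""

    def __init__(self, num_codes: int = 8192, code_dim: int = 1536, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        nn.init.normal_(self.codebook.weight, std=0.02)
        self.beta = beta  # weight of the commitment term (assumed)

    def forward(self, z: torch.Tensor):
        # z: (batch, num_tokens, code_dim) continuous semantic features.
        flat = z.reshape(-1, z.shape[-1])
        # Squared L2 distance from each feature to every codebook entry.
        dist = (
            flat.pow(2).sum(1, keepdim=True)
            - 2 * flat @ self.codebook.weight.t()
            + self.codebook.weight.pow(2).sum(1)
        )
        indices = dist.argmin(dim=1)             # discrete tokens for generation
        z_q = self.codebook(indices).view_as(z)  # quantized continuous features
        # Standard VQ-VAE codebook + commitment losses (assumed objective).
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through estimator: gradients bypass the argmin so the
        # encoder can still be optimized in stage two.
        z_q = z + (z_q - z).detach()
        return z_q, indices, vq_loss


# Two-stage recipe described in the abstract, as pseudocode:
#   Stage 1: freeze the pretrained encoder; train SemanticVQ and the
#            symmetric ViT decoder with a pixel-reconstruction loss.
#   Stage 2: unfreeze the encoder and optimize it jointly, adding a
#            self-distillation constraint against the frozen features.
```

The high-dimensional codebook is the notable departure here: typical image-reconstruction tokenizers project to low-dimensional codes before the nearest-neighbor lookup, whereas the abstract reports that quantizing semantic features at 1536 dimensions still reaches 100% codebook utilization.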