VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

Abstract

Unifying the representations for multimodal understanding, generation, and reconstruction in a single tokenizer remains a key challenge in building unified models. Previous research predominantly addresses this within a dual-encoder paradigm, e.g., using separate encoders for understanding and generation, or balancing semantic representations and low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the exploration of a unified representation that produces continuous semantic features for image understanding and discrete tokens for visual generation within a single tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: the first stage freezes the encoder and learns a high-dimensional semantic VQ codebook with a pixel reconstruction objective; the second stage jointly optimizes the encoder with self-distillation constraints. This design incurs negligible loss of semantic information, preserving multimodal understanding ability, while yielding discrete tokens that are suitable for generation and fine-grained reconstruction. In addition, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the common practice of using low-dimensional codebooks for image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE achieves competitive performance on several benchmarks for visual understanding, generation, and reconstruction, and shows promising scaling properties in the autoregressive paradigm owing to its discrete nature.
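To make the quantization step and the codebook "utilization ratio" concrete, the following is a minimal NumPy sketch of nearest-neighbor vector quantization. It is an illustration under stated assumptions, not the authors' implementation: the function names `quantize` and `utilization`, the codebook size of 64, and the random stand-in features are all hypothetical. Only the dimension of 1536 comes from the abstract.

```python
import numpy as np


def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (L2 distance).

    features: (n, d) continuous encoder outputs
    codebook: (k, d) code vectors
    Returns (indices, quantized) with shapes (n,) and (n, d).
    """
    # Squared distances via ||f||^2 - 2 f.c + ||c||^2, shape (n, k).
    d2 = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    indices = d2.argmin(axis=1)
    return indices, codebook[indices]


def utilization(indices, codebook_size):
    """Fraction of codebook entries that were selected at least once."""
    return np.unique(indices).size / codebook_size


rng = np.random.default_rng(0)
dim = 1536          # codebook dimension quoted in the abstract
codebook_size = 64  # hypothetical; kept small for illustration
codebook = rng.normal(size=(codebook_size, dim))
features = rng.normal(size=(256, dim))  # stand-in for encoder features

indices, quantized = quantize(features, codebook)
print(quantized.shape, utilization(indices, codebook_size))
```

The integer `indices` play the role of the discrete tokens, while `features` stand in for the continuous semantic features. A utilization of 1.0 means every code was selected at least once on the data being measured.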
