MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

Abstract

Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to balance generation quality against representation learning and efficiency within a shared latent space. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. During pre-training, MergeVQ uses a token merge module after the self-attention blocks in the encoder to decouple top-k semantics from the latent space for subsequent Look-up Free Quantization (LFQ) and global alignment, then recovers fine-grained details through cross-attention in the decoder for reconstruction. For second-stage generation, we introduce MergeAR, which performs KV Cache compression for efficient raster-order prediction. Extensive experiments on ImageNet verify that MergeVQ, as an AR generative model, achieves competitive performance in both visual representation learning and image generation tasks while maintaining favorable token efficiency and inference speed. The code and model will be available at https://apexgen-x.github.io/MergeVQ.
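
The Look-up Free Quantization (LFQ) step named in the abstract can be illustrated with a minimal sketch. It assumes the commonly described form of LFQ, in which each latent channel is binarized by its sign and the resulting bits are read as a token index, so no codebook lookup is needed. The function names `lfq_quantize` and `lfq_decode` are hypothetical and chosen for illustration; this is not the MergeVQ implementation.

```python
from typing import List, Tuple


def lfq_quantize(latent: List[float]) -> Tuple[List[int], int]:
    """Quantize one latent vector by the sign of each channel.

    Assumption: sign-based binarization, as LFQ is commonly described.
    Returns the code in {-1, +1} and the token index implied by the bits.
    """
    code = [1 if value > 0 else -1 for value in latent]
    # Channel i contributes 2**i to the index when its code is +1.
    index = sum(1 << i for i, c in enumerate(code) if c > 0)
    return code, index


def lfq_decode(index: int, dim: int) -> List[int]:
    """Recover the {-1, +1} code from a token index without a codebook."""
    return [1 if (index >> i) & 1 else -1 for i in range(dim)]


if __name__ == "__main__":
    code, index = lfq_quantize([0.3, -1.2, 0.7, -0.1])
    print(code, index)           # [1, -1, 1, -1] 5
    print(lfq_decode(index, 4))  # [1, -1, 1, -1]
```

With `dim` channels the implicit vocabulary has `2**dim` entries, which is why this style of quantizer can grow the vocabulary without storing or searching an embedding table.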
