QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

Abstract

We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
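The abstract names binary spherical quantization as the bottleneck of the autoencoder but does not spell it out. As a rough illustration only, the sketch below assumes the commonly described form of that quantizer: a latent vector is normalized onto the unit hypersphere and each coordinate is snapped to ±1/sqrt(d), so the code is a corner of a hypercube that also lies on the sphere, and the sign pattern doubles as the token index. The function name `bsq_quantize` is hypothetical, and the sketch leaves out everything needed for training (the encoder and decoder, the gradient estimator, and the loss balancing mentioned above).

```python
import math


def bsq_quantize(v):
    """Illustrative binary spherical quantization of one latent vector.

    Assumption: this follows the generic description of the technique,
    not necessarily the exact procedure used in QLIP.
    """
    d = len(v)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")

    # Project the latent onto the unit hypersphere.
    u = [x / norm for x in v]

    # One bit per dimension: the sign of each coordinate.
    bits = [1 if x > 0 else 0 for x in u]

    # Snap to the nearest hypercube corner scaled to unit norm.
    scale = 1.0 / math.sqrt(d)
    code = [scale if b else -scale for b in bits]

    # Pack the bits (most significant first) into an integer token id.
    token = 0
    for b in bits:
        token = (token << 1) | b
    return code, token


if __name__ == "__main__":
    code, token = bsq_quantize([3.0, -4.0])
    print(code, token)
```

With d latent dimensions this yields an implicit codebook of 2^d entries without storing any codebook vectors, which is the usual motivation for this family of quantizers.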
