QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

Abstract

We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder in LLaVA and the image tokenizer in LlamaGen, with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
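To make the quantization step concrete, here is a minimal sketch of binary spherical quantization as it is commonly described: the latent vector is projected onto the unit hypersphere, and each coordinate is then snapped to ±1/√d, so the sign bits form a discrete token. The function name `bsq_quantize`, the bit ordering, and the handling of zero entries are assumptions made for illustration. This is not QLIP's implementation, and it leaves out everything needed for training (gradient estimation through the sign, the reconstruction and alignment losses, and their dynamic balancing).

```python
import math


def bsq_quantize(u):
    """Quantize a latent vector with binary spherical quantization (sketch).

    Returns (code, index): `code` is a unit-norm vector whose entries are
    +1/sqrt(d) or -1/sqrt(d); `index` is the integer token id built from
    the sign bits (first coordinate = most significant bit, an assumption).
    """
    d = len(u)
    norm = math.sqrt(sum(x * x for x in u))
    if d == 0 or norm == 0.0:
        raise ValueError("latent vector must be non-empty and non-zero")

    # Project onto the unit hypersphere.
    v = [x / norm for x in u]

    # Binarize each coordinate; zeros are mapped to the negative side
    # (an arbitrary tie-breaking choice for this sketch).
    bits = [1 if x > 0 else 0 for x in v]

    # Quantized code: a corner of the hypercube scaled to lie on the sphere.
    scale = 1.0 / math.sqrt(d)
    code = [scale if b else -scale for b in bits]

    # Pack the sign bits into a single token id in [0, 2**d).
    index = 0
    for b in bits:
        index = (index << 1) | b
    return code, index


if __name__ == "__main__":
    code, index = bsq_quantize([3.0, -4.0, 0.5, -0.1])
    print(code, index)
```

With a d-dimensional latent, the implicit codebook has 2^d entries and needs no stored embedding table, which is one reason this style of quantizer suits a tokenizer that must serve both reconstruction and auto-regressive generation.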
