

QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

February 7, 2025
作者: Yue Zhao, Fuzhao Xue, Scott Reed, Linxi Fan, Yuke Zhu, Jan Kautz, Zhiding Yu, Philipp Krähenbühl, De-An Huang
cs.AI

Abstract

We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
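
To make the training recipe concrete, below is a minimal sketch of the dual-objective step the abstract describes, in PyTorch-style code. Everything here is an illustrative assumption rather than the paper's implementation: the names (BSQQuantizer, qlip_loss, alpha_r, alpha_a), the MSE reconstruction term, the CLIP-style contrastive alignment term with a fixed temperature, and the (batch, tokens, dim) latent layout are all hypothetical simplifications of QLIP's actual architecture and dynamic loss balancing.

    # Hypothetical sketch of a QLIP-style training objective, NOT the paper's code.
    import torch
    import torch.nn.functional as F

    class BSQQuantizer(torch.nn.Module):
        """Binary spherical quantization (sketch): L2-normalize the latent,
        then snap each dimension to +/- 1/sqrt(d), with a straight-through
        gradient so the encoder still receives useful updates."""
        def forward(self, z):
            z = F.normalize(z, dim=-1)                  # project onto the unit sphere
            z_q = torch.sign(z) / (z.shape[-1] ** 0.5)  # nearest scaled-hypercube corner
            return z + (z_q - z).detach()               # straight-through estimator

    def qlip_loss(encoder, decoder, text_encoder, quantizer,
                  images, texts, alpha_r=1.0, alpha_a=1.0):
        z = encoder(images)        # assumed shape: (batch, tokens, dim)
        z_q = quantizer(z)         # binary spherical codes
        recon = decoder(z_q)
        loss_recon = F.mse_loss(recon, images)          # reconstruction objective

        # CLIP-style contrastive alignment between pooled image and text features
        img_emb = F.normalize(z_q.mean(dim=1), dim=-1)
        txt_emb = F.normalize(text_encoder(texts), dim=-1)
        logits = img_emb @ txt_emb.t() / 0.07           # fixed temperature (assumption)
        targets = torch.arange(len(images), device=logits.device)
        loss_align = (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets)) / 2

        # In the paper, these weights are re-balanced dynamically during training.
        return alpha_r * loss_recon + alpha_a * loss_align

Per the abstract, the two weights are balanced dynamically rather than held fixed, and a two-stage pipeline separates the large-batch language-image alignment phase from the memory-heavy reconstruction phase; the sketch above collapses both stages into a single step for brevity.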

