Diffusion Feedback Helps CLIP See Better
July 29, 2024
Authors: Wenxuan Wang, Quan Sun, Fan Zhang, Yepeng Tang, Jing Liu, Xinlong Wang
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting
open-world representations across domains and modalities, has become a
foundation for a variety of vision and multimodal tasks. However, recent
studies reveal that CLIP has severe visual shortcomings: it can hardly
distinguish orientation, quantity, color, structure, etc. These visual
shortcomings also limit the perception capabilities of multimodal large
language models (MLLMs) built on CLIP. The main reason could be that the
image-text pairs used to train CLIP are inherently biased, owing to the limited
distinctiveness of the text and the limited diversity of the images. In this work, we
present a simple post-training approach for CLIP models that largely
overcomes their visual shortcomings via a self-supervised diffusion process. We
introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP.
Specifically, DIVA leverages generative feedback from text-to-image diffusion
models to optimize CLIP representations, using only images (without
corresponding text). We demonstrate that DIVA improves CLIP's performance on
the challenging MMVP-VLM benchmark, which assesses fine-grained visual
abilities, by a large margin (e.g., 3-7%), and enhances the performance of MLLMs and
vision models on multimodal understanding and segmentation tasks. Extensive
evaluation on 29 image classification and retrieval benchmarks confirms that
our framework preserves CLIP's strong zero-shot capabilities. The code will be
available at https://github.com/baaivision/DIVA.
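The core idea described above — optimizing a CLIP visual encoder with generative feedback from a frozen diffusion model, using images only — can be sketched as a training loop. The snippet below is a minimal toy illustration, not the paper's implementation: the real method uses a pretrained CLIP encoder and a text-to-image diffusion model (and a specific conditioning scheme), whereas here both networks, the dimensions, and the single fixed noise level `alpha_bar` are stand-ins chosen for brevity. The key mechanics it does show are (a) the denoiser is frozen, (b) the noise-prediction loss is conditioned on the CLIP embedding, and (c) gradients therefore flow only into the CLIP encoder.

```python
# Toy sketch of DIVA-style diffusion feedback (hypothetical stand-in modules).
import torch
import torch.nn as nn

torch.manual_seed(0)
IMG_DIM, EMB_DIM = 64, 16

# Stand-in "CLIP" visual encoder: the only trainable component.
clip_visual = nn.Linear(IMG_DIM, EMB_DIM)

# Stand-in frozen denoiser: predicts the noise added to an image,
# conditioned on the CLIP embedding (concatenated to its input).
denoiser = nn.Sequential(
    nn.Linear(IMG_DIM + EMB_DIM, 128), nn.ReLU(),
    nn.Linear(128, IMG_DIM),
)
for p in denoiser.parameters():
    p.requires_grad_(False)  # the diffusion model stays fixed

opt = torch.optim.SGD(clip_visual.parameters(), lr=1e-2)

def diva_step(images, alpha_bar=0.7):
    """One self-supervised feedback step: images only, no text."""
    emb = clip_visual(images)                      # CLIP representation
    noise = torch.randn_like(images)               # forward-diffusion noise
    noisy = alpha_bar ** 0.5 * images + (1 - alpha_bar) ** 0.5 * noise
    pred = denoiser(torch.cat([noisy, emb], dim=-1))
    loss = nn.functional.mse_loss(pred, noise)     # generative feedback signal
    opt.zero_grad()
    loss.backward()                                # gradients reach CLIP only
    opt.step()
    return loss.item()

images = torch.randn(8, IMG_DIM)                   # dummy image batch
losses = [diva_step(images) for _ in range(50)]
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

In the actual framework, this feedback phase is a lightweight post-training stage, after which the adapted CLIP weights are dropped back into downstream MLLMs or segmentation models unchanged.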