Diffusion Feedback Helps CLIP See Better
July 29, 2024
Authors: Wenxuan Wang, Quan Sun, Fan Zhang, Yepeng Tang, Jing Liu, Xinlong Wang
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting
open-world representations across domains and modalities, has become a
foundation for a variety of vision and multimodal tasks. However, recent
studies reveal that CLIP has severe visual shortcomings: it can hardly
distinguish orientation, quantity, color, structure, etc. These visual
shortcomings also limit the perception capabilities of multimodal large
language models (MLLMs) built on CLIP. The main reason could be that the
image-text pairs used to train CLIP are inherently biased, lacking
distinctiveness in the texts and diversity in the images. In this work, we
present a simple post-training approach for CLIP models that largely
overcomes their visual shortcomings via a self-supervised diffusion process.
We introduce DIVA, which uses the DIffusion model as a Visual Assistant for
CLIP. Specifically, DIVA leverages generative feedback from text-to-image
diffusion models to optimize CLIP representations, using only images (without
corresponding text). We demonstrate that DIVA improves CLIP's performance by a
large margin (e.g., 3-7%) on the challenging MMVP-VLM benchmark, which
assesses fine-grained visual abilities, and enhances the performance of MLLMs
and vision models on multimodal understanding and segmentation tasks.
Extensive evaluation on 29 image classification and retrieval benchmarks
confirms that our framework preserves CLIP's strong zero-shot capabilities.
The code will be available at https://github.com/baaivision/DIVA.
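
To make the described mechanism concrete, here is a minimal sketch of one post-training step, assuming a frozen text-to-image diffusion model whose conditioning interface (`add_noise`, `predict_noise`) and all other names are hypothetical; this is an illustration of the generative-feedback idea, not DIVA's actual implementation, which will live at the linked repository.

```python
import torch
import torch.nn.functional as F

def diva_step(clip_visual, diffusion, images, optimizer, num_timesteps=1000):
    """One post-training step (sketch): the denoising loss of a frozen
    text-to-image diffusion model, conditioned on CLIP visual features,
    serves as generative feedback to update CLIP's visual encoder,
    using images only (no text)."""
    # Encode images with the trainable CLIP visual encoder; these features
    # stand in for the usual text condition of the diffusion model.
    cond = clip_visual(images)                       # [B, D] embeddings

    # Standard diffusion objective: corrupt the input at a random timestep
    # and ask the frozen diffusion model to predict the added noise.
    # (Pixel space for simplicity; a latent-diffusion model would first
    # encode images with its VAE.)
    noise = torch.randn_like(images)
    t = torch.randint(0, num_timesteps, (images.shape[0],),
                      device=images.device)
    noisy = diffusion.add_noise(images, noise, t)    # assumed scheduler API
    pred = diffusion.predict_noise(noisy, t, cond)   # assumed conditional call

    # The denoising error is the feedback signal; only the CLIP encoder's
    # parameters are registered in `optimizer`, so the diffusion model
    # itself stays frozen.
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Since gradients flow through the condition rather than the diffusion weights, the loss decreases only when the CLIP embeddings carry the fine-grained visual detail (orientation, quantity, color, structure) that the generator needs, which is the intuition behind using diffusion feedback as a visual assistant.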