un^2CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP
May 30, 2025
Authors: Yinqi Li, Jiahe Zhao, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences in images and shows suboptimal performance on dense-prediction and vision-centric multimodal tasks. Therefore, this work focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible. We find that a specific type of generative model, unCLIP, provides a suitable framework for achieving our goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding; in other words, it inverts the CLIP image encoder. Compared to discriminative models like CLIP, generative models are better at capturing image details because they are trained to learn the data distribution of images. Additionally, the conditional input space of unCLIP aligns with CLIP's original image-text embedding space. Therefore, we propose to invert unCLIP (dubbed un^2CLIP) to improve the CLIP model. In this way, the improved image encoder gains unCLIP's visual detail capturing ability while preserving its alignment with the original text encoder. We evaluate our improved CLIP across various tasks to which CLIP has been applied, including the challenging MMVP-VLM benchmark, the dense-prediction open-vocabulary segmentation task, and multimodal large language model tasks. Experiments show that un^2CLIP significantly outperforms the original CLIP and previous CLIP improvement methods. Code and models will be available at https://github.com/LiYinqi/un2CLIP.
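To make the idea above concrete, below is a minimal, illustrative PyTorch-style sketch of how inverting unCLIP to fine-tune a CLIP image encoder could look. The component names (`clip_image_encoder`, `unclip_generator`, `noise_schedule`) and the diffusion denoising objective are assumptions introduced purely for illustration; the abstract does not specify the actual training objective, so this is a sketch of the general recipe, not the paper's implementation.

```python
# Illustrative sketch only: hypothetical components, not the paper's actual code.
import torch
import torch.nn.functional as F


def un2clip_step(clip_image_encoder, unclip_generator, noise_schedule, images, timesteps):
    """One hypothetical fine-tuning step: update the CLIP image encoder so that
    its embedding lets a frozen unCLIP-style generator reconstruct the image."""
    # Trainable component: the CLIP image encoder produces the conditioning embedding.
    image_emb = clip_image_encoder(images)

    # Corrupt the images as in standard diffusion training (assumed objective).
    noise = torch.randn_like(images)
    noisy_images = noise_schedule.add_noise(images, noise, timesteps)

    # Frozen unCLIP generator (parameters set to requires_grad=False beforehand)
    # predicts the noise conditioned on the CLIP image embedding; gradients flow
    # only through the embedding, and hence into the encoder.
    pred_noise = unclip_generator(noisy_images, timesteps, cond=image_emb)

    # Minimizing the denoising error pushes the embedding to carry the visual
    # details the generator needs -- the intuition behind "inverting unCLIP".
    return F.mse_loss(pred_noise, noise)
```

Since the unCLIP generator was trained on CLIP's own image-text embedding space, fine-tuning the image encoder within that space while keeping the generator and the text encoder frozen is what would let it stay aligned with the original text encoder; the paper may use additional objectives or constraints that this sketch omits.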