un^2CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP
May 30, 2025
Authors: Yinqi Li, Jiahe Zhao, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences in images and shows suboptimal performance on dense-prediction and vision-centric multimodal tasks. Therefore, this work focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible. We find that a specific type of generative model, unCLIP, provides a suitable framework for achieving our goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding; in other words, it inverts the CLIP image encoder. Compared to discriminative models like CLIP, generative models are better at capturing image details because they are trained to learn the data distribution of images. Additionally, the conditional input space of unCLIP aligns with CLIP's original image-text embedding space. Therefore, we propose to invert unCLIP (dubbed un^2CLIP) to improve the CLIP model. In this way, the improved image encoder gains unCLIP's visual detail capturing ability while preserving its alignment with the original text encoder. We evaluate our improved CLIP across various tasks to which CLIP has been applied, including the challenging MMVP-VLM benchmark, the dense-prediction open-vocabulary segmentation task, and multimodal large language model tasks. Experiments show that un^2CLIP significantly outperforms the original CLIP and previous CLIP improvement methods. Code and models will be available at https://github.com/LiYinqi/un2CLIP.
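To make the idea above concrete, below is a minimal, illustrative PyTorch-style sketch of how inverting unCLIP to fine-tune a CLIP image encoder could look. The component names (`clip_image_encoder`, `unclip_generator`, `noise_schedule`) and the diffusion denoising objective are assumptions introduced purely for illustration; the abstract does not specify the actual training objective, so this is a sketch of the general recipe, not the paper's implementation.

```python
# Illustrative sketch only: hypothetical components, not the paper's actual code.
import torch
import torch.nn.functional as F


def un2clip_step(clip_image_encoder, unclip_generator, noise_schedule, images, timesteps):
    """One hypothetical fine-tuning step: update the CLIP image encoder so that
    its embedding lets a frozen unCLIP-style generator reconstruct the image."""
    # Trainable component: the CLIP image encoder produces the conditioning embedding.
    image_emb = clip_image_encoder(images)

    # Corrupt the images as in standard diffusion training (assumed objective).
    noise = torch.randn_like(images)
    noisy_images = noise_schedule.add_noise(images, noise, timesteps)

    # Frozen unCLIP generator (parameters set to requires_grad=False beforehand)
    # predicts the noise conditioned on the CLIP image embedding; gradients flow
    # only through the embedding, and hence into the encoder.
    pred_noise = unclip_generator(noisy_images, timesteps, cond=image_emb)

    # Minimizing the denoising error pushes the embedding to carry the visual
    # details the generator needs -- the intuition behind "inverting unCLIP".
    return F.mse_loss(pred_noise, noise)
```

Since the unCLIP generator was trained on CLIP's own image-text embedding space, fine-tuning the image encoder within that space while keeping the generator and the text encoder frozen is what would let it stay aligned with the original text encoder; the paper may use additional objectives or constraints that this sketch omits.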