un^2CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP
May 30, 2025
Authors: Yinqi Li, Jiahe Zhao, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP) has become a foundation model
and has been applied to various vision and multimodal tasks. However, recent
works indicate that CLIP falls short in distinguishing detailed differences in
images and shows suboptimal performance on dense-prediction and vision-centric
multimodal tasks. Therefore, this work focuses on improving existing CLIP
models, aiming to capture as many visual details in images as possible. We find
that a specific type of generative model, unCLIP, provides a suitable
framework for achieving our goal. Specifically, unCLIP trains an image
generator conditioned on the CLIP image embedding. In other words, it inverts
the CLIP image encoder. Compared to discriminative models like CLIP, generative
models are better at capturing image details because they are trained to learn
the data distribution of images. Additionally, the conditional input space of
unCLIP aligns with CLIP's original image-text embedding space. Therefore, we
propose to invert unCLIP (dubbed un^2CLIP) to improve the CLIP model. In this
way, the improved image encoder gains unCLIP's visual-detail-capturing
ability while preserving its alignment with the original text encoder.
We evaluate our improved CLIP across various tasks to which
CLIP has been applied, including the challenging MMVP-VLM benchmark, the
dense-prediction open-vocabulary segmentation task, and multimodal large
language model tasks. Experiments show that un^2CLIP significantly improves
the original CLIP and previous CLIP improvement methods. Code and models will
be available at https://github.com/LiYinqi/un2CLIP.
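
Below is a minimal, hypothetical sketch of the training idea the abstract describes: fine-tune the CLIP image encoder through a frozen unCLIP-style generator so that its embeddings carry enough information to reconstruct the input image, while an alignment term keeps the new embeddings close to those of the frozen original encoder (and hence compatible with the original text encoder). The tiny encoder and generator modules, the pixel-space MSE reconstruction loss standing in for unCLIP's diffusion objective, and the loss weight lam are all illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyImageEncoder(nn.Module):
    """Stand-in for a CLIP image encoder (illustrative only)."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=4), nn.GELU(),
            nn.Conv2d(32, 64, 4, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
    def forward(self, x):
        return F.normalize(self.backbone(x), dim=-1)

class TinyGenerator(nn.Module):
    """Stand-in for a frozen unCLIP generator conditioned on a CLIP image embedding
    (the real unCLIP generator is diffusion-based; a linear decoder is used here
    only to keep the sketch self-contained)."""
    def __init__(self, embed_dim=512, image_size=64):
        super().__init__()
        self.image_size = image_size
        self.decode = nn.Linear(embed_dim, 3 * image_size * image_size)
    def forward(self, z):
        out = self.decode(z)
        return out.view(-1, 3, self.image_size, self.image_size)

def un2clip_step(images, encoder, frozen_encoder, frozen_generator, optimizer, lam=1.0):
    """One illustrative training step: reconstruction through the frozen generator
    provides the detail-capturing signal; an embedding-alignment penalty keeps the
    trainable encoder close to the frozen original CLIP encoder."""
    z_new = encoder(images)                      # trainable image encoder
    with torch.no_grad():
        z_old = frozen_encoder(images)           # frozen original CLIP encoder
    recon = frozen_generator(z_new)              # frozen generator "inverts" the embedding
    loss_recon = F.mse_loss(recon, images)       # generative reconstruction objective (stand-in)
    loss_align = 1.0 - F.cosine_similarity(z_new, z_old, dim=-1).mean()  # stay in CLIP space
    loss = loss_recon + lam * loss_align
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    enc = TinyImageEncoder()
    frozen_enc = TinyImageEncoder().eval()
    gen = TinyGenerator().eval()
    for p in list(frozen_enc.parameters()) + list(gen.parameters()):
        p.requires_grad_(False)
    opt = torch.optim.AdamW(enc.parameters(), lr=1e-4)
    dummy = torch.rand(4, 3, 64, 64)
    print(un2clip_step(dummy, enc, frozen_enc, gen, opt))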