un^2CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Abstract

Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to a wide range of vision and multimodal tasks. However, recent work indicates that CLIP falls short in distinguishing fine-grained differences between images and performs suboptimally on dense-prediction and vision-centric multimodal tasks. This work therefore focuses on improving existing CLIP models so that they capture as many visual details in images as possible. We find that a specific type of generative model, unCLIP, provides a suitable framework for this goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding; in other words, it inverts the CLIP image encoder. Compared to discriminative models like CLIP, generative models are better at capturing image details because they are trained to learn the data distribution of images. In addition, the conditional input space of unCLIP is aligned with CLIP's original image-text embedding space. We therefore propose to invert unCLIP (dubbed un^2CLIP) to improve the CLIP model. In this way, the improved image encoder gains unCLIP's ability to capture visual detail while preserving its alignment with the original text encoder. We evaluate the improved CLIP across various tasks to which CLIP has been applied, including the challenging MMVP-VLM benchmark, the dense-prediction open-vocabulary segmentation task, and multimodal large language model tasks. Experiments show that un^2CLIP significantly improves over the original CLIP and previous CLIP improvement methods. Code and models will be available at https://github.com/LiYinqi/un2CLIP.
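
The two-stage idea in the abstract can be written as a pair of objectives. This is only a sketch under assumptions the abstract does not state: that the generator is trained with a standard noise-prediction diffusion loss, and that the generator is held fixed while the image encoder is updated. The symbols $E_\phi$ (the CLIP image encoder), $\epsilon_\theta$ (the generator's denoising network), $x_t$ (the image $x$ noised to step $t$), and $\epsilon$ (the added noise) are notation introduced here for illustration.

```latex
% unCLIP (assumed diffusion form): train the generator with the encoder fixed
\theta^{*} = \arg\min_{\theta}\; \mathbb{E}_{x,\,\epsilon,\,t}\,
  \bigl\| \epsilon - \epsilon_{\theta}\bigl(x_t,\, t,\, E_{\phi}(x)\bigr) \bigr\|_2^2

% un^2CLIP (assumed): same loss, generator fixed, update the image encoder
\phi^{*} = \arg\min_{\phi}\; \mathbb{E}_{x,\,\epsilon,\,t}\,
  \bigl\| \epsilon - \epsilon_{\theta^{*}}\bigl(x_t,\, t,\, E_{\phi}(x)\bigr) \bigr\|_2^2
```

Under this reading, the embedding $E_\phi(x)$ must carry enough detail for the generator to reconstruct $x$. Keeping the generator fixed ties the encoder's output to the conditioning space the generator already expects, which the abstract describes as aligned with CLIP's image-text embedding space.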
