Diffusion Feedback Helps CLIP See Better

Abstract

Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has severe visual shortcomings: it can hardly distinguish orientation, quantity, color, or structure. These shortcomings also limit the perception capabilities of multimodal large language models (MLLMs) built on CLIP. The main reason could be that the image-text pairs used to train CLIP are inherently biased, because the text lacks distinctiveness and the images lack diversity. In this work, we present a simple post-training approach for CLIP models that largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations using only images, without corresponding text. We demonstrate that DIVA substantially improves CLIP's performance (e.g., by 3-7%) on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities, and that it enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. The code will be available at https://github.com/baaivision/DIVA.
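
The abstract does not state the training objective. As a rough sketch only, one plausible reading of "generative feedback" is the standard noise-prediction loss of diffusion models, with CLIP's visual features of the clean image used as the condition in place of a text embedding. All symbols below are assumptions introduced for illustration, as is the choice to keep the diffusion model frozen and update only the CLIP visual encoder.

```latex
% Assumed notation (none of it appears in the source):
%   x_0            a training image
%   t              a diffusion timestep
%   \epsilon       Gaussian noise added to x_0
%   x_t            the noised image at step t
%   f_\theta       CLIP's visual encoder (the only trainable part)
%   \epsilon_\phi  the pretrained diffusion denoiser (assumed frozen)
\mathcal{L}(\theta)
  = \mathbb{E}_{x_0,\; \epsilon \sim \mathcal{N}(0, I),\; t}
    \left[ \left\| \epsilon
      - \epsilon_\phi\!\left(x_t,\, t,\, f_\theta(x_0)\right)
    \right\|_2^2 \right],
\qquad
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
```

Under this reading, the image supplies both the reconstruction target and the condition, so no captions are required, and the denoising error reaches the CLIP encoder only through the condition it provides.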
