Augmenting CLIP with Improved Visio-Linguistic Reasoning

Abstract

Image-text contrastive models such as CLIP are useful for a variety of downstream applications, including zero-shot classification, image-text retrieval, and transfer learning. However, these contrastively trained vision-language models often fail on compositional visio-linguistic tasks such as Winoground, where their performance is equivalent to random chance. In this paper, we address this issue and propose SDS-CLIP, a sample-efficient, lightweight method for improving the compositional visio-linguistic reasoning capabilities of CLIP. The core idea of our method is to use differentiable image parameterizations to fine-tune CLIP with a distillation objective from large text-to-image generative models such as Stable Diffusion, which are relatively good at visio-linguistic reasoning tasks. On the challenging Winoground compositional reasoning benchmark, our method improves the absolute visio-linguistic performance of different CLIP models by up to 7%, while on the ARO dataset it improves visio-linguistic performance by up to 3%. As a byproduct of inducing visio-linguistic reasoning into CLIP, we also find that zero-shot performance improves marginally on a variety of downstream datasets. Our method reinforces that carefully designed distillation objectives from generative models can be leveraged to extend existing contrastive image-text models with improved visio-linguistic reasoning capabilities.
