Augmenting CLIP with Improved Visio-Linguistic Reasoning

July 18, 2023
Authors: Samyadeep Basu, Maziar Sanjabi, Daniela Massiceti, Shell Xu Hu, Soheil Feizi
cs.AI

Abstract

Image-text contrastive models such as CLIP are useful for a variety of downstream applications including zero-shot classification, image-text retrieval, and transfer learning. However, these contrastively trained vision-language models often fail on compositional visio-linguistic tasks such as Winoground, where their performance is equivalent to random chance. In our paper, we address this issue and propose a sample-efficient, lightweight method called SDS-CLIP to improve the compositional visio-linguistic reasoning capabilities of CLIP. The core idea of our method is to use differentiable image parameterizations to fine-tune CLIP with a distillation objective from large text-to-image generative models such as Stable-Diffusion, which are relatively good at visio-linguistic reasoning tasks. On the challenging Winoground compositional reasoning benchmark, our method improves the absolute visio-linguistic performance of different CLIP models by up to 7%, while on the ARO dataset, our method improves the visio-linguistic performance by up to 3%. As a byproduct of inducing visio-linguistic reasoning into CLIP, we also find that zero-shot performance improves marginally on a variety of downstream datasets. Our method reinforces that carefully designed distillation objectives from generative models can be leveraged to extend existing contrastive image-text models with improved visio-linguistic reasoning capabilities.
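To make the core objective concrete, here is a minimal sketch, not the authors' released implementation: it combines CLIP's standard contrastive loss with an SDS-style denoising distillation term from a frozen diffusion teacher. Every module below (`image_encoder`, `text_encoder`, `to_latent`, `teacher`) and the weight `lambda_sds` are illustrative stand-ins; in the paper the teacher is Stable-Diffusion's UNet.

```python
# Toy sketch (assumption, not the paper's code): CLIP contrastive loss plus an
# SDS-style distillation term from a frozen diffusion "teacher".
import torch
import torch.nn.functional as F
from torch import nn

def clip_contrastive_loss(img_emb, txt_emb, logit_scale=100.0):
    # Standard symmetric InfoNCE objective used to train CLIP.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * img_emb @ txt_emb.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

def sds_style_loss(img_emb, txt_emb, denoise, to_latent):
    # Map the CLIP image embedding into the teacher's latent space with a
    # learnable projection (the differentiable parameterization), noise it at
    # a random timestep, and penalize the frozen denoiser's prediction error
    # when conditioned on the paired caption embedding.
    z = to_latent(img_emb)
    t = torch.rand(z.size(0), 1, device=z.device)        # timestep in [0, 1)
    noise = torch.randn_like(z)
    z_noisy = (1.0 - t).sqrt() * z + t.sqrt() * noise
    return F.mse_loss(denoise(z_noisy, txt_emb, t), noise)

# Toy stand-ins for CLIP's towers, the latent projection, and the teacher.
dim, latent_dim, batch = 512, 64, 8
image_encoder = nn.Linear(3 * 224 * 224, dim)   # placeholder image tower
text_encoder = nn.Linear(77, dim)               # placeholder text tower
to_latent = nn.Linear(dim, latent_dim)          # learnable map into latent space
teacher = nn.Sequential(nn.Linear(latent_dim + dim + 1, 128), nn.SiLU(),
                        nn.Linear(128, latent_dim))
for p in teacher.parameters():                  # the diffusion teacher stays frozen
    p.requires_grad_(False)

def denoise(z_noisy, txt_emb, t):
    return teacher(torch.cat([z_noisy, txt_emb, t], dim=-1))

img_emb = image_encoder(torch.randn(batch, 3 * 224 * 224))
txt_emb = text_encoder(torch.randn(batch, 77))

lambda_sds = 0.1  # hypothetical regularization weight
loss = (clip_contrastive_loss(img_emb, txt_emb)
        + lambda_sds * sds_style_loss(img_emb, txt_emb, denoise, to_latent))
loss.backward()   # gradients reach the towers and projection, not the teacher
```

The design point the sketch tries to capture is that the teacher is never updated: only CLIP's encoders and the projection into the teacher's latent space receive gradients, so the generative model acts purely as a visio-linguistic critic during fine-tuning.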