Augmenting CLIP with Improved Visio-Linguistic Reasoning
July 18, 2023
Authors: Samyadeep Basu, Maziar Sanjabi, Daniela Massiceti, Shell Xu Hu, Soheil Feizi
cs.AI
Abstract
Image-text contrastive models such as CLIP are useful for a variety of
downstream applications including zero-shot classification, image-text
retrieval and transfer learning. However, these contrastively trained
vision-language models often fail on compositional visio-linguistic tasks such
as Winoground with performance equivalent to random chance. In our paper, we
address this issue and propose a sample-efficient light-weight method called
SDS-CLIP to improve the compositional visio-linguistic reasoning capabilities
of CLIP. The core idea of our method is to use differentiable image
parameterizations to fine-tune CLIP with a distillation objective from large
text-to-image generative models such as Stable-Diffusion which are relatively
good at visio-linguistic reasoning tasks. On the challenging Winoground
compositional reasoning benchmark, our method improves the absolute
visio-linguistic performance of different CLIP models by up to 7%, while on the
ARO dataset, our method improves the visio-linguistic performance by up to 3%.
As a byproduct of inducing visio-linguistic reasoning into CLIP, we also find
that the zero-shot performance improves marginally on a variety of downstream
datasets. Our method reinforces that carefully designed distillation objectives
from generative models can be leveraged to extend existing contrastive
image-text models with improved visio-linguistic reasoning capabilities.
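The distillation idea described above can be sketched in code. The following is a hedged toy illustration, not the paper's implementation: `ToyCLIPImageEncoder` and `ToyNoisePredictor` are hypothetical stand-ins for CLIP's image encoder and Stable-Diffusion's frozen UNet, and the noising step is deliberately simplified. It only demonstrates the mechanism the abstract alludes to: a denoising loss from a frozen generative model whose gradients flow back into CLIP through a differentiable conditioning embedding.

```python
# Toy sketch of an SDS-style distillation objective (all module names and
# shapes here are illustrative assumptions, not the paper's architecture).
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyCLIPImageEncoder(nn.Module):
    """Stand-in for CLIP's image encoder (the model being fine-tuned)."""
    def __init__(self, dim=16):
        super().__init__()
        self.proj = nn.Linear(3 * 8 * 8, dim)  # flatten tiny 8x8 RGB "images"

    def forward(self, x):
        return self.proj(x.flatten(1))

class ToyNoisePredictor(nn.Module):
    """Stand-in for a frozen text-to-image diffusion UNet's noise predictor."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Linear(3 * 8 * 8 + dim, 3 * 8 * 8)

    def forward(self, x_t, cond):
        h = torch.cat([x_t.flatten(1), cond], dim=1)
        return self.net(h).view_as(x_t)

def sds_distillation_loss(clip_encoder, noise_predictor, images):
    """Add noise to the image, ask the frozen diffusion model to predict
    that noise conditioned on CLIP's image embedding, and penalize the
    prediction error. Gradients reach the CLIP encoder only through the
    conditioning embedding."""
    cond = clip_encoder(images)            # differentiable conditioning
    noise = torch.randn_like(images)
    t = 0.5                                # single fixed toy noise level
    x_t = (1 - t) * images + t * noise     # simplified forward-noising
    eps_pred = noise_predictor(x_t, cond)
    return ((eps_pred - noise) ** 2).mean()

clip_enc = ToyCLIPImageEncoder()
unet = ToyNoisePredictor()
for p in unet.parameters():                # diffusion model stays frozen
    p.requires_grad_(False)

images = torch.rand(4, 3, 8, 8)
loss = sds_distillation_loss(clip_enc, unet, images)
loss.backward()                            # updates only CLIP's parameters
```

In the paper's actual setup this loss would be combined with CLIP's usual contrastive objective during light fine-tuning, so the distillation signal from the generative model augments rather than replaces contrastive training.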