Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

October 7, 2024
Authors: Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, In So Kweon, Junmo Kim
cs.AI

Abstract

In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes away HN texts that are highly similar to the original texts, damaging the model's multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model's representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is available at: https://github.com/ytaek-oh/fsc-clip.
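
The abstract distinguishes a global HN loss, which contrasts pooled image and text embeddings against hard-negative captions, from FSC-CLIP's local HN loss with calibrated (softened) supervision. The sketch below is only an illustration of that distinction under stated assumptions, not the paper's released implementation: the function names, tensor shapes, and the use of label smoothing as a stand-in for the selective calibrated regularization are all hypothetical.

```python
# Illustrative sketch only; assumes CLIP-style encoders that yield a pooled
# image embedding, a pooled text embedding, per-patch image embeddings, and
# per-token text embeddings. Names and shapes are hypothetical, not FSC-CLIP code.
import torch
import torch.nn.functional as F


def global_hn_loss(img_emb, txt_emb, hn_txt_emb, temperature=0.07):
    """Global hard-negative loss: each image must score its own caption above
    its hard-negative caption, using only the pooled (global) embeddings."""
    img = F.normalize(img_emb, dim=-1)            # (B, D)
    pos = F.normalize(txt_emb, dim=-1)            # (B, D)
    neg = F.normalize(hn_txt_emb, dim=-1)         # (B, D)
    pos_sim = (img * pos).sum(-1) / temperature   # (B,)
    neg_sim = (img * neg).sum(-1) / temperature   # (B,)
    logits = torch.stack([pos_sim, neg_sim], dim=1)   # (B, 2); index 0 = positive
    targets = torch.zeros(img.size(0), dtype=torch.long, device=img.device)
    return F.cross_entropy(logits, targets)


def local_hn_loss(patch_emb, pos_tok_emb, neg_tok_emb,
                  temperature=0.07, label_smoothing=0.1):
    """Local hard-negative loss (sketch): score image patches against text
    tokens and aggregate, so the negative supervision is fine-grained rather
    than acting on a single pooled embedding. Label smoothing here is a crude
    stand-in for the paper's calibrated (soft) targets."""
    patches = F.normalize(patch_emb, dim=-1)      # (B, P, D)
    pos_tok = F.normalize(pos_tok_emb, dim=-1)    # (B, T, D)
    neg_tok = F.normalize(neg_tok_emb, dim=-1)    # (B, T, D)
    # Best-matching patch per token, averaged over tokens -> one score per pair.
    pos_sim = torch.einsum('bpd,btd->bpt', patches, pos_tok).max(1).values.mean(-1)
    neg_sim = torch.einsum('bpd,btd->bpt', patches, neg_tok).max(1).values.mean(-1)
    logits = torch.stack([pos_sim, neg_sim], dim=1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets, label_smoothing=label_smoothing)
```

The intuition suggested by the abstract is that token-level scoring with softened targets supplies fine-grained negative supervision without forcing near-identical HN texts far from the original caption embedding, which is what degrades the pre-trained multi-modal representations. The actual loss definitions are in the released code at https://github.com/ytaek-oh/fsc-clip.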