
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

October 7, 2024
Authors: Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, In So Kweon, Junmo Kim
cs.AI

Abstract

In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of a global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes away HN texts that are highly similar to the original ones, damaging the model's multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model's representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is available at: https://github.com/ytaek-oh/fsc-clip.
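
The contrast between global and local hard-negative supervision described above can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the function names, the max-mean token aggregation, and the two-way (positive vs. hard-negative) logits are assumptions chosen for brevity, and selective calibrated regularization is omitted entirely. A global HN loss compares a single pooled image embedding against the original and hard-negative captions, while a token-level (local) variant scores patch-to-word alignments, which is the kind of fine-grained negative supervision the abstract refers to. See the linked repository for FSC-CLIP's actual losses.

```python
# Illustrative sketch only; not the FSC-CLIP implementation.
import torch
import torch.nn.functional as F


def global_hard_negative_loss(img_emb, txt_emb, hn_txt_emb, temperature=0.07):
    """Global HN loss sketch: a single pooled image embedding is contrasted
    against the original caption (positive) and its hard-negative caption."""
    img_emb = F.normalize(img_emb, dim=-1)        # (B, D) pooled image features
    txt_emb = F.normalize(txt_emb, dim=-1)        # (B, D) pooled caption features
    hn_txt_emb = F.normalize(hn_txt_emb, dim=-1)  # (B, D) pooled hard-negative features

    pos = (img_emb * txt_emb).sum(-1, keepdim=True)     # (B, 1) image-caption similarity
    neg = (img_emb * hn_txt_emb).sum(-1, keepdim=True)  # (B, 1) image-HN similarity
    logits = torch.cat([pos, neg], dim=-1) / temperature
    labels = torch.zeros(img_emb.size(0), dtype=torch.long)  # positive is index 0
    return F.cross_entropy(logits, labels)


def local_hard_negative_loss(img_tokens, txt_tokens, hn_txt_tokens, temperature=0.07):
    """Token-level (local) HN loss sketch: similarity is aggregated over
    patch-token / word-token alignments rather than one pooled vector."""
    img_tokens = F.normalize(img_tokens, dim=-1)        # (B, P, D) image patch tokens
    txt_tokens = F.normalize(txt_tokens, dim=-1)        # (B, T, D) caption word tokens
    hn_txt_tokens = F.normalize(hn_txt_tokens, dim=-1)  # (B, T, D) hard-negative word tokens

    def aggregate(img_t, txt_t):
        # For each text token, take its best-matching image patch, then average
        # over text tokens (a common max-mean alignment score; assumed here).
        sim = torch.einsum("bpd,btd->bpt", img_t, txt_t)  # (B, P, T)
        return sim.max(dim=1).values.mean(dim=-1)         # (B,)

    pos = aggregate(img_tokens, txt_tokens).unsqueeze(-1)     # (B, 1)
    neg = aggregate(img_tokens, hn_txt_tokens).unsqueeze(-1)  # (B, 1)
    logits = torch.cat([pos, neg], dim=-1) / temperature
    labels = torch.zeros(img_tokens.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    # Toy shapes: batch of 4, 49 image patches, 12 text tokens, 512-dim features.
    B, P, T, D = 4, 49, 12, 512
    print(global_hard_negative_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)))
    print(local_hard_negative_loss(torch.randn(B, P, D), torch.randn(B, T, D), torch.randn(B, T, D)))
```

In this sketch the local variant differs only in where the similarity is computed: because the hard-negative caption typically changes just a word or two, scoring it at the token level localizes the penalty instead of pushing the entire pooled text embedding away from the image.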
