Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Abstract

In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of a global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes away HN texts that are highly similar to the original ones, damaging the model's multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates a local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model's representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is available at: https://github.com/ytaek-oh/fsc-clip.
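To make the mechanism concrete, the following is a minimal NumPy sketch of a generic global hard negative loss of the kind the abstract describes: each image is contrasted against all captions in the batch plus one hard negative caption, using only global embeddings. This is an illustrative assumption, not the FSC-CLIP implementation; the function names, the temperature value, and the one-hard-negative-per-image setup are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def global_hn_loss(img, txt, hn_txt, temperature=0.07):
    """Image-to-text contrastive loss with one hard negative caption per image.

    img, txt, hn_txt: arrays of shape (B, D) holding global embeddings.
    hn_txt[i] is assumed to be a hard negative caption for image i.
    All names and the temperature are illustrative assumptions.
    """
    img, txt, hn_txt = (l2_normalize(a) for a in (img, txt, hn_txt))
    batch = img.shape[0]

    # (B, B) similarities to every caption in the batch; diagonal = matched pairs.
    pair_logits = img @ txt.T / temperature
    # (B, 1) similarity of each image to its own hard negative caption.
    hn_logits = np.sum(img * hn_txt, axis=-1, keepdims=True) / temperature

    # The hard negative is appended as one extra negative column per image.
    logits = np.concatenate([pair_logits, hn_logits], axis=1)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Negative log-likelihood of the matched caption for each image.
    idx = np.arange(batch)
    return -np.mean(log_probs[idx, idx])
```

In this sketch, a hard negative whose embedding is nearly identical to the original caption competes directly with the positive in the softmax, so minimizing the loss pushes the two apart. That is the effect on near-duplicate texts that the abstract identifies as harmful to multi-modal representations.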

