Words or Vision: Do Vision-Language Models Have Blind Faith in Text?

Abstract

Vision-Language Models (VLMs) excel at integrating visual and textual information for vision-centric tasks, but how they handle inconsistencies between modalities remains underexplored. We investigate VLMs' modality preferences when they are given visual data alongside varied textual inputs in vision-centric settings. By introducing textual variations to four vision-centric tasks and evaluating ten VLMs, we discover a "blind faith in text" phenomenon: VLMs disproportionately trust textual data over visual data when the two are inconsistent, which leads to significant performance drops under corrupted text and raises safety concerns. We analyze factors that influence this text bias, including instruction prompts, language model size, text relevance, token order, and the interplay between visual and textual certainty. While certain factors, such as scaling up the language model, slightly mitigate text bias, others, such as token order, can exacerbate it because of positional biases inherited from language models. To address this issue, we explore supervised fine-tuning with text augmentation and demonstrate that it is effective in reducing text bias. Additionally, we provide a theoretical analysis suggesting that the blind faith in text phenomenon may stem from an imbalance between pure-text and multi-modal data during training. Our findings highlight the need for balanced training and careful consideration of modality interactions in VLMs to enhance their robustness and reliability when handling multi-modal data inconsistencies.

