Words or Vision: Do Vision-Language Models Have Blind Faith in Text?

March 4, 2025
Authors: Ailin Deng, Tri Cao, Zhirui Chen, Bryan Hooi
cs.AI

Abstract

Vision-Language Models (VLMs) excel at integrating visual and textual information for vision-centric tasks, but their handling of inconsistencies between modalities is underexplored. We investigate VLMs' modality preferences when faced with visual data and varied textual inputs in vision-centered settings. By introducing textual variations to four vision-centric tasks and evaluating ten VLMs, we discover a "blind faith in text" phenomenon: VLMs disproportionately trust textual data over visual data when inconsistencies arise, leading to significant performance drops under corrupted text and raising safety concerns. We analyze factors influencing this text bias, including instruction prompts, language model size, text relevance, token order, and the interplay between visual and textual certainty. While certain factors, such as scaling up the language model size, slightly mitigate text bias, others, like token order, can exacerbate it due to positional biases inherited from language models. To address this issue, we explore supervised fine-tuning with text augmentation and demonstrate its effectiveness in reducing text bias. Additionally, we provide a theoretical analysis suggesting that the blind faith in text phenomenon may stem from an imbalance of pure text and multi-modal data during training. Our findings highlight the need for balanced training and careful consideration of modality interactions in VLMs to enhance their robustness and reliability in handling multi-modal data inconsistencies.
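
The evaluation idea the abstract describes (pairing one image and question with text that agrees with, contradicts, or is unrelated to the visual evidence, then checking which modality the model follows) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the paper's released code: `query_vlm` is a hypothetical placeholder for whatever VLM inference call you use, and the prompt templates are assumptions.

```python
# Minimal sketch of a text-bias probe in the spirit of the abstract.
# `query_vlm` is a hypothetical stand-in for any real VLM inference call
# (e.g., a Hugging Face pipeline or an API client); swap in your own.

from dataclasses import dataclass


@dataclass
class Example:
    image_path: str
    question: str
    visual_answer: str  # ground truth supported by the image


def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder: replace with a real VLM call returning a short answer."""
    raise NotImplementedError


def build_variants(ex: Example, wrong_answer: str) -> dict[str, str]:
    # Pair the same image and question with consistent, corrupted,
    # or irrelevant textual context (templates are assumptions).
    return {
        "consistent": f"Context: the answer is {ex.visual_answer}.\n{ex.question}",
        "corrupted": f"Context: the answer is {wrong_answer}.\n{ex.question}",
        "irrelevant": f"Context: the weather is nice today.\n{ex.question}",
    }


def text_bias_rate(examples: list[tuple[Example, str]]) -> float:
    # Fraction of corrupted-text cases where the model follows the text
    # instead of the image -- "blind faith in text" in the paper's terms.
    followed_text = 0
    for ex, wrong_answer in examples:
        prompt = build_variants(ex, wrong_answer)["corrupted"]
        answer = query_vlm(ex.image_path, prompt)
        if wrong_answer.lower() in answer.lower():
            followed_text += 1
    return followed_text / max(len(examples), 1)
```

The same variant pool suggests one plausible way to set up the text-augmented supervised fine-tuning the abstract mentions: sample consistent, corrupted, and irrelevant contexts for each training image so the model learns when the text should be ignored.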