Words or Vision: Do Vision-Language Models Have Blind Faith in Text?

Abstract

Vision-Language Models (VLMs) excel at integrating visual and textual information for vision-centric tasks, but how they handle inconsistencies between modalities is underexplored. We investigate VLMs' modality preferences when they are given visual data together with varied textual inputs in vision-centric settings. By introducing textual variations to four vision-centric tasks and evaluating ten VLMs, we discover a "blind faith in text" phenomenon: when the modalities disagree, VLMs disproportionately trust textual data over visual data, which leads to significant performance drops under corrupted text and raises safety concerns. We analyze factors that influence this text bias, including instruction prompts, language model size, text relevance, token order, and the interplay between visual and textual certainty. Some factors, such as scaling up the language model, slightly mitigate text bias, while others, such as token order, can exacerbate it because of positional biases inherited from language models. To address the issue, we explore supervised fine-tuning with text augmentation and demonstrate its effectiveness in reducing text bias. We also provide a theoretical analysis suggesting that the blind faith in text phenomenon may stem from an imbalance between pure-text and multi-modal data during training. Our findings highlight the need for balanced training and careful consideration of modality interactions in VLMs to improve their robustness and reliability when handling inconsistencies in multi-modal data.
