Tinted Frames: Question Framing Blinds Vision-Language Models

Abstract

Vision-Language Models (VLMs) have been shown to be "blind", often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind: they modulate the amount of attention applied to visual inputs based on linguistic framing, even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and the distribution of attention over the image. Compared with open-ended framings, constrained framings such as multiple choice and yes/no induce substantially lower attention to image context, reduce focus on task-relevant regions, and shift attention toward uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving both visual grounding and performance across framings.
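As a rough illustration of what "amount of attention over the image" could mean, the sketch below computes the share of one query token's attention mass that lands on image tokens. This is an assumption for illustration only, not the authors' metric: the function name `image_attention_share`, the token layout, and all attention values are hypothetical.

```python
from typing import Sequence


def image_attention_share(attn: Sequence[float], is_image: Sequence[bool]) -> float:
    """Fraction of one query token's attention mass that falls on image tokens.

    Hypothetical helper: `attn` holds attention weights over the context,
    `is_image` marks which context positions are image tokens.
    """
    if len(attn) != len(is_image):
        raise ValueError("attn and is_image must have the same length")
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    on_image = sum(w for w, m in zip(attn, is_image) if m)
    return on_image / total


# Toy context: 4 image tokens followed by 4 text tokens (made-up numbers).
is_image = [True] * 4 + [False] * 4
open_ended = [0.15, 0.20, 0.10, 0.05, 0.10, 0.10, 0.15, 0.15]
multiple_choice = [0.05, 0.05, 0.03, 0.02, 0.25, 0.20, 0.20, 0.20]

share_open = image_attention_share(open_ended, is_image)
share_mc = image_attention_share(multiple_choice, is_image)
print(f"open-ended: {share_open:.2f}, multiple choice: {share_mc:.2f}")
```

In this toy setup the constrained framing places a smaller share of attention on the image than the open-ended one, which is the kind of gap the abstract describes; real measurements would come from a model's attention maps, typically aggregated over heads and layers.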
