Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts

Abstract

Multimodal Large Language Models (MLLMs) perform well on tasks such as visual question answering, but it remains unclear whether their reasoning relies more on memorized world knowledge or on the visual information present in the input image. To investigate this, we introduce Visual CounterFact, a new dataset of visually realistic counterfactuals that put world knowledge priors (e.g., red strawberry) into direct conflict with visual input (e.g., blue strawberry). Using Visual CounterFact, we show that model predictions initially reflect memorized priors, but shift toward visual evidence in mid-to-late layers. This dynamic reveals a competition between the two modalities, with visual input ultimately overriding priors during evaluation. To control this behavior, we propose Pixels Versus Priors (PvP) steering vectors, a mechanism for controlling model outputs toward either world knowledge or visual input through activation-level interventions. On average, PvP successfully shifts 92.5% of color and 74.6% of size predictions from priors to counterfactuals. Together, these findings offer new tools for interpreting and controlling factual behavior in multimodal models.
