Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts

May 21, 2025
Authors: Michal Golovanevsky, William Rudman, Michael Lepori, Amir Bar, Ritambhara Singh, Carsten Eickhoff
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) perform well on tasks such as visual question answering, but it remains unclear whether their reasoning relies more on memorized world knowledge or on the visual information present in the input image. To investigate this, we introduce Visual CounterFact, a new dataset of visually realistic counterfactuals that put world knowledge priors (e.g., a red strawberry) into direct conflict with visual input (e.g., a blue strawberry). Using Visual CounterFact, we show that model predictions initially reflect memorized priors but shift toward visual evidence in mid-to-late layers. This dynamic reveals a competition between the two modalities, with visual input ultimately overriding priors during evaluation. To control this behavior, we propose Pixels Versus Priors (PvP) steering vectors, a mechanism for steering model outputs toward either world knowledge or visual input through activation-level interventions. On average, PvP successfully shifts 92.5% of color and 74.6% of size predictions from priors to counterfactuals. Together, these findings offer new tools for interpreting and controlling factual behavior in multimodal models.
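
The abstract describes PvP as an activation-level intervention; below is a minimal sketch of how such a steering vector could be computed and applied with PyTorch forward hooks. The difference-of-means construction, the `model.layers` attribute, the layer index, and the scaling factor `alpha` are illustrative assumptions for the sketch, not the paper's actual implementation.

```python
import torch

def compute_steering_vector(prior_acts: torch.Tensor,
                            counterfact_acts: torch.Tensor) -> torch.Tensor:
    # Difference of mean activations collected at one layer on
    # prior-consistent vs. counterfactual inputs (shape: [n, d_model]).
    # This is one common way to build a steering direction; the paper's
    # exact construction may differ.
    return counterfact_acts.mean(dim=0) - prior_acts.mean(dim=0)

def add_steering_hook(model, layer_idx: int, vector: torch.Tensor,
                      alpha: float = 1.0):
    # Shift the residual stream at `layer_idx` toward the counterfactual
    # reading (alpha > 0) or back toward the memorized prior (alpha < 0).
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector.to(hidden.device, hidden.dtype)
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

    # `model.layers` is assumed to expose the transformer blocks; adapt the
    # attribute path to the actual module layout of the MLLM being steered.
    return model.layers[layer_idx].register_forward_hook(hook)
```

With such a hook registered at a mid-to-late layer, decoding on a counterfactual image would be nudged toward reporting the depicted attribute rather than the memorized one; calling `remove()` on the returned handle restores the model's default behavior.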