Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts
May 21, 2025
Authors: Michal Golovanevsky, William Rudman, Michael Lepori, Amir Bar, Ritambhara Singh, Carsten Eickhoff
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) perform well on tasks such as visual
question answering, but it remains unclear whether their reasoning relies more
on memorized world knowledge or on the visual information present in the input
image. To investigate this, we introduce Visual CounterFact, a new dataset of
visually-realistic counterfactuals that put world knowledge priors (e.g., a red
strawberry) into direct conflict with visual input (e.g., a blue strawberry).
Using Visual CounterFact, we show that model predictions initially reflect
memorized priors, but shift toward visual evidence in mid-to-late layers. This
dynamic reveals a competition between the two modalities, with visual input
ultimately overriding priors during evaluation. To control this behavior, we
propose Pixels Versus Priors (PvP) steering vectors, a mechanism for
controlling model outputs toward either world knowledge or visual input through
activation-level interventions. On average, PvP successfully shifts 92.5% of
color and 74.6% of size predictions from priors to counterfactuals. Together,
these findings offer new tools for interpreting and controlling factual
behavior in multimodal models.
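
The abstract describes PvP steering vectors as activation-level interventions that push model outputs toward either world knowledge or visual input. The sketch below illustrates one common way such an intervention can be implemented in PyTorch (a difference-of-means direction added to a layer's hidden states via a forward hook); the module paths, sign convention, and scaling factor are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of an activation-level steering intervention, in the spirit of
# the PvP steering vectors described above. Names and conventions are assumed,
# not taken from the paper's implementation.
import torch

def compute_steering_vector(acts_counterfactual: torch.Tensor,
                            acts_prior: torch.Tensor) -> torch.Tensor:
    # Difference-of-means direction between hidden states collected on
    # counterfactual-consistent vs. prior-consistent examples
    # (an assumed way to derive the steering direction).
    return acts_counterfactual.mean(dim=0) - acts_prior.mean(dim=0)

def add_steering_hook(layer_module: torch.nn.Module,
                      steering_vector: torch.Tensor,
                      alpha: float = 1.0):
    # Registers a forward hook that shifts the layer's output hidden states
    # along the steering direction. Assumed sign convention: alpha > 0 pushes
    # predictions toward the visual (counterfactual) evidence, alpha < 0 back
    # toward the memorized prior.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        shift = steering_vector.to(hidden.dtype).to(hidden.device)
        hidden = hidden + alpha * shift
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return layer_module.register_forward_hook(hook)

# Hypothetical usage with a mid-to-late decoder layer of an MLLM:
#   handle = add_steering_hook(model.language_model.model.layers[20], v, alpha=4.0)
#   outputs = model.generate(**inputs)
#   handle.remove()
```

The hook-based approach keeps the base model weights untouched, so the same vector can be applied or removed per query, matching the abstract's framing of steering as a controllable, layer-level intervention.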