Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts

Abstract

Multimodal Large Language Models (MLLMs) perform well on tasks such as visual question answering, but it remains unclear whether their reasoning relies more on memorized world knowledge or on the visual information present in the input image. To investigate this, we introduce Visual CounterFact, a new dataset of visually realistic counterfactuals that put world knowledge priors (e.g., red strawberry) into direct conflict with visual input (e.g., blue strawberry). Using Visual CounterFact, we show that model predictions initially reflect memorized priors but shift toward visual evidence in mid-to-late layers. This dynamic reveals a competition between the two modalities, with visual input ultimately overriding priors during evaluation. To control this behavior, we propose Pixels Versus Priors (PvP) steering vectors, a mechanism for controlling model outputs toward either world knowledge or visual input through activation-level interventions. On average, PvP successfully shifts 92.5% of color and 74.6% of size predictions from priors to counterfactuals. Together, these findings offer new tools for interpreting and controlling factual behavior in multimodal models.
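
The abstract does not specify how a PvP steering vector is computed. As a rough illustration of what an activation-level intervention can look like, the sketch below uses a common difference-of-means construction: average the hidden activations collected under one behavior, subtract the average collected under the other, and add the scaled result to a hidden state. The function names (`mean_vector`, `steering_vector`, `steer`), the scale `alpha`, and the toy 3-dimensional activations are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    dim = len(rows[0])
    if any(len(r) != dim for r in rows):
        raise ValueError("all activation vectors must have the same length")
    return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]


def steering_vector(target_acts: Sequence[Sequence[float]],
                    source_acts: Sequence[Sequence[float]]) -> Vector:
    """Difference of means, pointing from the source behavior to the target.

    Assumption: this is a generic construction, not necessarily the
    one used for PvP.
    """
    target_mean = mean_vector(target_acts)
    source_mean = mean_vector(source_acts)
    if len(target_mean) != len(source_mean):
        raise ValueError("target and source activations differ in size")
    return [t - s for t, s in zip(target_mean, source_mean)]


def steer(hidden: Sequence[float], vector: Sequence[float],
          alpha: float = 1.0) -> Vector:
    """Add the scaled steering vector to a single hidden state."""
    if len(hidden) != len(vector):
        raise ValueError("hidden state and steering vector differ in size")
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations (made-up numbers, not model outputs).
pixel_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]  # answers follow the image
prior_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]  # answers follow memory

pvp = steering_vector(pixel_acts, prior_acts)     # points toward "pixels"
steered = steer([0.0, 2.0, 2.0], pvp, alpha=1.0)  # push a prior-like state
```

In this toy setup a positive `alpha` moves a hidden state toward the image-driven behavior and a negative `alpha` moves it back toward the memorized prior, mirroring the two steering directions the abstract describes. In a real model the vectors would come from a chosen layer's hidden states, and the layer and scale would need to be tuned empirically.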
