Tinted Frames: Question Framing Blinds Vision-Language Models

Abstract

Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind: they modulate the amount of attention applied to visual inputs based on linguistic framing, even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and the distribution of attention over the image. Compared to open-ended framings, constrained framings such as multiple choice and yes/no induce substantially lower attention to image context, reduce focus on task-relevant regions, and shift attention toward uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving both visual grounding and performance across framings.
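The abstract does not say exactly how the amount of attention over the image is measured. As a minimal sketch of one plausible probe, the snippet below computes the share of attention mass that a single query position places on image tokens. The function name `image_attention_share`, the boolean image mask, and the toy weights are all hypothetical and purely illustrative; they are neither the authors' implementation nor results from the paper.

```python
from typing import Sequence


def image_attention_share(attn: Sequence[float], is_image: Sequence[bool]) -> float:
    """Return the fraction of total attention mass that falls on image tokens.

    attn     -- attention weights from one query position to every context token
    is_image -- True where the corresponding context token is an image token
    (Both inputs are assumptions of this sketch, not an interface from the paper.)
    """
    if len(attn) != len(is_image):
        raise ValueError("attn and is_image must have the same length")
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    on_image = sum(w for w, m in zip(attn, is_image) if m)
    return on_image / total


# Made-up weights for illustration only: three image tokens, then two text tokens.
is_image = [True, True, True, False, False]
open_ended = [0.30, 0.25, 0.15, 0.20, 0.10]       # hypothetical open-ended framing
multiple_choice = [0.10, 0.05, 0.05, 0.30, 0.50]  # hypothetical constrained framing

share_open = image_attention_share(open_ended, is_image)
share_mc = image_attention_share(multiple_choice, is_image)
print(f"open-ended: {share_open:.2f}, multiple choice: {share_mc:.2f}")
```

In practice such a share would presumably be averaged over layers, heads, and examples before comparing framings; that aggregation is left out here because the abstract does not describe it.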
