Less Detail, Better Answers: Degradation-Driven Prompting for VQA

Abstract

Recent advancements in Vision-Language Models (VLMs) have significantly pushed the boundaries of Visual Question Answering (VQA). However, high-resolution details can sometimes become noise that leads to hallucinations or reasoning errors. In this paper, we propose Degradation-Driven Prompting (DDP), a novel framework that improves VQA performance by strategically reducing image fidelity so that models focus on essential structural information. We evaluate DDP on two distinct tasks. The physical attributes task targets images prone to human misjudgment; here DDP combines 80p downsampling, structural visual aids (white background masks and orthometric lines), and In-Context Learning (ICL) to calibrate the model's focus. The perceptual phenomena task addresses various visual anomalies and illusions to which machines are susceptible, including Visual Anomaly (VA), Color (CI), Motion (MI), Gestalt (GI), Geometric (GSI), and Visual Illusions (VI). For this task, DDP integrates a task-classification stage with specialized tools such as blur masks and contrast enhancement, alongside downsampling. Our experimental results demonstrate that "less is more": by intentionally degrading visual inputs and providing targeted structural prompts, DDP enables VLMs to bypass distracting textures and achieve superior reasoning accuracy on challenging visual benchmarks.
