Less Detail, Better Answers: Degradation-Driven Prompting for VQA

Abstract

Recent advances in Vision-Language Models (VLMs) have significantly pushed the boundaries of Visual Question Answering (VQA). However, high-resolution details can sometimes act as noise that leads to hallucinations or reasoning errors. In this paper, we propose Degradation-Driven Prompting (DDP), a novel framework that improves VQA performance by strategically reducing image fidelity to force models to focus on essential structural information. We evaluate DDP across two distinct tasks. The first, physical attributes, targets images prone to human misjudgment; here DDP combines 80p downsampling, structural visual aids (white background masks and orthometric lines), and In-Context Learning (ICL) to calibrate the model's focus. The second, perceptual phenomena, addresses various machine-susceptible visual anomalies and illusions, including Visual Anomaly (VA), Color (CI), Motion (MI), Gestalt (GI), Geometric (GSI), and Visual Illusions (VI). For this task, DDP integrates a task-classification stage with specialized tools such as blur masks and contrast enhancement alongside downsampling. Our experimental results demonstrate that less is more: by intentionally degrading visual inputs and providing targeted structural prompts, DDP enables VLMs to bypass distracting textures and achieve superior reasoning accuracy on challenging visual benchmarks.
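The degradation idea can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the abstract does not say how downsampling or contrast enhancement is performed, so the block-averaging method, the min-max contrast stretch, the factor of 2, and the names `downsample`, `stretch_contrast`, and `degrade` are all assumptions chosen for illustration. The image is a toy grayscale grid stored as nested lists.

```python
from typing import Callable, List

# Hypothetical representation: grayscale image as rows of intensities (0..255).
Image = List[List[float]]


def downsample(img: Image, factor: int) -> Image:
    """Shrink the image by averaging non-overlapping factor x factor blocks."""
    h, w = len(img) // factor, len(img[0]) // factor
    out = []
    for by in range(h):
        row = []
        for bx in range(w):
            block = [
                img[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def stretch_contrast(img: Image) -> Image:
    """Linearly rescale intensities so they span the full 0..255 range."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in img]
    return [[(v - lo) * 255.0 / (hi - lo) for v in row] for row in img]


def degrade(img: Image, steps: List[Callable[[Image], Image]]) -> Image:
    """Apply the degradation steps in order (assumed to run before the VLM call)."""
    for step in steps:
        img = step(img)
    return img


# Toy 4x4 image: fine-grained texture on the left, a flat region on the right.
toy = [
    [100, 120, 200, 200],
    [120, 100, 200, 200],
    [100, 120, 200, 200],
    [120, 100, 200, 200],
]
coarse = degrade(toy, [lambda im: downsample(im, 2), stretch_contrast])
print(coarse)
```

In this toy case the alternating 100/120 texture averages out to a single value per block, so only the coarse left/right structure survives, and the contrast stretch then separates the two regions as far as possible. Which steps are applied to which task category is left open here, since the abstract only names the tools.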
