Less Detail, Better Answers: Degradation-Driven Prompting for VQA

Abstract

Recent advances in Vision-Language Models (VLMs) have significantly pushed the boundaries of Visual Question Answering (VQA). However, high-resolution details can sometimes act as noise that leads to hallucinations or reasoning errors. In this paper, we propose Degradation-Driven Prompting (DDP), a novel framework that improves VQA performance by strategically reducing image fidelity, forcing models to focus on essential structural information. We evaluate DDP on two distinct tasks. The first, physical attributes, targets images prone to human misjudgment; here DDP combines 80p downsampling, structural visual aids (white background masks and orthometric lines), and In-Context Learning (ICL) to calibrate the model's focus. The second, perceptual phenomena, addresses machine-susceptible visual anomalies and illusions, including Visual Anomaly (VA), Color (CI), Motion (MI), Gestalt (GI), Geometric (GSI), and Visual Illusions (VI). For this task, DDP adds a task-classification stage and pairs downsampling with specialized tools such as blur masks and contrast enhancement. Our experimental results demonstrate that less is more: by intentionally degrading visual inputs and providing targeted structural prompts, DDP enables VLMs to bypass distracting textures and achieve superior reasoning accuracy on challenging visual benchmarks.
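
To make the degradation idea concrete, here is a minimal sketch in pure Python of three of the operations the abstract names: downsampling, a white background mask, and contrast enhancement. It is an illustration under stated assumptions, not the authors' implementation. The function names (`downsample`, `white_mask`, `stretch_contrast`), the grayscale list-of-rows image representation, the box-averaging filter, and the min-max contrast stretch are all hypothetical choices. Reading "80p" as a target height of 80 pixels is also an assumption, since the abstract does not define it.

```python
from typing import List, Tuple

# Assumption: a grayscale image is a list of rows of intensities in [0, 255].
Gray = List[List[float]]


def downsample(img: Gray, target_h: int) -> Gray:
    """Box-average the image down to target_h rows, keeping the aspect ratio."""
    src_h, src_w = len(img), len(img[0])
    target_h = min(target_h, src_h)  # never upsample
    target_w = max(1, round(src_w * target_h / src_h))
    out: Gray = []
    for r in range(target_h):
        r0 = r * src_h // target_h
        r1 = max(r0 + 1, (r + 1) * src_h // target_h)
        row = []
        for c in range(target_w):
            c0 = c * src_w // target_w
            c1 = max(c0 + 1, (c + 1) * src_w // target_w)
            block = [img[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def white_mask(img: Gray, box: Tuple[int, int, int, int]) -> Gray:
    """Keep pixels inside box = (top, left, bottom, right); paint the rest white."""
    top, left, bottom, right = box
    return [
        [v if top <= y < bottom and left <= x < right else 255.0
         for x, v in enumerate(row)]
        for y, row in enumerate(img)
    ]


def stretch_contrast(img: Gray) -> Gray:
    """Linearly rescale intensities so the darkest is 0 and the brightest is 255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in img]
    return [[(v - lo) * 255.0 / (hi - lo) for v in row] for row in img]


# A 160x320 one-pixel checkerboard stands in for a distracting fine texture.
textured = [[255.0 if (x + y) % 2 == 0 else 0.0 for x in range(320)]
            for y in range(160)]
degraded = downsample(textured, target_h=80)
```

In this toy case every 2x2 block of the checkerboard averages to the same gray value, so the fine texture disappears entirely after downsampling while the overall image layout is preserved, which is the intuition behind "less is more".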