Less Detail, Better Answers: Degradation-Driven Prompting for VQA

Abstract

Recent advances in Vision-Language Models (VLMs) have significantly pushed the boundaries of Visual Question Answering (VQA). However, high-resolution details can sometimes act as noise that leads to hallucinations or reasoning errors. In this paper, we propose Degradation-Driven Prompting (DDP), a novel framework that improves VQA performance by strategically reducing image fidelity, forcing models to focus on essential structural information. We evaluate DDP on two distinct tasks. The first, physical attributes, targets images prone to human misjudgment; here DDP combines 80p downsampling, structural visual aids (white background masks and orthometric lines), and In-Context Learning (ICL) to calibrate the model's focus. The second, perceptual phenomena, addresses machine-susceptible visual anomalies and illusions, including Visual Anomaly (VA), Color (CI), Motion (MI), Gestalt (GI), Geometric (GSI), and Visual Illusions (VI). For this task, DDP integrates a task-classification stage with specialized tools such as blur masks and contrast enhancement alongside downsampling. Our experimental results demonstrate that less is more: by intentionally degrading visual inputs and providing targeted structural prompts, DDP enables VLMs to bypass distracting textures and achieve superior reasoning accuracy on challenging visual benchmarks.
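
The degradation idea can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a grayscale image stored as a list of pixel rows, and it reads "80p" as a target height of 80 pixels, which the abstract does not define. The function names `downsample` and `stretch_contrast` are hypothetical, and linear contrast stretching merely stands in for whatever contrast enhancement the paper actually uses.

```python
from typing import List

Image = List[List[int]]  # grayscale image: rows of 0-255 intensities


def downsample(image: Image, target_height: int = 80) -> Image:
    """Shrink an image by box averaging, preserving the aspect ratio.

    ASSUMPTION: "80p" is interpreted as a target height of 80 pixels.
    """
    src_h, src_w = len(image), len(image[0])
    if src_h <= target_height:
        return [row[:] for row in image]  # never upsample
    dst_h = target_height
    dst_w = max(1, round(src_w * dst_h / src_h))
    out = []
    for y in range(dst_h):
        # Range of source rows covered by destination row y.
        y0 = y * src_h // dst_h
        y1 = max(y0 + 1, (y + 1) * src_h // dst_h)
        row = []
        for x in range(dst_w):
            # Range of source columns covered by destination column x.
            x0 = x * src_w // dst_w
            x1 = max(x0 + 1, (x + 1) * src_w // dst_w)
            block = [image[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(sum(block) // len(block))  # mean of the block
        out.append(row)
    return out


def stretch_contrast(image: Image) -> Image:
    """Linearly rescale intensities so they span the full 0-255 range."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    if hi == lo:
        return [row[:] for row in image]  # flat image: nothing to stretch
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in image]
```

Applied to an image whose coarse structure (say, a dark left half and a bright right half) is overlaid with a fine pixel-level texture, box averaging removes the texture while the coarse split survives, and the contrast stretch then makes that split more pronounced. In a real pipeline an image library would replace these pure-Python loops.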
