Draft and Refine with Visual Experts
November 14, 2025
Authors: Sungheon Jeong, Ryozo Masukawa, Jihong Park, Sanggeon Yun, Wenjun Huang, Hanning Chen, Mahdi Imani, Mohsen Imani
cs.AI
Abstract
While recent Large Vision-Language Models (LVLMs) exhibit strong multimodal reasoning abilities, they often produce ungrounded or hallucinated responses because they rely too heavily on linguistic priors instead of visual evidence. This limitation highlights the absence of a quantitative measure of how much these models actually use visual information during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a question-conditioned utilization metric. The metric quantifies the model's reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific cues and then measuring dependence through relevance-guided probabilistic masking. Guided by this metric, the DnR agent refines its initial draft using targeted feedback from external visual experts. Each expert's output (such as boxes or masks) is rendered as visual cues on the image, and the model is re-queried to select the response that yields the largest improvement in utilization. This process strengthens visual grounding without retraining or architectural changes. Experiments across VQA and captioning benchmarks show consistent accuracy gains and reduced hallucination, demonstrating that measuring visual utilization provides a principled path toward more interpretable and evidence-driven multimodal agent systems.
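The abstract describes two mechanisms: a utilization metric computed by relevance-guided probabilistic masking, and a draft-and-refine loop that re-queries the model on expert-cued images and keeps the candidate with the largest utilization gain. Below is a minimal Python sketch of that flow. It is not the authors' implementation: `generate`, `score_answer`, `relevance_map`, and the `experts` callables are hypothetical interfaces standing in for the LVLM, its answer-scoring head, the query-conditioned relevance map, and the external visual experts, and the image is assumed to be an HxWxC NumPy array.

```python
import numpy as np

def utilization(score_answer, image, question, answer, relevance,
                n_samples=8, seed=0):
    """Relevance-guided probabilistic masking (sketch).

    Occludes each pixel with probability given by the query-conditioned
    relevance map and measures the mean drop in the model's score for
    `answer`. A larger drop means the answer leans more on the
    question-relevant visual evidence.
    """
    rng = np.random.default_rng(seed)
    base = score_answer(image, question, answer)  # e.g. answer log-likelihood
    drops = []
    for _ in range(n_samples):
        # High-relevance pixels are dropped more often; [..., None]
        # broadcasts the HxW mask over the image's channel axis.
        keep = (rng.random(relevance.shape) > relevance)[..., None]
        drops.append(base - score_answer(image * keep, question, answer))
    return float(np.mean(drops))

def draft_and_refine(generate, score_answer, relevance_map, image,
                     question, experts):
    """Draft-and-refine loop (sketch).

    Drafts an answer, then re-queries the model on expert-cued images
    (boxes or masks rendered onto the pixels) and keeps the candidate
    with the highest utilization score.
    """
    relevance = relevance_map(image, question)
    best = generate(image, question)
    best_u = utilization(score_answer, image, question, best, relevance)
    for expert in experts:
        cued = expert(image, question)  # image with rendered visual cues
        candidate = generate(cued, question)
        u = utilization(score_answer, cued, question, candidate, relevance)
        if u > best_u:
            best, best_u = candidate, u
    return best
```

Since selection only compares utilization scores of re-queried responses, the sketch matches the abstract's claim that grounding improves without retraining or architectural changes; all adaptation happens at query time.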