Draft and Refine with Visual Experts
November 14, 2025
Authors: Sungheon Jeong, Ryozo Masukawa, Jihong Park, Sanggeon Yun, Wenjun Huang, Hanning Chen, Mahdi Imani, Mohsen Imani
cs.AI
Abstract
While recent Large Vision-Language Models (LVLMs) exhibit strong multimodal reasoning abilities, they often produce ungrounded or hallucinated responses because they rely too heavily on linguistic priors instead of visual evidence. This limitation highlights the absence of a quantitative measure of how much these models actually use visual information during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a question-conditioned utilization metric. The metric quantifies the model's reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific cues and then measuring dependence through relevance-guided probabilistic masking. Guided by this metric, the DnR agent refines its initial draft using targeted feedback from external visual experts. Each expert's output (such as boxes or masks) is rendered as visual cues on the image, and the model is re-queried to select the response that yields the largest improvement in utilization. This process strengthens visual grounding without retraining or architectural changes. Experiments across VQA and captioning benchmarks show consistent accuracy gains and reduced hallucination, demonstrating that measuring visual utilization provides a principled path toward more interpretable and evidence-driven multimodal agent systems.
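The refine step described above (render each expert's cue onto the image, re-query the model, keep the response with the largest utilization gain) can be sketched as a simple selection loop. This is a minimal illustration based only on the abstract: every name here (`render_cue`, `utilization`, the toy stubs) is a hypothetical stand-in, not the paper's actual API or metric.

```python
# Hypothetical sketch of a Draft-and-Refine (DnR) style loop. Names and
# signatures are illustrative assumptions inferred from the abstract.

def render_cue(image, cue):
    """Hypothetical overlay step: pair the image with an expert's cue
    (in the paper, boxes or masks are drawn onto the image itself)."""
    return (image, cue)

def draft_and_refine(model, image, question, experts, utilization):
    """Return the response whose expert cue yields the largest
    improvement in the visual-utilization score."""
    best_answer = model(image, question)              # initial draft
    best_score = utilization(model, image, question)  # baseline utilization
    for expert in experts:
        cue = expert(image, question)                 # e.g. a box or a mask
        cued = render_cue(image, cue)                 # cue rendered on the image
        score = utilization(model, cued, question)
        if score > best_score:                        # keep the best-grounded answer
            best_answer = model(cued, question)       # re-query with the cue
            best_score = score
    return best_answer

# Toy stand-ins: a cued image (a tuple here) flips the model to the
# visually grounded answer and raises the utilization score.
def toy_model(image, question):
    return "cat" if isinstance(image, tuple) else "dog"

def toy_utilization(model, image, question):
    return 0.9 if isinstance(image, tuple) else 0.4
```

With the toy stubs, `draft_and_refine(toy_model, "img", "what animal?", [lambda im, q: "box"], toy_utilization)` returns the cued answer `"cat"`, since the cue raises the utilization score above the draft's baseline; with no experts it falls back to the draft `"dog"`. Note the selection is training-free: only inference-time re-querying is involved, matching the abstract's claim of no retraining or architectural changes.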