ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
August 5, 2024
Authors: Yuxuan Wang, Alan Yuille, Zhuowan Li, Zilong Zheng
cs.AI
Abstract
Compositional visual reasoning methods, which translate a complex query into
a structured composition of feasible visual tasks, have exhibited a strong
potential in complicated multi-modal tasks. Empowered by recent advances in
large language models (LLMs), this multi-modal challenge has been brought to a
new stage by treating LLMs as few-shot/zero-shot planners, i.e.,
vision-language (VL) programming. Such methods, despite their numerous merits,
suffer from challenges due to LLM planning mistakes or inaccuracy of visual
execution modules, lagging behind the non-compositional models. In this work,
we devise a "plug-and-play" method, ExoViP, to correct errors in both the
planning and execution stages through introspective verification. We employ
verification modules as "exoskeletons" to enhance current VL programming
schemes. Specifically, our proposed verification module utilizes a mixture of
three sub-verifiers to validate predictions after each reasoning step,
subsequently calibrating the visual module predictions and refining the
reasoning trace planned by LLMs. Experimental results on two representative VL
programming methods showcase consistent improvements on five compositional
reasoning tasks on standard benchmarks. In light of this, we believe that
ExoViP can foster better performance and generalization on open-domain
multi-modal challenges.
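The verify-then-calibrate step described in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the function names, the linear score blending, and the toy sub-verifiers (standing in for the paper's three verifier types) are all assumptions for the sake of the example.

```python
from typing import Callable, Dict, List

def verify_step(candidates: List[str],
                sub_verifiers: List[Callable[[str], float]],
                weights: List[float]) -> Dict[str, float]:
    """Score each candidate answer with a weighted mixture of sub-verifiers,
    as done after each reasoning step in the described scheme."""
    return {c: sum(w * v(c) for v, w in zip(sub_verifiers, weights))
            for c in candidates}

def calibrate(module_scores: Dict[str, float],
              verifier_scores: Dict[str, float],
              alpha: float = 0.5) -> str:
    """Blend the visual module's own confidence with the verification scores
    and return the calibrated top candidate (blending rule is assumed)."""
    blended = {c: (1 - alpha) * module_scores[c] + alpha * verifier_scores[c]
               for c in module_scores}
    return max(blended, key=blended.get)

# Toy deterministic sub-verifiers; in the paper these would be learned
# models (e.g., image-text matching, captioning, VQA -- assumed here).
v1 = lambda c: 1.0 if "cat" in c else 0.2
v2 = lambda c: 0.9 if "cat" in c else 0.5
v3 = lambda c: 0.8 if "cat" in c else 0.1

module_scores = {"a dog": 0.6, "a cat": 0.4}  # module initially prefers "a dog"
vscores = verify_step(list(module_scores), [v1, v2, v3], [0.4, 0.3, 0.3])
print(calibrate(module_scores, vscores))  # verification flips the answer to "a cat"
```

The point of the sketch is the direction of information flow: per-step verifier scores can override an execution module's initial ranking, which is how planning and execution errors get corrected mid-trace.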