ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning

Abstract

Compositional visual reasoning methods, which translate a complex query into a structured composition of feasible visual tasks, have shown strong potential in complicated multi-modal tasks. Empowered by recent advances in large language models (LLMs), this multi-modal challenge has been brought to a new stage by treating LLMs as few-shot/zero-shot planners, i.e., vision-language (VL) programming. Despite their numerous merits, such methods suffer from LLM planning mistakes and from inaccurate visual execution modules, and as a result they lag behind non-compositional models. In this work, we devise a "plug-and-play" method, ExoViP, to correct errors in both the planning and execution stages through introspective verification. We employ verification modules as "exoskeletons" to enhance current VL programming schemes. Specifically, our proposed verification module uses a mixture of three sub-verifiers to validate predictions after each reasoning step, then calibrates the visual module predictions and refines the reasoning trace planned by the LLM. Experimental results on two representative VL programming methods show consistent improvements on five compositional reasoning tasks on standard benchmarks. In light of this, we believe that ExoViP can foster better performance and generalization on open-domain multi-modal challenges.
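
The verification step described above can be illustrated with a minimal sketch. The abstract does not say what the three sub-verifiers are, how their scores are mixed, or how calibration is performed, so everything below is an assumption made for illustration: the `Verifier` type, the plain averaging in `mix_verifiers`, the multiplicative reweighting in `calibrate`, and the toy scores are all hypothetical and are not taken from ExoViP itself.

```python
from typing import Callable, Dict, List

# Hypothetical type: a sub-verifier maps a candidate answer to a score in [0, 1].
Verifier = Callable[[str], float]


def mix_verifiers(candidate: str, verifiers: List[Verifier]) -> float:
    """Average the sub-verifier scores for one candidate (assumed mixing rule)."""
    return sum(v(candidate) for v in verifiers) / len(verifiers)


def calibrate(predictions: Dict[str, float], verifiers: List[Verifier]) -> Dict[str, float]:
    """Reweight a module's candidate scores by the mixed verification score,
    then renormalize so the calibrated scores sum to 1 (assumed calibration rule)."""
    weighted = {c: p * mix_verifiers(c, verifiers) for c, p in predictions.items()}
    total = sum(weighted.values())
    if total == 0:
        # No verifier supports any candidate: fall back to the original scores.
        return dict(predictions)
    return {c: w / total for c, w in weighted.items()}


# Toy stand-ins for three sub-verifiers; real ones would inspect the image.
verifiers = [
    lambda c: 0.9 if c == "dog" else 0.2,
    lambda c: 0.8 if c == "dog" else 0.3,
    lambda c: 0.7 if c == "dog" else 0.1,
]

# Output of one hypothetical visual module after a single reasoning step.
module_output = {"cat": 0.6, "dog": 0.4}
calibrated = calibrate(module_output, verifiers)
best = max(calibrated, key=calibrated.get)
```

In this toy run the module initially prefers "cat", but the mixed verification score favors "dog", so the calibrated ranking flips. Refining the LLM-planned reasoning trace, which the abstract also mentions, is not covered by this sketch.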
