
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning

August 5, 2024
Authors: Yuxuan Wang, Alan Yuille, Zhuowan Li, Zilong Zheng
cs.AI

Abstract

Compositional visual reasoning methods, which translate a complex query into a structured composition of feasible visual tasks, have exhibited a strong potential in complicated multi-modal tasks. Empowered by recent advances in large language models (LLMs), this multi-modal challenge has been brought to a new stage by treating LLMs as few-shot/zero-shot planners, i.e., vision-language (VL) programming. Such methods, despite their numerous merits, suffer from challenges due to LLM planning mistakes or inaccuracy of visual execution modules, lagging behind the non-compositional models. In this work, we devise a "plug-and-play" method, ExoViP, to correct errors in both the planning and execution stages through introspective verification. We employ verification modules as "exoskeletons" to enhance current VL programming schemes. Specifically, our proposed verification module utilizes a mixture of three sub-verifiers to validate predictions after each reasoning step, subsequently calibrating the visual module predictions and refining the reasoning trace planned by LLMs. Experimental results on two representative VL programming methods showcase consistent improvements on five compositional reasoning tasks on standard benchmarks. In light of this, we believe that ExoViP can foster better performance and generalization on open-domain multi-modal challenges.
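The verify-then-calibrate loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names, the uniform averaging of sub-verifier scores, and the mixing weight `alpha` are all illustrative assumptions; in the paper the three sub-verifiers are learned vision-language models, stubbed out here with toy scoring functions.

```python
# Hypothetical sketch of ExoViP-style step-by-step verification.
# All names, the averaging scheme, and the mixing weight `alpha`
# are assumptions for illustration, not the paper's actual code.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    answer: str
    score: float  # raw confidence from the visual execution module

def verify(candidate: Candidate,
           sub_verifiers: List[Callable[[str], float]]) -> float:
    """Average the scores of a mixture of sub-verifiers for one prediction."""
    return sum(v(candidate.answer) for v in sub_verifiers) / len(sub_verifiers)

def calibrate(candidates: List[Candidate],
              sub_verifiers: List[Callable[[str], float]],
              alpha: float = 0.5) -> Candidate:
    """Re-rank candidates by mixing module confidence with verification scores."""
    def combined(c: Candidate) -> float:
        return (1 - alpha) * c.score + alpha * verify(c, sub_verifiers)
    return max(candidates, key=combined)

# Toy stand-ins for the three sub-verifiers (e.g. image-text matching,
# captioning-based, and VQA-based checks in the paper).
sub_verifiers = [
    lambda ans: 1.0 if ans == "cat" else 0.2,
    lambda ans: 0.9 if ans == "cat" else 0.3,
    lambda ans: 0.8 if ans in ("cat", "dog") else 0.1,
]

# The execution module slightly prefers the wrong answer; verification
# after this reasoning step flips the ranking toward the right one.
candidates = [Candidate("dog", score=0.7), Candidate("cat", score=0.6)]
best = calibrate(candidates, sub_verifiers)
print(best.answer)  # → cat
```

In the full method this scoring would also feed back into the planner, pruning reasoning traces whose intermediate steps verify poorly; that search over traces is omitted here for brevity.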

