ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
August 5, 2024
Authors: Yuxuan Wang, Alan Yuille, Zhuowan Li, Zilong Zheng
cs.AI
Abstract
Compositional visual reasoning methods, which translate a complex query into
a structured composition of feasible visual tasks, have exhibited a strong
potential in complicated multi-modal tasks. Empowered by recent advances in
large language models (LLMs), this multi-modal challenge has been brought to a
new stage by treating LLMs as few-shot/zero-shot planners, i.e.,
vision-language (VL) programming. Despite their numerous merits, such methods
suffer from LLM planning mistakes and inaccurate visual execution modules, and
consequently lag behind non-compositional models. In this work,
we devise a "plug-and-play" method, ExoViP, to correct errors in both the
planning and execution stages through introspective verification. We employ
verification modules as "exoskeletons" to enhance current VL programming
schemes. Specifically, our proposed verification module utilizes a mixture of
three sub-verifiers to validate predictions after each reasoning step,
subsequently calibrating the visual module predictions and refining the
reasoning trace planned by LLMs. Experimental results on two representative VL
programming methods showcase consistent improvements on five compositional
reasoning tasks on standard benchmarks. In light of this, we believe that
ExoViP can foster better performance and generalization on open-domain
multi-modal challenges.
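The verification loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration only: the function names, the stub sub-verifier scores, the threshold, and the re-planning signal are all assumptions, not the paper's actual implementation, which uses learned sub-verifiers (e.g., image-text matching, captioning, and VQA checks) and LLM-driven trace refinement.

```python
# Hypothetical sketch of ExoViP-style step-wise verification.
# All module names and scores are illustrative stubs.

def verify(candidate, question):
    """Mixture of three sub-verifiers; each stub returns a score in [0, 1].

    In ExoViP these would be, e.g., image-text matching, caption-consistency,
    and VQA-consistency checks over the candidate prediction.
    """
    sub_scores = [0.9, 0.8, 0.7]  # stand-ins for the three sub-verifier scores
    return sum(sub_scores) / len(sub_scores)

def execute_program(steps, question, threshold=0.5):
    """Run a planned program step by step, verifying after each step."""
    trace = []
    for step in steps:
        # Each visual module returns candidate predictions for its step.
        candidates = step["module"](step["args"])
        # Calibrate: re-rank candidates by their verification score.
        best = max(candidates, key=lambda c: verify(c, question))
        # If even the best candidate fails verification, signal the LLM
        # planner to refine the reasoning trace instead of continuing.
        if verify(best, question) < threshold:
            return None
        trace.append(best)
    return trace

# Toy usage with a single stub module.
steps = [{"module": lambda args: ["cat", "dog"], "args": {}}]
print(execute_program(steps, "What animal is on the sofa?"))
```

With real sub-verifiers, the re-ranking step is what "calibrates the visual module predictions", while the failure signal is what triggers the planner-side exploration of an alternative trace.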