BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models
February 21, 2024
Authors: Xueliang Zhao, Xinting Huang, Tingchen Fu, Qintong Li, Shansan Gong, Lemao Liu, Wei Bi, Lingpeng Kong
cs.AI
Abstract
Multimodal reasoning stands as a pivotal capability for large vision-language
models (LVLMs). The integration with Domain-Specific Languages (DSL), offering
precise visual representations, equips these models with the opportunity to
execute more accurate reasoning in complex and professional domains. However,
the vanilla Chain-of-Thought (CoT) prompting method faces challenges in
effectively leveraging the unique strengths of visual and DSL representations,
primarily due to their differing reasoning mechanisms. Additionally, it often
falls short in addressing critical steps in multi-step reasoning tasks. To
mitigate these challenges, we introduce the Bi-Modal
Behavioral Alignment (BBA) prompting method, designed
to maximize the potential of DSL in augmenting complex multi-modal reasoning
tasks. This method begins by guiding LVLMs to create separate reasoning
chains for visual and DSL representations. Subsequently, it aligns these chains
by addressing any inconsistencies, thus achieving a cohesive integration of
behaviors from different modalities. Our experiments demonstrate that BBA
substantially improves the performance of GPT-4V(ision) on geometry problem
solving (28.34% to 34.22%), chess positional advantage prediction
(42.08% to 46.99%) and molecular property prediction (77.47% to
83.52%).
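As a rough illustration of the prompting flow the abstract describes, the sketch below elicits one reasoning chain per modality and then issues a third query that reconciles the two. All function names, the prompt wording, and the `ask` callable are our assumptions for illustration, not the paper's actual prompts.

```python
def bba_prompts(question: str, image_desc: str, dsl: str):
    """Compose the three prompts for a BBA-style pipeline (hypothetical wording).

    Returns (visual_prompt, dsl_prompt, align_template); the alignment
    template has {chain_v} and {chain_d} slots for the two chains.
    """
    visual_prompt = (
        "Reason step by step using only the image.\n"
        f"Question: {question}\nImage: {image_desc}"
    )
    dsl_prompt = (
        "Reason step by step using only the DSL representation.\n"
        f"Question: {question}\nDSL: {dsl}"
    )
    align_template = (
        "Two reasoning chains answer the same question.\n"
        "Chain A (visual): {chain_v}\nChain B (DSL): {chain_d}\n"
        "Identify any inconsistencies between the chains, resolve each one, "
        "and give the final answer."
    )
    return visual_prompt, dsl_prompt, align_template


def bba_answer(ask, question, image_desc, dsl):
    """Run the BBA-style flow; `ask` is any callable wrapping an LVLM query."""
    vp, dp, at = bba_prompts(question, image_desc, dsl)
    chain_v = ask(vp)  # reasoning chain grounded in the visual input
    chain_d = ask(dp)  # reasoning chain grounded in the DSL representation
    # Third call: align the two chains by resolving their inconsistencies.
    return ask(at.format(chain_v=chain_v, chain_d=chain_d))
```

In this sketch the alignment step is itself a model call, matching the abstract's description of resolving inconsistencies between the diagram-driven and DSL-driven chains rather than merging them mechanically.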