BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models
February 21, 2024
Authors: Xueliang Zhao, Xinting Huang, Tingchen Fu, Qintong Li, Shansan Gong, Lemao Liu, Wei Bi, Lingpeng Kong
cs.AI
Abstract
Multimodal reasoning stands as a pivotal capability for large vision-language
models (LVLMs). The integration with Domain-Specific Languages (DSL), offering
precise visual representations, equips these models with the opportunity to
execute more accurate reasoning in complex and professional domains. However,
the vanilla Chain-of-Thought (CoT) prompting method faces challenges in
effectively leveraging the unique strengths of visual and DSL representations,
primarily due to their differing reasoning mechanisms. Additionally, it often
falls short in addressing critical steps in multi-step reasoning tasks. To
mitigate these challenges, we introduce the Bi-Modal
Behavioral Alignment (BBA) prompting method, designed
to maximize the potential of DSL in augmenting complex multi-modal reasoning
tasks. This method begins by guiding LVLMs to create separate reasoning
chains for visual and DSL representations. Subsequently, it aligns these chains
by addressing any inconsistencies, thus achieving a cohesive integration of
behaviors from different modalities. Our experiments demonstrate that BBA
substantially improves the performance of GPT-4V(ision) on geometry problem
solving (28.34% to 34.22%), chess positional advantage prediction
(42.08% to 46.99%) and molecular property prediction (77.47% to
83.52%).
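As a rough illustration of the prompting flow the abstract describes, the sketch below elicits one reasoning chain per modality and then issues a third query that reconciles the two. All function names, the prompt wording, and the `ask` callable are our assumptions for illustration, not the paper's actual prompts.

```python
def bba_prompts(question: str, image_desc: str, dsl: str):
    """Compose the three prompts for a BBA-style pipeline (hypothetical wording).

    Returns (visual_prompt, dsl_prompt, align_template); the alignment
    template has {chain_v} and {chain_d} slots for the two chains.
    """
    visual_prompt = (
        "Reason step by step using only the image.\n"
        f"Question: {question}\nImage: {image_desc}"
    )
    dsl_prompt = (
        "Reason step by step using only the DSL representation.\n"
        f"Question: {question}\nDSL: {dsl}"
    )
    align_template = (
        "Two reasoning chains answer the same question.\n"
        "Chain A (visual): {chain_v}\nChain B (DSL): {chain_d}\n"
        "Identify any inconsistencies between the chains, resolve each one, "
        "and give the final answer."
    )
    return visual_prompt, dsl_prompt, align_template


def bba_answer(ask, question, image_desc, dsl):
    """Run the BBA-style flow; `ask` is any callable wrapping an LVLM query."""
    vp, dp, at = bba_prompts(question, image_desc, dsl)
    chain_v = ask(vp)  # reasoning chain grounded in the visual input
    chain_d = ask(dp)  # reasoning chain grounded in the DSL representation
    # Third call: align the two chains by resolving their inconsistencies.
    return ask(at.format(chain_v=chain_v, chain_d=chain_d))
```

In this sketch the alignment step is itself a model call, matching the abstract's description of resolving inconsistencies between the diagram-driven and DSL-driven chains rather than merging them mechanically.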