BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models
February 21, 2024
Authors: Xueliang Zhao, Xinting Huang, Tingchen Fu, Qintong Li, Shansan Gong, Lemao Liu, Wei Bi, Lingpeng Kong
cs.AI
Abstract
Multimodal reasoning stands as a pivotal capability for large vision-language
models (LVLMs). The integration with Domain-Specific Languages (DSL), offering
precise visual representations, equips these models with the opportunity to
execute more accurate reasoning in complex and professional domains. However,
the vanilla Chain-of-Thought (CoT) prompting method faces challenges in
effectively leveraging the unique strengths of visual and DSL representations,
primarily due to their differing reasoning mechanisms. Additionally, it often
falls short in addressing critical steps in multi-step reasoning tasks. To
mitigate these challenges, we introduce the Bi-Modal
Behavioral Alignment (BBA) prompting method, designed
to maximize the potential of DSL in augmenting complex multi-modal reasoning
tasks. The method begins by guiding LVLMs to create separate reasoning
chains for visual and DSL representations. Subsequently, it aligns these chains
by addressing any inconsistencies, thus achieving a cohesive integration of
behaviors from different modalities. Our experiments demonstrate that BBA
substantially improves the performance of GPT-4V(ision) on geometry problem
solving (28.34% to 34.22%), chess positional advantage prediction
(42.08% to 46.99%) and molecular property prediction (77.47% to
83.52%).
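The two-step flow described in the abstract — elicit an independent reasoning chain per modality, then reconcile the two — can be sketched as a prompting pipeline. This is a hypothetical illustration, not the authors' released code: `query_lvlm` is a stand-in for a call to an LVLM such as GPT-4V (stubbed here so the control flow runs), and the prompt wording is assumed.

```python
def query_lvlm(prompt: str) -> str:
    """Stub for an LVLM call; a real system would invoke a model API here."""
    return f"chain for: {prompt[:30]}"


def bba_prompt(image_desc: str, dsl_repr: str, question: str) -> str:
    """Hypothetical BBA-style pipeline: separate chains, then alignment."""
    # Step 1: elicit an independent reasoning chain from each modality.
    visual_chain = query_lvlm(
        f"Reason over the image.\n{image_desc}\nQ: {question}"
    )
    dsl_chain = query_lvlm(
        f"Reason over the DSL representation.\n{dsl_repr}\nQ: {question}"
    )
    # Step 2: ask the model to resolve inconsistencies between the two
    # chains and produce one aligned answer.
    return query_lvlm(
        "Compare the two reasoning chains, resolve any inconsistencies, "
        f"and answer the question.\n"
        f"Visual chain: {visual_chain}\nDSL chain: {dsl_chain}"
    )
```

For a geometry problem, for example, the DSL representation might be a formal diagram description, with the visual chain grounded in the raw figure; the final call forces the model to confront disagreements between the two chains rather than silently favoring one modality.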