BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models
February 21, 2024
Authors: Xueliang Zhao, Xinting Huang, Tingchen Fu, Qintong Li, Shansan Gong, Lemao Liu, Wei Bi, Lingpeng Kong
cs.AI
Abstract
Multimodal reasoning stands as a pivotal capability for large vision-language
models (LVLMs). The integration with Domain-Specific Languages (DSL), offering
precise visual representations, equips these models with the opportunity to
execute more accurate reasoning in complex and professional domains. However,
the vanilla Chain-of-Thought (CoT) prompting method faces challenges in
effectively leveraging the unique strengths of visual and DSL representations,
primarily due to their differing reasoning mechanisms. Additionally, it often
falls short in addressing critical steps in multi-step reasoning tasks. To
mitigate these challenges, we introduce the Bi-Modal
Behavioral Alignment (BBA) prompting method, designed
to maximize the potential of DSL in augmenting complex multi-modal reasoning
tasks. The method begins by guiding LVLMs to create separate reasoning
chains for visual and DSL representations. Subsequently, it aligns these chains
by addressing any inconsistencies, thus achieving a cohesive integration of
behaviors from different modalities. Our experiments demonstrate that BBA
substantially improves the performance of GPT-4V(ision) on geometry problem
solving (28.34% to 34.22%), chess positional advantage prediction
(42.08% to 46.99%) and molecular property prediction (77.47% to
83.52%).
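The two-step flow described in the abstract — elicit an independent reasoning chain per modality, then reconcile the two — can be sketched as a prompting pipeline. This is a hypothetical illustration, not the authors' released code: `query_lvlm` is a stand-in for a call to an LVLM such as GPT-4V (stubbed here so the control flow runs), and the prompt wording is assumed.

```python
def query_lvlm(prompt: str) -> str:
    """Stub for an LVLM call; a real system would invoke a model API here."""
    return f"chain for: {prompt[:30]}"


def bba_prompt(image_desc: str, dsl_repr: str, question: str) -> str:
    """Hypothetical BBA-style pipeline: separate chains, then alignment."""
    # Step 1: elicit an independent reasoning chain from each modality.
    visual_chain = query_lvlm(
        f"Reason over the image.\n{image_desc}\nQ: {question}"
    )
    dsl_chain = query_lvlm(
        f"Reason over the DSL representation.\n{dsl_repr}\nQ: {question}"
    )
    # Step 2: ask the model to resolve inconsistencies between the two
    # chains and produce one aligned answer.
    return query_lvlm(
        "Compare the two reasoning chains, resolve any inconsistencies, "
        f"and answer the question.\n"
        f"Visual chain: {visual_chain}\nDSL chain: {dsl_chain}"
    )
```

For a geometry problem, for example, the DSL representation might be a formal diagram description, with the visual chain grounded in the raw figure; the final call forces the model to confront disagreements between the two chains rather than silently favoring one modality.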