BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models

Abstract

Multimodal reasoning is a pivotal capability for large vision-language models (LVLMs). Integration with Domain-Specific Languages (DSLs), which offer precise visual representations, gives these models the opportunity to reason more accurately in complex and professional domains. However, the vanilla Chain-of-Thought (CoT) prompting method struggles to leverage the unique strengths of visual and DSL representations, primarily because the two rely on different reasoning mechanisms. It also often fails to address critical steps in multi-step reasoning tasks. To mitigate these challenges, we introduce the Bi-Modal Behavioral Alignment (BBA) prompting method, designed to maximize the potential of DSL in augmenting complex multimodal reasoning tasks. The method first guides LVLMs to create separate reasoning chains for the visual and DSL representations. It then aligns these chains by addressing any inconsistencies, achieving a cohesive integration of behaviors from the different modalities. Our experiments demonstrate that BBA substantially improves the performance of GPT-4V(ision) on geometry problem solving (28.34% to 34.22%), chess positional advantage prediction (42.08% to 46.99%), and molecular property prediction (77.47% to 83.52%).
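The two-stage flow described in the abstract (separate reasoning chains per modality, followed by an alignment step) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ModelFn` interface, the function name `bba_answer`, the prompt wording, and the choice to show the image again during alignment are all hypothetical and do not come from the paper.

```python
from typing import Callable, Optional

# Hypothetical model interface (an assumption, not an API from the paper):
# takes a text prompt and an optional image reference, returns the reply text.
ModelFn = Callable[[str, Optional[str]], str]


def bba_answer(question: str, image: str, dsl: str, model: ModelFn) -> str:
    """Sketch of a two-stage, BBA-style prompting flow."""
    # Stage 1a: elicit a reasoning chain from the visual representation only.
    vision_chain = model(
        f"Question: {question}\nReason step by step using only the image.",
        image,
    )

    # Stage 1b: elicit a separate chain from the DSL representation only.
    dsl_chain = model(
        f"Question: {question}\nDSL description:\n{dsl}\n"
        "Reason step by step using only the DSL description.",
        None,
    )

    # Stage 2: ask the model to reconcile inconsistencies between the chains.
    # Passing the image again here is an assumption made for this sketch.
    align_prompt = (
        f"Question: {question}\n"
        f"Reasoning from the image:\n{vision_chain}\n"
        f"Reasoning from the DSL:\n{dsl_chain}\n"
        "Identify any steps where the two chains disagree, resolve them, "
        "and give one final answer."
    )
    return model(align_prompt, image)
```

Any callable with the assumed signature can be passed as `model`, so the flow can be exercised with a stub before it is wired to a real LVLM client.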
