BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and using them to benchmark robotic intelligence has become a pivotal trend. However, existing frameworks remain largely confined to single-arm manipulation and fail to capture the spatio-temporal coordination required for bimanual tasks such as lifting a heavy pot. To address this, we introduce BiManiBench, a hierarchical benchmark that evaluates MLLMs across three tiers: fundamental spatial reasoning, high-level action planning, and low-level end-effector control. Our framework isolates challenges unique to bimanual manipulation, such as arm reachability and kinematic constraints, thereby distinguishing perceptual hallucinations from planning failures. Analysis of over 30 state-of-the-art models reveals that, despite their proficiency in high-level reasoning, MLLMs struggle with dual-arm spatial grounding and control, which frequently results in mutual interference and sequencing errors. These findings suggest that the current paradigm lacks a deep understanding of mutual kinematic constraints, highlighting the need for future research to focus on inter-arm collision avoidance and fine-grained temporal sequencing.
