BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models
February 9, 2026
Authors: Xin Wu, Zhixuan Liang, Yue Ma, Mengkang Hu, Zhiyuan Qin, Xiu Li
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and using them to benchmark robotic intelligence has become a pivotal trend. However, existing frameworks remain predominantly confined to single-arm manipulation, failing to capture the spatio-temporal coordination required for bimanual tasks like lifting a heavy pot. To address this, we introduce BiManiBench, a hierarchical benchmark evaluating MLLMs across three tiers: fundamental spatial reasoning, high-level action planning, and low-level end-effector control. Our framework isolates unique bimanual challenges, such as arm reachability and kinematic constraints, thereby distinguishing perceptual hallucinations from planning failures. Analysis of over 30 state-of-the-art models reveals that despite high-level reasoning proficiency, MLLMs struggle with dual-arm spatial grounding and control, frequently resulting in mutual interference and sequencing errors. These findings suggest the current paradigm lacks a deep understanding of mutual kinematic constraints, highlighting the need for future research to focus on inter-arm collision-avoidance and fine-grained temporal sequencing.
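The three-tier evaluation structure described in the abstract can be illustrated with a minimal sketch. All names below (`Tier`, `TaskResult`, `per_tier_success_rate`, the task IDs) are hypothetical assumptions for illustration, not BiManiBench's actual API; the point is how scoring each tier separately lets perceptual failures be distinguished from planning or control failures.

```python
# Hypothetical sketch of a three-tier bimanual evaluation harness.
# Tier names follow the abstract; all identifiers are illustrative,
# not BiManiBench's real interface.
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SPATIAL_REASONING = 1  # fundamental spatial reasoning
    ACTION_PLANNING = 2    # high-level action planning
    EE_CONTROL = 3         # low-level end-effector control


@dataclass
class TaskResult:
    tier: Tier
    task_id: str
    success: bool


def per_tier_success_rate(results):
    """Aggregate success rates per tier, so a model weak in dual-arm
    grounding can be told apart from one weak in planning."""
    totals, wins = {}, {}
    for r in results:
        totals[r.tier] = totals.get(r.tier, 0) + 1
        wins[r.tier] = wins.get(r.tier, 0) + int(r.success)
    return {t: wins[t] / totals[t] for t in totals}


# Toy results for one model (task IDs are made up):
results = [
    TaskResult(Tier.SPATIAL_REASONING, "left-right-grounding", True),
    TaskResult(Tier.ACTION_PLANNING, "lift-heavy-pot", True),
    TaskResult(Tier.EE_CONTROL, "dual-arm-handover", False),
]
print(per_tier_success_rate(results))
```

Reporting per-tier rates rather than a single aggregate score mirrors the paper's finding: a model can show strong high-level planning while still failing spatial grounding and end-effector control.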