BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models
February 9, 2026
Authors: Xin Wu, Zhixuan Liang, Yue Ma, Mengkang Hu, Zhiyuan Qin, Xiu Li
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and using them to benchmark robotic intelligence has become a pivotal trend. However, existing frameworks remain predominantly confined to single-arm manipulation, failing to capture the spatio-temporal coordination required for bimanual tasks like lifting a heavy pot. To address this, we introduce BiManiBench, a hierarchical benchmark evaluating MLLMs across three tiers: fundamental spatial reasoning, high-level action planning, and low-level end-effector control. Our framework isolates unique bimanual challenges, such as arm reachability and kinematic constraints, thereby distinguishing perceptual hallucinations from planning failures. Analysis of over 30 state-of-the-art models reveals that despite high-level reasoning proficiency, MLLMs struggle with dual-arm spatial grounding and control, frequently resulting in mutual interference and sequencing errors. These findings suggest the current paradigm lacks a deep understanding of mutual kinematic constraints, highlighting the need for future research to focus on inter-arm collision-avoidance and fine-grained temporal sequencing.
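The three-tier evaluation structure described in the abstract can be illustrated with a minimal sketch. All names below (`Tier`, `TaskResult`, `per_tier_success_rate`, the task IDs) are hypothetical assumptions for illustration, not BiManiBench's actual API; the point is how scoring each tier separately lets perceptual failures be distinguished from planning or control failures.

```python
# Hypothetical sketch of a three-tier bimanual evaluation harness.
# Tier names follow the abstract; all identifiers are illustrative,
# not BiManiBench's real interface.
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SPATIAL_REASONING = 1  # fundamental spatial reasoning
    ACTION_PLANNING = 2    # high-level action planning
    EE_CONTROL = 3         # low-level end-effector control


@dataclass
class TaskResult:
    tier: Tier
    task_id: str
    success: bool


def per_tier_success_rate(results):
    """Aggregate success rates per tier, so a model weak in dual-arm
    grounding can be told apart from one weak in planning."""
    totals, wins = {}, {}
    for r in results:
        totals[r.tier] = totals.get(r.tier, 0) + 1
        wins[r.tier] = wins.get(r.tier, 0) + int(r.success)
    return {t: wins[t] / totals[t] for t in totals}


# Toy results for one model (task IDs are made up):
results = [
    TaskResult(Tier.SPATIAL_REASONING, "left-right-grounding", True),
    TaskResult(Tier.ACTION_PLANNING, "lift-heavy-pot", True),
    TaskResult(Tier.EE_CONTROL, "dual-arm-handover", False),
]
print(per_tier_success_rate(results))
```

Reporting per-tier rates rather than a single aggregate score mirrors the paper's finding: a model can show strong high-level planning while still failing spatial grounding and end-effector control.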