

BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities

October 9, 2025
Authors: Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, Kai Cheng, Peiheng Wang, Jiazheng Liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Xupeng Zhu, Haojie Huang, Lawson L. S. Wong
cs.AI

Abstract

Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities is still lacking, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, with tasks ranging from low-level pointing, trajectory understanding, and spatial reasoning to high-level planning. Extensive evaluation of 20 representative MLLMs reveals persistent limitations across all domains of embodied capabilities. To address this shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. BEAR-Agent substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a 17.5% relative improvement on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/
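As a rough illustration of what an interleaved image-video-text benchmark entry and its scoring might look like, the sketch below defines a hypothetical schema. The field names (`question`, `media`, `domain`, `category`, etc.) and the multiple-choice format are assumptions made for illustration; they are not taken from the BEAR release.

```python
from dataclasses import dataclass
from typing import Callable, List, Literal

# Hypothetical schema for a BEAR-style entry; all field names are
# illustrative assumptions, not the benchmark's actual data format.

@dataclass
class Media:
    kind: Literal["image", "video"]  # entries may interleave both modalities
    path: str                        # local path or URL to the asset

@dataclass
class BearEntry:
    question: str          # text prompt interleaved with the media
    media: List[Media]     # ordered images/videos referenced by the prompt
    choices: List[str]     # candidate answers (multiple choice assumed)
    answer: int            # index of the correct choice
    domain: str            # one of the 14 capability domains
    category: str          # one of the 6 higher-level categories

def accuracy(entries: List[BearEntry],
             predict: Callable[[BearEntry], int]) -> float:
    """Score a model: `predict` maps an entry to a predicted choice index."""
    correct = sum(1 for e in entries if predict(e) == e.answer)
    return correct / len(entries) if entries else 0.0
```

Per-domain accuracy under a schema like this would simply group entries by `domain` before calling `accuracy`, which is how a fine-grained capability breakdown such as BEAR's could be reported.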