Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents
November 24, 2025
Authors: Dayong Liu, Chao Xu, Weihong Chen, Suyu Zhang, Juncheng Wang, Jiankang Deng, Baigui Sun, Yang Liu
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modality question-answer pairs targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond mere surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in higher-order reasoning about intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLM to articulate fine-grained actions translates directly into significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents.
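To make the benchmark composition concrete, the sketch below shows one way a single CFG-Bench item might be represented, given only what the abstract states (curated videos paired with question-answer pairs spanning four cognitive abilities). The class names, field names, and example values are assumptions for illustration, not the released data format.

# Hypothetical sketch of a CFG-Bench item; structure and names are assumed,
# only the four cognitive abilities come from the abstract.
from dataclasses import dataclass
from enum import Enum


class CognitiveAbility(Enum):
    PHYSICAL_INTERACTION = "physical_interaction"
    TEMPORAL_CAUSAL_RELATION = "temporal_causal_relation"
    INTENTIONAL_UNDERSTANDING = "intentional_understanding"
    EVALUATIVE_JUDGMENT = "evaluative_judgment"


@dataclass
class CFGBenchItem:
    video_id: str              # identifier of the curated source video
    ability: CognitiveAbility  # which of the four dimensions the question targets
    question: str              # natural-language question about the observed action
    answer: str                # reference answer, e.g. a fine-grained instruction


# Placeholder example, not drawn from the dataset.
item = CFGBenchItem(
    video_id="video_0001",
    ability=CognitiveAbility.PHYSICAL_INTERACTION,
    question="How should the hand grasp the mug handle to lift it safely?",
    answer="Approach the handle from the side and close the fingers around its upper arc.",
)
print(item.ability.value)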