Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents
November 24, 2025
Authors: Dayong Liu, Chao Xu, Weihong Chen, Suyu Zhang, Juncheng Wang, Jiankang Deng, Baigui Sun, Yang Liu
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modality question-answer pairs targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond mere surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in higher-order reasoning about intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLM to articulate fine-grained actions directly translates to significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents.