

Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents

November 24, 2025
作者: Dayong Liu, Chao Xu, Weihong Chen, Suyu Zhang, Juncheng Wang, Jiankang Deng, Baigui Sun, Yang Liu
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modality question-answer pairs targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond mere surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in higher-order reasoning about intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLM to articulate fine-grained actions directly translates to significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents.
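To make the benchmark composition described above concrete, the sketch below shows one plausible way a CFG-Bench record could be represented: each of the 1,368 curated videos is paired with question-answer pairs, each labeled with one of the four cognitive abilities. This is a minimal illustrative schema only; the field names and types are assumptions, not the authors' actual data format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class CognitiveAbility(Enum):
    # The four cognitive dimensions named in the abstract.
    PHYSICAL_INTERACTION = "physical_interaction"
    TEMPORAL_CAUSAL_RELATION = "temporal_causal_relation"
    INTENTIONAL_UNDERSTANDING = "intentional_understanding"
    EVALUATIVE_JUDGMENT = "evaluative_judgment"


@dataclass
class QAPair:
    # Hypothetical record for one of the 19,562 question-answer pairs.
    question: str
    answer: str
    ability: CognitiveAbility


@dataclass
class BenchmarkEntry:
    # Hypothetical record for one of the 1,368 curated videos,
    # paired with its associated question-answer annotations.
    video_path: str
    qa_pairs: List[QAPair]


# Example usage with placeholder content (not actual benchmark data):
entry = BenchmarkEntry(
    video_path="videos/example_0001.mp4",
    qa_pairs=[
        QAPair(
            question="How should the agent grasp the handle to open the drawer?",
            answer="Grip the handle from above and pull horizontally toward the body.",
            ability=CognitiveAbility.PHYSICAL_INTERACTION,
        )
    ],
)
```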