

ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

November 26, 2025
Authors: Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Manling Li
cs.AI

Abstract

Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition: affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input, while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.
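To make the two reordering tasks concrete, the sketch below shows what a forward or inverse world-modeling item and an exact-match score might look like in Python. The schema, field names, action strings, and metric are illustrative assumptions for exposition only, not the benchmark's actual data format or evaluation protocol.

```python
# Hypothetical sketch of an ENACT-style reordering item (not the official schema).
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class ReorderingItem:
    task: Literal["forward", "inverse"]  # forward: reorder observations; inverse: reorder actions
    given: List[str]                     # the ordered side (actions for forward, observations for inverse)
    shuffled: List[str]                  # the side the model must put back in temporal order
    answer: List[int]                    # gold permutation: index into `shuffled` for each step


def exact_match(predicted: List[int], gold: List[int]) -> float:
    """Score 1.0 only if the full predicted ordering equals the gold permutation."""
    return 1.0 if predicted == gold else 0.0


# Example: an inverse world-modeling item over a 3-step interaction
# (4 egocentric frames in order, 3 scene-graph-change actions shuffled).
item = ReorderingItem(
    task="inverse",
    given=["obs_t0.png", "obs_t1.png", "obs_t2.png", "obs_t3.png"],
    shuffled=["close(fridge)", "grasp(milk)", "open(fridge)"],
    answer=[2, 1, 0],  # correct order: open(fridge), grasp(milk), close(fridge)
)
print(exact_match([2, 1, 0], item.answer))  # 1.0
```

In the forward variant the roles flip: the ordered action sequence is given and the shuffled egocentric observations must be reordered, which is why solving either direction requires reasoning about action effects rather than synthesizing images.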