ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
November 26, 2025
Authors: Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Manling Li
cs.AI
Abstract
Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition: affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input, while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.
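The two reordering tasks described above can be sketched in a few lines. This is a minimal illustrative mock-up, not the benchmark's actual schema: the trajectory format, action tuples, and exact-match scoring below are assumptions made for clarity, and in ENACT itself the observations are egocentric images and the actions are scene graph changes.

```python
import random

# A trajectory is a list of (observation, action) steps. Here observations
# are placeholder frame IDs and actions are illustrative scene-graph-change
# tuples (both hypothetical, for exposition only).

def make_forward_question(trajectory, rng):
    """Forward world modeling: given the ordered actions,
    the model must reorder the shuffled observations."""
    observations = [obs for obs, _ in trajectory]
    actions = [act for _, act in trajectory]
    shuffled = observations[:]
    rng.shuffle(shuffled)
    return {"given": actions, "shuffled": shuffled, "answer": observations}

def make_inverse_question(trajectory, rng):
    """Inverse world modeling: given the ordered observations,
    the model must reorder the shuffled actions."""
    observations = [obs for obs, _ in trajectory]
    actions = [act for _, act in trajectory]
    shuffled = actions[:]
    rng.shuffle(shuffled)
    return {"given": observations, "shuffled": shuffled, "answer": actions}

def score(predicted, answer):
    """Exact-match score for a predicted ordering (1.0 or 0.0)."""
    return float(list(predicted) == list(answer))

rng = random.Random(0)
traj = [
    ("frame_0", ("grasp", "mug", "right_hand")),
    ("frame_1", ("place", "mug", "on", "table")),
    ("frame_2", ("open", "cabinet")),
]
fwd = make_forward_question(traj, rng)
inv = make_inverse_question(traj, rng)
```

Under this framing, both tasks share one question format (reorder a shuffled sequence given the other modality in order), which is what lets the benchmark probe action-effect reasoning without requiring any image synthesis.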