
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation

June 27, 2025
Authors: Qiyue Gao, Xinyu Pi, Kevin Liu, Junrong Chen, Ruolan Yang, Xinqi Huang, Xinyu Fang, Lu Sun, Gautham Kishore, Bo Ai, Stone Tao, Mengyang Liu, Jiaxi Yang, Chao-Jung Lai, Chuanyang Jin, Jiannan Xiang, Benhao Huang, Zeming Chen, David Danks, Hao Su, Tianmin Shu, Ziqiao Ma, Lianhui Qin, Zhiting Hu
cs.AI

Abstract

Internal world models (WMs) enable agents to understand the world's state and predict transitions, serving as the basis for advanced deliberative reasoning. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o, and Gemini, exhibit potential as general-purpose WMs. While recent studies have evaluated these models and revealed limitations in specific capabilities such as visual understanding, a systematic evaluation of VLMs' fundamental WM abilities remains absent. Drawing on comparative psychology and cognitive science, we propose a two-stage framework that assesses Perception (visual, spatial, temporal, quantitative, and motion) and Prediction (mechanistic simulation, transitive inference, compositional inference) to provide an atomic evaluation of VLMs as WMs. Guided by this framework, we introduce WM-ABench, a large-scale benchmark comprising 23 fine-grained evaluation dimensions across 6 diverse simulated environments with controlled counterfactual simulations. Through 660 experiments on 15 recent commercial and open-source VLMs, we find that these models exhibit striking limitations in basic world modeling abilities. For instance, almost all models perform at near-random accuracy when distinguishing motion trajectories. Additionally, they lack disentangled understanding: for example, some models tend to believe that blue objects move faster than green ones. Further results and analyses reveal significant gaps between VLMs and human-level world modeling.
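
To make the two-stage design concrete, the sketch below shows one way an atomic evaluation harness of this kind might be organized in Python. It is not the paper's released code: the `Item` dataclass, the dimension lists, and the `model.answer` interface are illustrative assumptions; only the stage and dimension names come from the abstract.

```python
# Hypothetical sketch of an atomic two-stage evaluation harness in the
# spirit of WM-ABench (not the authors' implementation). Each item probes
# exactly one perception or prediction dimension, with ground truth taken
# from the simulator state that generated the observations.
from dataclasses import dataclass

PERCEPTION = ["visual", "spatial", "temporal", "quantitative", "motion"]
PREDICTION = ["mechanistic_simulation", "transitive_inference",
              "compositional_inference"]

@dataclass
class Item:
    stage: str          # "perception" or "prediction"
    dimension: str      # one atomic dimension from the lists above
    frames: list        # rendered observations from a simulated environment
    question: str       # multiple-choice probe isolating this dimension
    options: list[str]  # candidate answers, including counterfactual foils
    answer: str         # ground truth read off the simulator state

def evaluate(model, items: list[Item]) -> dict:
    """Return per-(stage, dimension) accuracy. A model with a sound internal
    world model should beat chance on every atomic dimension, not merely on
    the aggregate score. `model.answer(...)` is an assumed VLM interface."""
    correct: dict = {}
    total: dict = {}
    for it in items:
        pred = model.answer(it.frames, it.question, it.options)
        key = (it.stage, it.dimension)
        total[key] = total.get(key, 0) + 1
        correct[key] = correct.get(key, 0) + int(pred == it.answer)
    return {k: correct[k] / total[k] for k in total}
```

Reporting accuracy per atomic dimension, rather than as a single average, is what surfaces failures like the near-random trajectory discrimination and the color-speed entanglement described above.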