DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning
August 7, 2025
Authors: Xinrun Xu, Pi Bu, Ye Wang, Börje F. Karlsson, Ziming Wang, Tengtao Song, Qi Zhu, Jun Song, Zhiming Ding, Bo Zheng
cs.AI
Abstract
Although Vision Language Models (VLMs) exhibit strong perceptual abilities
and impressive visual reasoning, they struggle with attention to detail and
precise action planning in complex, dynamic environments, leading to subpar
performance. Real-world tasks typically require complex interactions, advanced
spatial reasoning, long-term planning, and continuous strategy refinement, all
of which usually demand an understanding of the physical rules of the target
scenario.
However, evaluating these capabilities in real-world scenarios is often
prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel
benchmark framework designed to systematically evaluate VLMs' understanding and
reasoning about fundamental physical principles through a series of challenging
simulated environments. DeepPHY integrates multiple physical reasoning
environments of varying difficulty levels and incorporates fine-grained
evaluation metrics. Our evaluation finds that even state-of-the-art VLMs
struggle to translate descriptive physical knowledge into precise, predictive
control.
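
The abstract does not describe DeepPHY's concrete interface, but the agentic evaluation it outlines follows a familiar observe-reason-act-score cycle. The minimal Python sketch below illustrates that cycle under stated assumptions: every name in it (`ToyPhysicsEnv`, `RandomAgent`, `run_episode`) is a hypothetical stand-in rather than the paper's API, and the toy 1-D environment merely substitutes for DeepPHY's simulated physics environments.

```python
# Hypothetical sketch of an agentic evaluation loop over a simulated
# physics environment. None of these names come from the DeepPHY paper;
# they only illustrate the observe -> reason -> act -> score cycle.
import random


class ToyPhysicsEnv:
    """Toy stand-in for one simulated environment: push a block along a
    1-D line until it reaches position 5."""

    def reset(self) -> int:
        self.pos = 0
        return self.pos  # observation; a real environment would render an image

    def step(self, action: str) -> tuple[int, float, bool]:
        self.pos += 1 if action == "push_right" else -1
        done = self.pos == 5
        return self.pos, (1.0 if done else 0.0), done


class RandomAgent:
    """Stand-in for a VLM that is prompted with an observation plus an
    instruction and replies with a discrete action."""

    def act(self, observation: int, instruction: str) -> str:
        return random.choice(["push_right", "push_left"])


def run_episode(env, agent, instruction: str, max_steps: int = 100) -> dict:
    """Roll out one episode and collect simple per-episode metrics."""
    obs = env.reset()
    total_reward, done = 0.0, False
    for step in range(1, max_steps + 1):
        obs, reward, done = env.step(agent.act(obs, instruction))
        total_reward += reward
        if done:  # task solved within the step budget
            break
    # A real benchmark would aggregate such fine-grained metrics over many
    # episodes and environments of varying difficulty.
    return {"success": done, "steps": step, "reward": total_reward}


print(run_episode(ToyPhysicsEnv(), RandomAgent(), "Push the block to x = 5."))
```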