DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning
August 7, 2025
Authors: Xinrun Xu, Pi Bu, Ye Wang, Börje F. Karlsson, Ziming Wang, Tengtao Song, Qi Zhu, Jun Song, Zhiming Ding, Bo Zheng
cs.AI
Abstract
Although Vision Language Models (VLMs) exhibit strong perceptual abilities
and impressive visual reasoning, they struggle with attention to detail and
precise action planning in complex, dynamic environments, leading to subpar
performance. Real-world tasks typically require complex interactions, advanced
spatial reasoning, long-term planning, and continuous strategy refinement, all
of which usually demand an understanding of the physical rules of the target
scenario.
However, evaluating these capabilities in real-world scenarios is often
prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel
benchmark framework designed to systematically evaluate VLMs' understanding and
reasoning about fundamental physical principles through a series of challenging
simulated environments. DeepPHY integrates multiple physical reasoning
environments of varying difficulty levels and incorporates fine-grained
evaluation metrics. Our evaluation finds that even state-of-the-art VLMs
struggle to translate descriptive physical knowledge into precise, predictive
control.
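
The abstract does not describe DeepPHY's concrete interface, but the agentic evaluation it outlines follows a familiar observe-reason-act-score cycle. The minimal Python sketch below illustrates that cycle under stated assumptions: every name in it (`ToyPhysicsEnv`, `RandomAgent`, `run_episode`) is a hypothetical stand-in rather than the paper's API, and the toy 1-D environment merely substitutes for DeepPHY's simulated physics environments.

```python
# Hypothetical sketch of an agentic evaluation loop over a simulated
# physics environment. None of these names come from the DeepPHY paper;
# they only illustrate the observe -> reason -> act -> score cycle.
import random


class ToyPhysicsEnv:
    """Toy stand-in for one simulated environment: push a block along a
    1-D line until it reaches position 5."""

    def reset(self) -> int:
        self.pos = 0
        return self.pos  # observation; a real environment would render an image

    def step(self, action: str) -> tuple[int, float, bool]:
        self.pos += 1 if action == "push_right" else -1
        done = self.pos == 5
        return self.pos, (1.0 if done else 0.0), done


class RandomAgent:
    """Stand-in for a VLM that is prompted with an observation plus an
    instruction and replies with a discrete action."""

    def act(self, observation: int, instruction: str) -> str:
        return random.choice(["push_right", "push_left"])


def run_episode(env, agent, instruction: str, max_steps: int = 100) -> dict:
    """Roll out one episode and collect simple per-episode metrics."""
    obs = env.reset()
    total_reward, done = 0.0, False
    for step in range(1, max_steps + 1):
        obs, reward, done = env.step(agent.act(obs, instruction))
        total_reward += reward
        if done:  # task solved within the step budget
            break
    # A real benchmark would aggregate such fine-grained metrics over many
    # episodes and environments of varying difficulty.
    return {"success": done, "steps": step, "reward": total_reward}


print(run_episode(ToyPhysicsEnv(), RandomAgent(), "Push the block to x = 5."))
```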