PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
May 21, 2025
作者: Hui Shen, Taiqiang Wu, Qi Han, Yunta Hsieh, Jizhou Wang, Yuyue Zhang, Yuxin Cheng, Zijian Hao, Yuansheng Ni, Xin Wang, Zhongwei Wan, Kai Zhang, Wendong Xu, Jing Xiong, Ping Luo, Wenhu Chen, Chaofan Tao, Zhuoqing Mao, Ngai Wong
cs.AI
Abstract
Existing benchmarks fail to capture a crucial aspect of intelligence:
physical reasoning, the integrated ability to combine domain knowledge,
symbolic reasoning, and understanding of real-world constraints. To address
this gap, we introduce PhyX: the first large-scale benchmark designed to assess
models' capacity for physics-grounded reasoning in visual scenarios. PhyX
includes 3K meticulously curated multimodal questions spanning 6 reasoning
types across 25 sub-domains and 6 core physics domains: thermodynamics,
electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In
our comprehensive evaluation, even state-of-the-art models struggle
significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and
GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy
respectively, with performance gaps exceeding 29% compared to human experts. Our
analysis exposes critical limitations in current models: over-reliance on
memorized disciplinary knowledge, excessive dependence on mathematical
formulations, and surface-level visual pattern matching rather than genuine
physical understanding. We provide in-depth analysis through fine-grained
statistics, detailed case studies, and multiple evaluation paradigms to
thoroughly examine physical reasoning capabilities. To ensure reproducibility,
we implement a compatible evaluation protocol based on widely-used toolkits
such as VLMEvalKit, enabling one-click evaluation.Summary
AI-Generated Summary