ChatPaper.ai

PhyX: Does Your Model Have the "Wits" for Physical Reasoning?

May 21, 2025
Authors: Hui Shen, Taiqiang Wu, Qi Han, Yunta Hsieh, Jizhou Wang, Yuyue Zhang, Yuxin Cheng, Zijian Hao, Yuansheng Ni, Xin Wang, Zhongwei Wan, Kai Zhang, Wendong Xu, Jing Xiong, Ping Luo, Wenhu Chen, Chaofan Tao, Zhuoqing Mao, Ngai Wong
cs.AI

Abstract

Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3,000 meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy respectively, with performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely-used toolkits such as VLMEvalKit, enabling one-click evaluation.
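The reported numbers can be sanity-checked with a few lines of arithmetic. This is a minimal sketch using only the three model accuracies and the ">29% gap" claim from the abstract; the implied lower bound on human-expert accuracy is a derived quantity, not a figure reported by the paper:

```python
# Headline PhyX accuracies from the abstract (percent).
model_acc = {
    "GPT-4o": 32.5,
    "Claude3.7-Sonnet": 42.2,
    "GPT-o4-mini": 45.8,
}

# The abstract states every model trails human experts by more than 29%.
# That bound is tightest for the best-scoring model, so it implies a
# minimum human-expert accuracy of best_score + 29.
best_model, best_score = max(model_acc.items(), key=lambda kv: kv[1])
human_floor = best_score + 29.0

print(best_model, human_floor)  # GPT-o4-mini 74.8
```

In other words, the claim implies human experts score at least roughly 74.8% on PhyX, more than 40 points above GPT-4o.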

