
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning

February 17, 2025
Authors: Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, Jun Liu
cs.AI

Abstract

Large language models demonstrate remarkable capabilities across various domains, especially mathematical and logical reasoning. However, current evaluations overlook physics-based reasoning - a complex task requiring physics theorems and constraints. We present PhysReason, a 1,200-problem benchmark comprising knowledge-based (25%) and reasoning-based (75%) problems, where the latter are divided into three difficulty levels (easy, medium, hard). Notably, problems require an average of 8.1 solution steps, with hard problems requiring 15.6, reflecting the complexity of physics-based reasoning. We propose the Physics Solution Auto Scoring Framework, which combines efficient answer-level evaluation with comprehensive step-level evaluation. Top-performing models such as Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation, with performance dropping from knowledge questions (75.11%) to hard problems (31.95%). Through step-level evaluation, we identify four key bottlenecks: Physics Theorem Application, Physics Process Understanding, Calculation, and Physics Condition Analysis. These findings position PhysReason as a novel and comprehensive benchmark for evaluating physics-based reasoning capabilities in large language models. Our code and data will be published at https://dxzxy12138.github.io/PhysReason.
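The abstract does not specify how the scoring framework is implemented, so the following is only a minimal sketch of the two ideas it names: the 25%/75% split of the 1,200 problems, and the difference between answer-level and step-level scoring. The function names, the exact-string match on final answers, and the boolean per-step judgments are all assumptions made for illustration, not the authors' method.

```python
from typing import Sequence, Tuple

TOTAL_PROBLEMS = 1200      # from the abstract
KNOWLEDGE_SHARE = 0.25     # from the abstract


def split_counts(total: int, knowledge_share: float) -> Tuple[int, int]:
    """Return (knowledge-based, reasoning-based) problem counts."""
    knowledge = round(total * knowledge_share)
    return knowledge, total - knowledge


def answer_level_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Hypothetical answer-level score: share of final answers that match
    the reference exactly after stripping surrounding whitespace."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predicted, reference))
    return hits / len(reference)


def step_level_score(step_correct: Sequence[bool]) -> float:
    """Hypothetical step-level score: share of solution steps judged correct."""
    if not step_correct:
        return 0.0
    return sum(step_correct) / len(step_correct)


knowledge_count, reasoning_count = split_counts(TOTAL_PROBLEMS, KNOWLEDGE_SHARE)
print(knowledge_count, reasoning_count)
```

Under these assumptions, a solution with a wrong final answer scores zero at the answer level but can still earn partial credit at the step level, which is why a step-level view can localize where a model's reasoning breaks down.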