SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning
May 25, 2025
作者: Kun Xiang, Heng Li, Terry Jingchen Zhang, Yinya Huang, Zirong Liu, Peixin Qu, Jixi He, Jiaqi Chen, Yu-Jie Yuan, Jianhua Han, Hang Xu, Hanhui Li, Mrinmaya Sachan, Xiaodan Liang
cs.AI
Abstract
We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.
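
As a rough illustration of how a headline accuracy figure like "sub-60%" might be computed over such a benchmark, the sketch below scores model predictions against gold answers by exact match. This is a minimal sketch under assumed conventions: the field names (question, image_path, answer), the file name, and the predict_fn callable are hypothetical stand-ins, not the paper's actual data schema or evaluation protocol, which may use more permissive answer matching.

```python
import json
from typing import Callable, Optional

# Minimal sketch, assuming each benchmark item is a JSON object with
# hypothetical fields: "question" (str), "image_path" (str, or null for
# text-only items), and "answer" (str). SeePhys's real schema may differ.
def score_predictions(
    examples: list[dict],
    predict_fn: Callable[[str, Optional[str]], str],
) -> float:
    """Exact-match accuracy of predict_fn over the benchmark items."""
    correct = 0
    for ex in examples:
        pred = predict_fn(ex["question"], ex.get("image_path"))
        correct += int(pred.strip() == ex["answer"].strip())
    return correct / len(examples)

if __name__ == "__main__":
    # Hypothetical local dump of benchmark items.
    with open("seephys_sample.json") as f:
        examples = json.load(f)
    # A trivial text-only baseline that ignores the diagram entirely;
    # the paper's point is that such shortcuts fail on the 75% of
    # items that are vision-essential.
    accuracy = score_predictions(examples, lambda q, img: "0")
    print(f"accuracy = {accuracy:.1%}")
```

Exact match is the simplest possible scorer; for free-form physics answers, an evaluation would more plausibly normalize units and symbolic forms before comparison, but that choice is independent of the harness structure shown here.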