SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

May 25, 2025
作者: Kun Xiang, Heng Li, Terry Jingchen Zhang, Yinya Huang, Zirong Liu, Peixin Qu, Jixi He, Jiaqi Chen, Yu-Jie Yuan, Jianhua Han, Hang Xu, Hanhui Li, Mrinmaya Sachan, Xiaodan Liang
cs.AI

Abstract

We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.
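
As an illustration of how headline numbers like these could be reproduced from per-item model outputs, here is a minimal Python scoring sketch. The record schema (`vision_essential`, `gold`, `pred`) and the toy data are hypothetical placeholders for illustration, not the benchmark's actual format or release code.

```python
# Hypothetical sketch (not the authors' code): overall vs.
# vision-essential accuracy on SeePhys-style per-item results.
from collections import defaultdict

# Toy stand-in records; the real benchmark spans 7 physics domains
# and 21 diagram categories, with ~75% of items vision-essential.
records = [
    {"domain": "mechanics", "vision_essential": True,  "gold": "B", "pred": "B"},
    {"domain": "optics",    "vision_essential": True,  "gold": "A", "pred": "C"},
    {"domain": "em",        "vision_essential": False, "gold": "D", "pred": "D"},
]

def accuracy(items):
    # Fraction of items where the model's prediction matches the gold answer.
    return sum(r["gold"] == r["pred"] for r in items) / len(items) if items else 0.0

overall = accuracy(records)
vision = accuracy([r for r in records if r["vision_essential"]])
print(f"overall accuracy:          {overall:.2%}")
print(f"vision-essential accuracy: {vision:.2%}")

# Per-domain breakdown, useful for locating where diagram
# interpretation breaks down.
by_domain = defaultdict(list)
for r in records:
    by_domain[r["domain"]].append(r)
for domain, items in sorted(by_domain.items()):
    print(f"{domain}: {accuracy(items):.2%}")
```

Reporting the vision-essential subset separately is what exposes the textual-shortcut failure mode the abstract describes: a model that ignores the diagram can still score on text-sufficient items, but not on this subset.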
