SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning
May 25, 2025
作者: Kun Xiang, Heng Li, Terry Jingchen Zhang, Yinya Huang, Zirong Liu, Peixin Qu, Jixi He, Jiaqi Chen, Yu-Jie Yuan, Jianhua Han, Hang Xu, Hanhui Li, Mrinmaya Sachan, Xiaodan Liang
cs.AI
Abstract
We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.
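
As a rough illustration of how a headline accuracy figure like "sub-60%" might be computed over such a benchmark, the sketch below scores model predictions against gold answers by exact match. This is a minimal sketch under assumed conventions: the field names (question, image_path, answer), the file name, and the predict_fn callable are hypothetical stand-ins, not the paper's actual data schema or evaluation protocol, which may use more permissive answer matching.

```python
import json
from typing import Callable, Optional

# Minimal sketch, assuming each benchmark item is a JSON object with
# hypothetical fields: "question" (str), "image_path" (str, or null for
# text-only items), and "answer" (str). SeePhys's real schema may differ.
def score_predictions(
    examples: list[dict],
    predict_fn: Callable[[str, Optional[str]], str],
) -> float:
    """Exact-match accuracy of predict_fn over the benchmark items."""
    correct = 0
    for ex in examples:
        pred = predict_fn(ex["question"], ex.get("image_path"))
        correct += int(pred.strip() == ex["answer"].strip())
    return correct / len(examples)

if __name__ == "__main__":
    # Hypothetical local dump of benchmark items.
    with open("seephys_sample.json") as f:
        examples = json.load(f)
    # A trivial text-only baseline that ignores the diagram entirely;
    # the paper's point is that such shortcuts fail on the 75% of
    # items that are vision-essential.
    accuracy = score_predictions(examples, lambda q, img: "0")
    print(f"accuracy = {accuracy:.1%}")
```

Exact match is the simplest possible scorer; for free-form physics answers, an evaluation would more plausibly normalize units and symbolic forms before comparison, but that choice is independent of the harness structure shown here.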