SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning
May 25, 2025
Authors: Kun Xiang, Heng Li, Terry Jingchen Zhang, Yinya Huang, Zirong Liu, Peixin Qu, Jixi He, Jiaqi Chen, Yu-Jie Yuan, Jianhua Han, Hang Xu, Hanhui Li, Mrinmaya Sachan, Xiaodan Liang
cs.AI
Abstract
We present SeePhys, a large-scale multimodal benchmark for LLM reasoning
grounded in physics questions ranging from middle school to PhD qualifying
exams. The benchmark covers 7 fundamental domains spanning the physics
discipline, incorporating 21 categories of highly heterogeneous diagrams. In
contrast to prior works where visual elements mainly serve auxiliary purposes,
our benchmark features a substantial proportion of vision-essential problems
(75%) that mandate visual information extraction for correct solutions.
Through extensive evaluation, we observe that even the most advanced visual
reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy
on our benchmark. These results reveal fundamental challenges in current large
language models' visual understanding capabilities, particularly in: (i)
establishing rigorous coupling between diagram interpretation and physics
reasoning, and (ii) overcoming their persistent reliance on textual cues as
cognitive shortcuts.
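To make the reported numbers concrete, below is a minimal sketch (not the authors' evaluation code) of how accuracy on a benchmark like this could be scored, both overall and on the vision-essential subset the abstract highlights (~75% of problems). The record schema, the `PhysicsProblem` fields, and the `query_model` stub are all illustrative assumptions, not the released SeePhys API.

```python
# Minimal, self-contained sketch of benchmark scoring. Everything here
# (schema, stub model) is a hypothetical illustration, not SeePhys code.

from dataclasses import dataclass

@dataclass
class PhysicsProblem:
    question: str
    diagram_path: str | None   # path to the accompanying diagram, if any
    answer: str                # gold answer string
    vision_essential: bool     # True if the diagram is required to solve it

def query_model(problem: PhysicsProblem) -> str:
    """Placeholder for a call to a multimodal model (e.g., an API client
    that would send both the question text and the diagram image)."""
    return "42"  # stub prediction

def accuracy(problems: list[PhysicsProblem], vision_only: bool = False) -> float:
    """Fraction of exact-match answers, optionally restricted to the
    vision-essential subset described in the abstract."""
    subset = [p for p in problems if p.vision_essential] if vision_only else problems
    if not subset:
        return 0.0
    correct = sum(query_model(p).strip() == p.answer.strip() for p in subset)
    return correct / len(subset)

if __name__ == "__main__":
    demo = [
        PhysicsProblem("Read the block's acceleration from the free-body "
                       "diagram.", "diagrams/incline.png", "2.5 m/s^2", True),
        PhysicsProblem("State Newton's second law.", None, "F = ma", False),
    ]
    print(f"overall accuracy: {accuracy(demo):.2%}")
    print(f"vision-essential accuracy: {accuracy(demo, vision_only=True):.2%}")
```

Scoring the vision-essential subset separately, as sketched above, is what lets a benchmark distinguish genuine diagram understanding from the text-cue shortcuts the abstract describes.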