SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

Abstract

We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline and incorporates 21 categories of highly heterogeneous diagrams. In contrast to prior work, where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion (75%) of vision-essential problems, which require extracting visual information to reach a correct solution. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve less than 60% accuracy on our benchmark. These results reveal fundamental challenges in the visual understanding capabilities of current large language models, particularly in (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.
