PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Abstract

We introduce PHYBench, a novel, high-quality benchmark for evaluating the reasoning capabilities of large language models (LLMs) in physical contexts. PHYBench consists of 500 meticulously curated physics problems based on real-world physical scenarios, designed to assess how well models understand and reason about realistic physical processes. The benchmark covers mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics, and its difficulty ranges from high school exercises through undergraduate problems to Physics Olympiad challenges. We also propose the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions. Unlike traditional binary scoring, it captures differences in both the models' reasoning processes and their final results. We evaluate a range of LLMs on PHYBench and compare their performance with that of human experts. The results show that even state-of-the-art reasoning models lag significantly behind human experts, highlighting their limitations and the need for improvement in complex physical reasoning scenarios. Our benchmark results and dataset are publicly available at https://phybench-official.github.io/phybench-demo/.
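
The abstract describes the EED Score only as a metric based on the edit distance between mathematical expressions. As a rough illustration of why such a metric can award partial credit where binary scoring cannot, the sketch below computes a token-level Levenshtein distance between two expression strings and maps it to a score between 0 and 100. The tokenizer, the function names (`tokenize`, `levenshtein`, `graded_score`), and the linear scoring rule are all assumptions made for illustration. This is not the PHYBench implementation, which may operate on a different representation of expressions and use a different scoring rule.

```python
import re

# Hypothetical tokenizer: numbers, identifiers, and basic operators.
TOKEN_PATTERN = re.compile(r"\d+\.\d+|\d+|[A-Za-z_]\w*|\*\*|[-+*/^()]")


def tokenize(expression: str) -> list:
    """Split an expression string into a flat list of tokens."""
    return TOKEN_PATTERN.findall(expression.replace(" ", ""))


def levenshtein(a: list, b: list) -> int:
    """Classic edit distance (insert, delete, substitute) between two token lists."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (token_a != token_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def graded_score(reference: str, answer: str) -> float:
    """Map the edit distance, relative to the reference length, to a 0-100 score.

    The linear mapping is an illustrative assumption, not the paper's formula.
    """
    ref_tokens = tokenize(reference)
    ans_tokens = tokenize(answer)
    if not ref_tokens:
        raise ValueError("reference expression has no tokens")
    relative = levenshtein(ref_tokens, ans_tokens) / len(ref_tokens)
    return 100.0 * max(0.0, 1.0 - relative)


if __name__ == "__main__":
    print(graded_score("m*g*h", "m*g*h"))    # identical expressions
    print(graded_score("m*g*h/2", "m*g*h"))  # answer misses a factor
    print(graded_score("m*g*h", "q*E*d"))    # answer uses different symbols
```

Under this toy rule, an identical answer scores 100.0, an answer that only drops a factor keeps most of its credit, and an answer built from different symbols scores much lower. A binary metric would give all three cases either full marks or zero. A flat token sequence ignores algebraic equivalence (for example, `a*b` versus `b*a`), so a practical metric would need to normalize or compare expressions structurally.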
