CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

Abstract

We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in condensed matter physics. CMPhysBench comprises more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Leveraging tree-based representations of expressions, we also introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of the similarity between prediction and ground truth. Our results show that even the best model, Grok-4, reaches an average SEED score of only 36 and an accuracy of only 28% on CMPhysBench, underscoring a significant capability gap in this practical, frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
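
To make the idea of partial credit from an expression edit distance concrete, here is a minimal sketch. It is not the SEED implementation from the CMPhysBench repository. It assumes answers are written as Python-style arithmetic (parsed with the standard `ast` module), uses a plain unit-cost ordered tree edit distance, and normalizes by the size of the larger tree. The names `to_tree`, `forest_distance`, and `similarity_score` are hypothetical.

```python
import ast
from functools import lru_cache


def to_tree(node):
    """Convert a Python AST expression node into a (label, children) tuple."""
    if isinstance(node, ast.Expression):
        return to_tree(node.body)
    if isinstance(node, ast.BinOp):
        return (type(node.op).__name__, (to_tree(node.left), to_tree(node.right)))
    if isinstance(node, ast.UnaryOp):
        return (type(node.op).__name__, (to_tree(node.operand),))
    if isinstance(node, ast.Call):
        label = getattr(node.func, "id", "call")
        return (label, tuple(to_tree(arg) for arg in node.args))
    if isinstance(node, ast.Name):
        return (node.id, ())
    if isinstance(node, ast.Constant):
        return (repr(node.value), ())
    raise ValueError(f"unsupported expression node: {type(node).__name__}")


def size(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f:
        return sum(size(t) for t in g)  # insert everything in g
    if not g:
        return sum(size(t) for t in f)  # delete everything in f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two rightmost roots, relabeling if their labels differ.
    match = (forest_distance(v_children, w_children)
             + forest_distance(f[:-1], g[:-1])
             + (v_label != w_label))
    return min(delete, insert, match)


def similarity_score(prediction, ground_truth):
    """Score in [0, 100]: 100 for identical trees, lower as edits accumulate."""
    p = to_tree(ast.parse(prediction, mode="eval"))
    t = to_tree(ast.parse(ground_truth, mode="eval"))
    n = max(size(p), size(t))
    dist = forest_distance((p,), (t,))
    return 100.0 * max(0, n - dist) / n


print(similarity_score("a*b + c", "a*b - c"))  # 80.0: one relabel out of 5 nodes
```

Under these assumptions, a prediction that differs from the reference by a single sign scores 80 rather than 0, which is the kind of non-binary credit the abstract describes. The actual SEED score may parse, normalize, and weight expressions differently.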
