CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

Abstract

We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of large language models (LLMs) in condensed matter physics. CMPhysBench comprises more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Leveraging tree-based representations of expressions, we also introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of the similarity between a prediction and the ground truth. Our results show that even the best-performing model, Grok-4, reaches an average SEED score of only 36 and an accuracy of only 28% on CMPhysBench, underscoring a significant capability gap in this practical, frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
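To make the idea of tree-based partial credit concrete, the following is a minimal sketch and not the authors' SEED implementation. It assumes answers can be written as Python-style arithmetic strings, parses them with the standard `ast` module, and scores them by an ordered tree edit distance normalised by the larger tree size. The names `to_tree`, `forest_distance`, and `toy_score` are hypothetical, as is the normalisation.

```python
import ast
from functools import lru_cache


def to_tree(node):
    """Convert a Python AST expression into a hashable (label, children) tuple."""
    if isinstance(node, ast.Expression):
        return to_tree(node.body)
    if isinstance(node, ast.BinOp):
        return (type(node.op).__name__, (to_tree(node.left), to_tree(node.right)))
    if isinstance(node, ast.UnaryOp):
        return (type(node.op).__name__, (to_tree(node.operand),))
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        return (node.func.id, tuple(to_tree(arg) for arg in node.args))
    if isinstance(node, ast.Name):
        return (node.id, ())
    if isinstance(node, ast.Constant):
        return (repr(node.value), ())
    raise ValueError(f"unsupported expression node: {type(node).__name__}")


def size(forest):
    """Count the nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Ordered forest edit distance with unit costs (insert, delete, relabel)."""
    if not f and not g:
        return 0
    if not f:
        return size(g)
    if not g:
        return size(f)
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_distance(f[:-1] + v_kids, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_kids) + 1
    # Match the two rightmost roots, relabelling if their labels differ.
    match = (forest_distance(v_kids, w_kids)
             + forest_distance(f[:-1], g[:-1])
             + (v_label != w_label))
    return min(delete, insert, match)


def toy_score(prediction, reference):
    """Return a similarity in [0, 100]; 100 means the trees are identical."""
    pred = to_tree(ast.parse(prediction, mode="eval"))
    ref = to_tree(ast.parse(reference, mode="eval"))
    n = max(size((pred,)), size((ref,)))
    dist = forest_distance((pred,), (ref,))
    return 100.0 * max(0, n - dist) / n


if __name__ == "__main__":
    print(toy_score("a*b + c", "a*b + c"))  # identical expressions
    print(toy_score("a*b + c", "a*b - c"))  # one operator differs
```

A prediction that differs from the reference in a single operator still earns most of the credit here, which is the non-binary behaviour the abstract describes. A realistic scorer would also need to parse LaTeX answers and treat mathematically equivalent forms as equal, neither of which this sketch attempts.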
