CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
August 25, 2025
Authors: Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Yunqi Cai, Xi Dai, Shufei Zhang, Lei Bai, Jinguang Cheng, Zhong Fang, Hongming Weng
cs.AI
Abstract
We introduce CMPhysBench, a novel benchmark designed to assess the proficiency
of Large Language Models (LLMs) in Condensed Matter Physics. CMPhysBench is
composed of more than 520 meticulously curated graduate-level
questions covering both representative subfields and foundational theoretical
frameworks of condensed matter physics, such as magnetism, superconductivity,
and strongly correlated systems. To ensure a deep understanding of the
problem-solving process, we focus exclusively on calculation problems, requiring
LLMs to independently generate comprehensive solutions. Meanwhile, leveraging
tree-based representations of expressions, we introduce the Scalable Expression
Edit Distance (SEED) score, which provides fine-grained (non-binary) partial
credit and yields a more accurate assessment of similarity between prediction
and ground truth. Our results show that even the best model, Grok-4, reaches
only an average SEED score of 36 and 28% accuracy on CMPhysBench, underscoring a
significant capability gap, especially for this practical and frontier domain
relative to traditional physics. The code and dataset are publicly available at
https://github.com/CMPhysBench/CMPhysBench.
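
The abstract does not give the SEED formula, so the following is only a rough
sketch of the general idea of tree-based partial credit, not the authors'
metric. It parses two expressions into trees with Python's standard `ast`
module, compares them with a simple top-down (Selkow-style) tree edit
distance, and maps the distance to a 0-100 score. The function names
(`to_tree`, `size`, `distance`, `similarity`), the unit edit costs, and the
normalization by the larger tree size are all assumptions made for
illustration.

```python
import ast
from functools import lru_cache


def to_tree(expr: str):
    """Parse an arithmetic expression into a nested (label, children) tuple."""
    def convert(node):
        if isinstance(node, ast.BinOp):
            return (type(node.op).__name__,
                    (convert(node.left), convert(node.right)))
        if isinstance(node, ast.UnaryOp):
            return (type(node.op).__name__, (convert(node.operand),))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            return (node.func.id, tuple(convert(a) for a in node.args))
        if isinstance(node, ast.Name):
            return (node.id, ())
        if isinstance(node, ast.Constant):
            return (repr(node.value), ())
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")
    return convert(ast.parse(expr, mode="eval").body)


def size(tree) -> int:
    """Number of nodes in the tree (cost of inserting or deleting it whole)."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def distance(a, b) -> int:
    """Top-down tree edit distance: roots are always matched (relabel cost 1),
    and child subtrees are aligned left to right like a sequence alignment."""
    relabel = 0 if a[0] == b[0] else 1
    xs, ys = a[1], b[1]
    # prev[j] = cost of aligning the children seen so far in xs with ys[:j]
    prev = [0]
    for y in ys:
        prev.append(prev[-1] + size(y))
    for x in xs:
        cur = [prev[0] + size(x)]
        for j, y in enumerate(ys, 1):
            cur.append(min(prev[j] + size(x),             # delete subtree x
                           cur[j - 1] + size(y),          # insert subtree y
                           prev[j - 1] + distance(x, y)))  # match x with y
        prev = cur
    return relabel + prev[-1]


def similarity(pred: str, truth: str) -> float:
    """Assumed normalization: 100 for identical trees, 0 when the distance
    reaches the size of the larger tree."""
    a, b = to_tree(pred), to_tree(truth)
    return 100.0 * max(0.0, 1.0 - distance(a, b) / max(size(a), size(b)))


if __name__ == "__main__":
    print(similarity("a*b + c", "a*b + c"))  # identical expressions
    print(similarity("a*b + c", "a*b - c"))  # one wrong operator
    print(similarity("x", "sin(y) + 2"))     # unrelated expressions
```

Under these assumptions, a prediction that differs from the reference by a
single operator keeps most of its credit, whereas a binary exact-match metric
would score it as zero. A purely syntactic comparison like this one treats
algebraically equivalent forms (for example `a + b` and `b + a`) as different,
so a practical metric would need some form of simplification or
canonicalization first.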