CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

Abstract

We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in condensed matter physics. CMPhysBench comprises more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of the similarity between a prediction and the ground truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 28% accuracy on CMPhysBench, underscoring a significant capability gap, especially in this practical and frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
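The abstract does not say how SEED is computed, so the following is only a minimal sketch of the general idea of tree-based partial credit, not the authors' metric. It assumes expressions are written in Python syntax, parses them with the standard `ast` module, and swaps in a simple Selkow-style top-down tree edit distance. The names `to_tree`, `tree_size`, `tree_distance`, and `partial_credit`, as well as the normalization to a 0-100 scale, are hypothetical choices made for illustration.

```python
import ast


def to_tree(expr: str):
    """Parse an arithmetic expression into a (label, children) tuple tree."""
    def convert(node):
        if isinstance(node, ast.BinOp):
            return (type(node.op).__name__, [convert(node.left), convert(node.right)])
        if isinstance(node, ast.UnaryOp):
            return (type(node.op).__name__, [convert(node.operand)])
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            return (node.func.id, [convert(arg) for arg in node.args])
        if isinstance(node, ast.Name):
            return (node.id, [])
        if isinstance(node, ast.Constant):
            return (repr(node.value), [])
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    return convert(ast.parse(expr, mode="eval").body)


def tree_size(tree) -> int:
    """Number of nodes in the tree."""
    _, children = tree
    return 1 + sum(tree_size(child) for child in children)


def tree_distance(a, b) -> int:
    """Top-down ordered tree edit distance (Selkow-style simplification).

    Relabeling a node costs 1; inserting or deleting a child costs the
    size of that child's whole subtree.
    """
    (label_a, kids_a), (label_b, kids_b) = a, b
    m, n = len(kids_a), len(kids_b)
    # dp[i][j]: cost of aligning the first i children of a with the first j of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + tree_size(kids_a[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + tree_size(kids_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + tree_size(kids_a[i - 1]),                    # delete subtree
                dp[i][j - 1] + tree_size(kids_b[j - 1]),                    # insert subtree
                dp[i - 1][j - 1] + tree_distance(kids_a[i - 1], kids_b[j - 1]),  # match
            )
    return int(label_a != label_b) + dp[m][n]


def partial_credit(prediction: str, truth: str) -> float:
    """Hypothetical 0-100 score: 100 for identical trees, lower as edits grow."""
    pred_tree, true_tree = to_tree(prediction), to_tree(truth)
    dist = tree_distance(pred_tree, true_tree)
    largest = max(tree_size(pred_tree), tree_size(true_tree))
    return 100.0 * max(0.0, 1.0 - dist / largest)


print(partial_credit("a*b + c", "a*b + c"))        # identical expressions
print(partial_credit("a*b + c", "a*b - c"))        # one wrong operator
print(partial_credit("x", "sin(y)**2 + 1"))        # structurally unrelated
```

Under these assumptions, a prediction with a single wrong operator keeps most of its credit, whereas a binary exact-match check would score it as zero. The sketch compares syntax only: it does not treat algebraically equivalent forms such as `a*b` and `b*a` as equal, which a real metric would need to handle.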
