Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
September 30, 2025
Authors: Minhui Zhu, Minyang Tian, Xiaocheng Yang, Tianci Zhou, Penghao Zhu, Eli Chertkov, Shengyan Liu, Yufeng Du, Lifan Yuan, Ziming Ji, Indranil Das, Junyi Cao, Jinchen He, Yifan Su, Jiabin Yu, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Xinan Chen, Peixue Wu, Yunkai Wang, Juntai Zhou, Yong Zhao, Farshid Jafarpour, Jessie Shelton, Aaron Young, John Bartolotta, Wenchao Xu, Yue Sun, Anjun Chu, Victor Colussi, Chris Akers, Nathan Brooks, Wenbo Fu, Christopher Wilson, Jinchao Zhao, Marvin Qi, Anqi Mu, Yubo Yang, Allen Zang, Yang Lyu, Peizhi Mai, Xuefei Guo, Luyu Gao, Ze Yang, Chi Xue, Dmytro Bandak, Yaïr Hein, Yonatan Kahn, Kevin Zhou, John Drew Wilson, Jarrod T. Reilly, Di Luo, Daniel Inafuku, Hao Tong, Liang Yang, Ruixing Zhang, Xueying Wang, Ofir Press, Nicolas Chia, Eliu Huerta, Hao Peng
cs.AI
Abstract
While large language models (LLMs) with reasoning capabilities are
progressing rapidly on high-school math competitions and coding, can they
reason effectively through complex, open-ended challenges found in frontier
physics research? And crucially, what kinds of reasoning tasks do physicists
want LLMs to assist with? To address these questions, we present CritPt
(Complex Research using Integrated Thinking - Physics Test, pronounced
"critical point"), the first benchmark designed to test LLMs on unpublished,
research-level reasoning tasks that broadly cover modern physics research
areas, including condensed matter, quantum physics, atomic, molecular & optical
physics, astrophysics, high energy physics, mathematical physics, statistical
physics, nuclear physics, nonlinear dynamics, fluid dynamics, and biophysics.
CritPt consists of 71 composite research challenges designed to simulate
full-scale research projects at the entry level, which are also decomposed into
190 simpler checkpoint tasks for more fine-grained insights. All problems are
newly created by 50+ active physics researchers based on their own research.
Every problem is hand-curated to admit a guess-resistant and machine-verifiable
answer and is evaluated by an automated grading pipeline heavily customized for
advanced physics-specific output formats. We find that while current
state-of-the-art LLMs show early promise on isolated checkpoints, they remain
far from being able to reliably solve full research-scale challenges: the best
average accuracy among base models is only 4.0%, achieved by GPT-5 (high),
moderately rising to around 10% when equipped with coding tools. Through the
realistic yet standardized evaluation offered by CritPt, we highlight a large
disconnect between current model capabilities and realistic physics research
demands, offering a foundation to guide the development of scientifically
grounded AI tools.
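
The abstract states that every problem admits a guess-resistant, machine-verifiable answer graded by an automated pipeline customized for physics-specific output formats. As an illustration only (not the authors' actual pipeline), the following minimal Python sketch shows what machine verification of physics answers could look like; the helper names grade_numeric and grade_symbolic, the tolerance value, and the use of sympy are assumptions for this example.

# Hypothetical sketch of machine-verifiable answer checking, not the CritPt grader:
# numeric answers are compared within a relative tolerance, symbolic answers are
# checked by simplifying the difference from the reference expression to zero.
import math
import sympy as sp

def grade_numeric(model_answer: str, reference: float, rel_tol: float = 1e-3) -> bool:
    """Accept a numeric answer if it matches the reference within a relative tolerance."""
    try:
        value = float(sp.N(sp.sympify(model_answer)))  # tolerates expressions like "2*pi/3"
    except (sp.SympifyError, TypeError, ValueError):
        return False
    return math.isclose(value, reference, rel_tol=rel_tol)

def grade_symbolic(model_answer: str, reference: str) -> bool:
    """Accept a symbolic answer if its difference from the reference simplifies to zero."""
    try:
        diff = sp.simplify(sp.sympify(model_answer) - sp.sympify(reference))
    except (sp.SympifyError, TypeError):
        return False
    return diff == 0

if __name__ == "__main__":
    print(grade_numeric("2*pi/3", 2.0944))               # True: within 0.1% of 2*pi/3
    print(grade_symbolic("sin(x)**2 + cos(x)**2", "1"))  # True: identity simplifies to 1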