Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
September 30, 2025
Authors: Minhui Zhu, Minyang Tian, Xiaocheng Yang, Tianci Zhou, Penghao Zhu, Eli Chertkov, Shengyan Liu, Yufeng Du, Lifan Yuan, Ziming Ji, Indranil Das, Junyi Cao, Jinchen He, Yifan Su, Jiabin Yu, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Xinan Chen, Peixue Wu, Yunkai Wang, Juntai Zhou, Yong Zhao, Farshid Jafarpour, Jessie Shelton, Aaron Young, John Bartolotta, Wenchao Xu, Yue Sun, Anjun Chu, Victor Colussi, Chris Akers, Nathan Brooks, Wenbo Fu, Christopher Wilson, Jinchao Zhao, Marvin Qi, Anqi Mu, Yubo Yang, Allen Zang, Yang Lyu, Peizhi Mai, Xuefei Guo, Luyu Gao, Ze Yang, Chi Xue, Dmytro Bandak, Yaïr Hein, Yonatan Kahn, Kevin Zhou, John Drew Wilson, Jarrod T. Reilly, Di Luo, Daniel Inafuku, Hao Tong, Liang Yang, Ruixing Zhang, Xueying Wang, Ofir Press, Nicolas Chia, Eliu Huerta, Hao Peng
cs.AI
Abstract
While large language models (LLMs) with reasoning capabilities are
progressing rapidly on high-school math competitions and coding, can they
reason effectively through complex, open-ended challenges found in frontier
physics research? And crucially, what kinds of reasoning tasks do physicists
want LLMs to assist with? To address these questions, we present CritPt
(Complex Research using Integrated Thinking - Physics Test, pronounced
"critical point"), the first benchmark designed to test LLMs on unpublished,
research-level reasoning tasks spanning modern physics research areas,
including condensed matter, quantum physics, atomic, molecular & optical
physics, astrophysics, high energy physics, mathematical physics, statistical
physics, nuclear physics, nonlinear dynamics, fluid dynamics, and biophysics.
CritPt consists of 71 composite research challenges designed to simulate
entry-level, full-scale research projects, which are further decomposed into
190 simpler checkpoint tasks for more fine-grained insights. All problems are
newly created by 50+ active physics researchers based on their own research.
Every problem is hand-curated to admit a guess-resistant and machine-verifiable
answer and is evaluated by an automated grading pipeline heavily customized for
advanced physics-specific output formats. We find that while current
state-of-the-art LLMs show early promise on isolated checkpoints, they remain
far from being able to reliably solve full research-scale challenges: the best
average accuracy among base models is only 4.0%, achieved by GPT-5 (high),
rising moderately to around 10% when equipped with coding tools. Through the
realistic yet standardized evaluation offered by CritPt, we highlight a large
disconnect between current model capabilities and realistic physics research
demands, offering a foundation to guide the development of scientifically
grounded AI tools.
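
The abstract states that every answer is hand-curated to be machine-verifiable and is graded by an automated pipeline customized for physics-specific output formats. As a minimal illustrative sketch only (not the authors' actual pipeline; the function names and tolerances below are assumptions), such a grader might accept numeric answers within a relative tolerance and check symbolic answers for algebraic equivalence:

```python
import math
import sympy as sp

def grade_numeric(model_answer: str, reference: float, rel_tol: float = 1e-4) -> bool:
    """Hypothetical check: accept a numeric answer within a relative tolerance."""
    try:
        value = float(model_answer)
    except ValueError:
        return False  # unparseable output is marked wrong
    return math.isclose(value, reference, rel_tol=rel_tol)

def grade_symbolic(model_answer: str, reference: str) -> bool:
    """Hypothetical check: accept a symbolic answer algebraically equal to the reference."""
    try:
        diff = sp.simplify(sp.sympify(model_answer) - sp.sympify(reference))
    except (sp.SympifyError, TypeError):
        return False
    return diff == 0

# Usage: both answers would pass this illustrative grader.
assert grade_numeric("6.2832", 2 * math.pi, rel_tol=1e-3)
assert grade_symbolic("sin(x)**2 + cos(x)**2", "1")
```

Tolerance-based numeric matching and symbolic-equivalence checking are one plausible way to keep answers guess-resistant yet automatically verifiable; the paper describes its own pipeline as heavily customized well beyond a sketch like this.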