Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
September 30, 2025
Authors: Minhui Zhu, Minyang Tian, Xiaocheng Yang, Tianci Zhou, Penghao Zhu, Eli Chertkov, Shengyan Liu, Yufeng Du, Lifan Yuan, Ziming Ji, Indranil Das, Junyi Cao, Jinchen He, Yifan Su, Jiabin Yu, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Xinan Chen, Peixue Wu, Yunkai Wang, Juntai Zhou, Yong Zhao, Farshid Jafarpour, Jessie Shelton, Aaron Young, John Bartolotta, Wenchao Xu, Yue Sun, Anjun Chu, Victor Colussi, Chris Akers, Nathan Brooks, Wenbo Fu, Christopher Wilson, Jinchao Zhao, Marvin Qi, Anqi Mu, Yubo Yang, Allen Zang, Yang Lyu, Peizhi Mai, Xuefei Guo, Luyu Gao, Ze Yang, Chi Xue, Dmytro Bandak, Yaïr Hein, Yonatan Kahn, Kevin Zhou, John Drew Wilson, Jarrod T. Reilly, Di Luo, Daniel Inafuku, Hao Tong, Liang Yang, Ruixing Zhang, Xueying Wang, Ofir Press, Nicolas Chia, Eliu Huerta, Hao Peng
cs.AI
Abstract
While large language models (LLMs) with reasoning capabilities are
progressing rapidly on high-school math competitions and coding, can they
reason effectively through complex, open-ended challenges found in frontier
physics research? And crucially, what kinds of reasoning tasks do physicists
want LLMs to assist with? To address these questions, we present CritPt
(Complex Research using Integrated Thinking - Physics Test, pronounced
"critical point"), the first benchmark designed to test LLMs on unpublished,
research-level reasoning tasks spanning modern physics research areas,
including condensed matter, quantum physics, atomic, molecular & optical
physics, astrophysics, high energy physics, mathematical physics, statistical
physics, nuclear physics, nonlinear dynamics, fluid dynamics, and biophysics.
CritPt consists of 71 composite research challenges designed to simulate
entry-level, full-scale research projects, which are further decomposed into
190 simpler checkpoint tasks for more fine-grained insights. All problems are
newly created by 50+ active physics researchers based on their own research.
Every problem is hand-curated to admit a guess-resistant and machine-verifiable
answer and is evaluated by an automated grading pipeline heavily customized for
advanced physics-specific output formats. We find that while current
state-of-the-art LLMs show early promise on isolated checkpoints, they remain
far from being able to reliably solve full research-scale challenges: the best
average accuracy among base models is only 4.0%, achieved by GPT-5 (high),
rising moderately to around 10% when equipped with coding tools. Through the
realistic yet standardized evaluation offered by CritPt, we highlight a large
disconnect between current model capabilities and realistic physics research
demands, offering a foundation to guide the development of scientifically
grounded AI tools.
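
The abstract states that every answer is hand-curated to be machine-verifiable and is graded by an automated pipeline customized for physics-specific output formats. As a minimal illustrative sketch only (not the authors' actual pipeline; the function names and tolerances below are assumptions), such a grader might accept numeric answers within a relative tolerance and check symbolic answers for algebraic equivalence:

```python
import math
import sympy as sp

def grade_numeric(model_answer: str, reference: float, rel_tol: float = 1e-4) -> bool:
    """Hypothetical check: accept a numeric answer within a relative tolerance."""
    try:
        value = float(model_answer)
    except ValueError:
        return False  # unparseable output is marked wrong
    return math.isclose(value, reference, rel_tol=rel_tol)

def grade_symbolic(model_answer: str, reference: str) -> bool:
    """Hypothetical check: accept a symbolic answer algebraically equal to the reference."""
    try:
        diff = sp.simplify(sp.sympify(model_answer) - sp.sympify(reference))
    except (sp.SympifyError, TypeError):
        return False
    return diff == 0

# Usage: both answers would pass this illustrative grader.
assert grade_numeric("6.2832", 2 * math.pi, rel_tol=1e-3)
assert grade_symbolic("sin(x)**2 + cos(x)**2", "1")
```

Tolerance-based numeric matching and symbolic-equivalence checking are one plausible way to keep answers guess-resistant yet automatically verifiable; the paper describes its own pipeline as heavily customized well beyond a sketch like this.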