Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
September 30, 2025
Authors: Minhui Zhu, Minyang Tian, Xiaocheng Yang, Tianci Zhou, Penghao Zhu, Eli Chertkov, Shengyan Liu, Yufeng Du, Lifan Yuan, Ziming Ji, Indranil Das, Junyi Cao, Jinchen He, Yifan Su, Jiabin Yu, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Xinan Chen, Peixue Wu, Yunkai Wang, Juntai Zhou, Yong Zhao, Farshid Jafarpour, Jessie Shelton, Aaron Young, John Bartolotta, Wenchao Xu, Yue Sun, Anjun Chu, Victor Colussi, Chris Akers, Nathan Brooks, Wenbo Fu, Christopher Wilson, Jinchao Zhao, Marvin Qi, Anqi Mu, Yubo Yang, Allen Zang, Yang Lyu, Peizhi Mai, Xuefei Guo, Luyu Gao, Ze Yang, Chi Xue, Dmytro Bandak, Yaïr Hein, Yonatan Kahn, Kevin Zhou, John Drew Wilson, Jarrod T. Reilly, Di Luo, Daniel Inafuku, Hao Tong, Liang Yang, Ruixing Zhang, Xueying Wang, Ofir Press, Nicolas Chia, Eliu Huerta, Hao Peng
cs.AI
Abstract
While large language models (LLMs) with reasoning capabilities are
progressing rapidly on high-school math competitions and coding, can they
reason effectively through complex, open-ended challenges found in frontier
physics research? And crucially, what kinds of reasoning tasks do physicists
want LLMs to assist with? To address these questions, we present CritPt
(Complex Research using Integrated Thinking - Physics Test, pronounced
"critical point"), the first benchmark designed to test LLMs on unpublished,
research-level reasoning tasks that broadly cover modern physics research
areas, including condensed matter, quantum physics, atomic, molecular & optical
physics, astrophysics, high energy physics, mathematical physics, statistical
physics, nuclear physics, nonlinear dynamics, fluid dynamics, and biophysics.
CritPt consists of 71 composite research challenges designed to simulate
full-scale research projects at the entry level, which are also decomposed into
190 simpler checkpoint tasks for more fine-grained insights. All problems are
newly created by 50+ active physics researchers based on their own research.
Every problem is hand-curated to admit a guess-resistant and machine-verifiable
answer and is evaluated by an automated grading pipeline heavily customized for
advanced physics-specific output formats. We find that while current
state-of-the-art LLMs show early promise on isolated checkpoints, they remain
far from being able to reliably solve full research-scale challenges: the best
average accuracy among base models is only 4.0%, achieved by GPT-5 (high),
moderately rising to around 10% when equipped with coding tools. Through the
realistic yet standardized evaluation offered by CritPt, we highlight a large
disconnect between current model capabilities and realistic physics research
demands, offering a foundation to guide the development of scientifically
grounded AI tools.
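
The abstract states that every problem admits a guess-resistant, machine-verifiable answer graded by an automated pipeline customized for physics-specific output formats. As an illustration only (not the authors' actual pipeline), the following minimal Python sketch shows what machine verification of physics answers could look like; the helper names grade_numeric and grade_symbolic, the tolerance value, and the use of sympy are assumptions for this example.

# Hypothetical sketch of machine-verifiable answer checking, not the CritPt grader:
# numeric answers are compared within a relative tolerance, symbolic answers are
# checked by simplifying the difference from the reference expression to zero.
import math
import sympy as sp

def grade_numeric(model_answer: str, reference: float, rel_tol: float = 1e-3) -> bool:
    """Accept a numeric answer if it matches the reference within a relative tolerance."""
    try:
        value = float(sp.N(sp.sympify(model_answer)))  # tolerates expressions like "2*pi/3"
    except (sp.SympifyError, TypeError, ValueError):
        return False
    return math.isclose(value, reference, rel_tol=rel_tol)

def grade_symbolic(model_answer: str, reference: str) -> bool:
    """Accept a symbolic answer if its difference from the reference simplifies to zero."""
    try:
        diff = sp.simplify(sp.sympify(model_answer) - sp.sympify(reference))
    except (sp.SympifyError, TypeError):
        return False
    return diff == 0

if __name__ == "__main__":
    print(grade_numeric("2*pi/3", 2.0944))               # True: within 0.1% of 2*pi/3
    print(grade_symbolic("sin(x)**2 + cos(x)**2", "1"))  # True: identity simplifies to 1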