Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

Abstract

While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through the complex, open-ended challenges found in frontier physics research? And crucially, what kinds of reasoning tasks do physicists want LLMs to assist with? To address these questions, we present CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "critical point"), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks. It broadly covers modern physics research areas, including condensed matter, quantum physics, atomic, molecular & optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics, and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed into 190 simpler checkpoint tasks for more fine-grained insights. All problems were newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant, machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges: the best average accuracy among base models is only 4.0%, achieved by GPT-5 (high), rising moderately to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.
