Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

Abstract

Large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, but can they reason effectively through the complex, open-ended challenges found in frontier physics research? And crucially, what kinds of reasoning tasks do physicists want LLMs to assist with? To address these questions, we present CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "critical point"), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks that broadly cover modern physics research areas, including condensed matter, quantum physics, atomic, molecular & optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed into 190 simpler checkpoint tasks for more fine-grained insights. All problems are newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant and machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges: the best average accuracy among base models is only 4.0%, achieved by GPT-5 (high), rising moderately to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.
