P1: Mastering Physics Olympiads with Reinforcement Learning

Abstract

Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning: the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics, which binds symbols to reality in a fundamental way and serves as the cornerstone of most modern technologies, is the sharpest test of this shift. In this work, we aim to advance physics research by developing large language models with exceptional physics reasoning capabilities, in particular models that excel at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model to achieve gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and it wins 12 gold medals across 13 international and regional physics competitions held in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, earning a silver medal. When further equipped with the agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions ranks No. 1 overall on IPhO 2025 and obtains the highest average score over the 13 physics competitions. Beyond physics, P1 models also perform strongly on other reasoning tasks such as math and coding, demonstrating the strong generalizability of the P1 series.