Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

Abstract

The advent of DeepSeek-R1 marked a notable advance in LLM reasoning capabilities. Much of this progress, however, has been fueled by abundant internet question-answer (QA) pairs. Such data is limited in scale and concentrated mainly in domains like mathematics, which makes it a major bottleneck going forward. Other sciences such as physics lack QA datasets large enough to train reasoning-capable models effectively. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. We generate random scenes in physics engines, create synthetic question-answer pairs from the simulated interactions, and train LLMs on this synthetic data with reinforcement learning. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limits of internet-scale QA data. Code is available at https://sim2reason.github.io/.
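The pipeline described in the abstract (random scene, simulation, synthetic QA pair, verifiable reward for reinforcement learning) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the abstract does not name the physics engine, the scene types, the question templates, or the reward, so a hand-written projectile integrator stands in for the engine, and `Scene`, `sample_scene`, `simulate_range`, `make_qa`, and `reward` are hypothetical names, not the authors' code.

```python
import math
import random
import re
from dataclasses import dataclass


@dataclass
class Scene:
    """Hypothetical toy scene: a projectile launched from flat ground."""
    speed: float        # launch speed in m/s
    angle_deg: float    # launch angle above the horizontal, in degrees
    g: float = 9.81     # gravitational acceleration in m/s^2


def sample_scene(rng: random.Random) -> Scene:
    """Draw random scene parameters (ranges are arbitrary assumptions)."""
    return Scene(speed=round(rng.uniform(5.0, 30.0), 1),
                 angle_deg=round(rng.uniform(15.0, 75.0), 1))


def simulate_range(scene: Scene, dt: float = 1e-4) -> float:
    """Step the projectile forward in time and return its horizontal range.

    A stand-in for a real physics engine: semi-implicit Euler integration,
    with linear interpolation on the final step to locate the landing point.
    """
    theta = math.radians(scene.angle_deg)
    x, y = 0.0, 0.0
    vx = scene.speed * math.cos(theta)
    vy = scene.speed * math.sin(theta)
    while True:
        vy_new = vy - scene.g * dt
        x_new = x + vx * dt
        y_new = y + vy_new * dt
        if y_new < 0.0:
            frac = y / (y - y_new)  # fraction of the step spent above ground
            return x + frac * (x_new - x)
        x, y, vy = x_new, y_new, vy_new


def make_qa(scene: Scene) -> tuple[str, float]:
    """Turn a simulated scene into a synthetic question and numeric answer."""
    question = (f"A ball is launched from level ground at {scene.speed} m/s, "
                f"{scene.angle_deg} degrees above the horizontal. Ignoring air "
                f"resistance and taking g = {scene.g} m/s^2, how far away does "
                f"it land, in metres?")
    return question, simulate_range(scene)


def reward(response: str, answer: float, rel_tol: float = 0.02) -> float:
    """Binary reward: 1.0 if the last number in the response matches the
    simulated answer within a relative tolerance, otherwise 0.0."""
    numbers = re.findall(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?", response)
    if not numbers:
        return 0.0
    predicted = float(numbers[-1])
    return 1.0 if abs(predicted - answer) <= rel_tol * abs(answer) else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    question, answer = make_qa(sample_scene(rng))
    print(question)
    print(reward(f"The ball lands about {answer:.2f} m away.", answer))
```

In a sketch like this, the simulator rather than a human-written answer key supplies the ground truth, so new question-answer pairs can be generated in unlimited number by sampling more scenes. The reward function is the only part a reinforcement learning loop would need to call on each model response.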
