Solving Physics Olympiad Problems with Reinforcement Learning on Physics Simulators

Abstract

We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question-answer (QA) pairs. Such data is limited in scale and concentrated mainly in domains like mathematics, which makes it a major bottleneck going forward. In contrast, other sciences such as physics lack large-scale QA datasets for effectively training reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs in physical reasoning. We generate random scenes in physics engines, create synthetic QA pairs from the simulated interactions, and train LLMs with reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data. Code available at: https://sim2reason.github.io/.
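
The pipeline summarized above (random scene, simulated outcome, synthetic QA pair, verifiable reward) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a hand-written integrator for a single vertically thrown ball in place of a real physics engine, and the names `simulate_fall_time`, `make_qa_pair`, and `reward` are hypothetical, chosen only for illustration.

```python
import math
import random


def simulate_fall_time(height, v0, g=9.81, dt=1e-4):
    """Integrate a ball thrown straight up at v0 from `height` until it reaches y = 0.

    Stand-in for a physics engine (an assumption of this sketch).
    """
    y, v, t = height, v0, 0.0
    while y > 0.0:
        v -= g * dt  # semi-implicit Euler: update velocity first
        y += v * dt  # then advance position with the new velocity
        t += dt
    return t


def make_qa_pair(rng):
    """Sample a random scene and turn its simulated outcome into a QA pair."""
    height = round(rng.uniform(1.0, 50.0), 1)  # metres
    v0 = round(rng.uniform(0.0, 20.0), 1)      # metres per second, upward
    answer = simulate_fall_time(height, v0)
    question = (
        f"A ball is thrown straight up at {v0} m/s from a height of {height} m. "
        "Ignoring air resistance and taking g = 9.81 m/s^2, how many seconds "
        "pass before it hits the ground?"
    )
    return question, answer


def reward(predicted, target, rel_tol=0.02):
    """Binary reward: 1.0 if the prediction is within rel_tol of the simulated answer."""
    return 1.0 if math.isclose(predicted, target, rel_tol=rel_tol) else 0.0


rng = random.Random(0)  # fixed seed so the sampled scene is reproducible
question, answer = make_qa_pair(rng)
```

In a reinforcement learning loop, the model would receive `question`, and its parsed numeric answer would be scored by `reward` against the simulated `answer`. The supervision signal comes from the simulator rather than from human-written solutions, which is the property the abstract relies on.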