Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

Abstract

We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question-answer (QA) pairs. This reliance is a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains such as mathematics. In contrast, other sciences such as physics lack large-scale QA datasets for effectively training reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. We generate random scenes in physics engines, create synthetic question-answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data. Code available at: https://sim2reason.github.io/.
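The pipeline summarized in the abstract (sample a random scene, simulate it, turn the outcome into a QA pair, and score a model's answer with a verifiable reward) can be illustrated with a toy sketch. This is not the authors' implementation: the projectile scene, the hand-written integrator standing in for a real physics engine, the tolerance-based reward, and the names `random_scene`, `simulate_range`, `make_qa_pair`, and `reward` are all assumptions chosen for illustration.

```python
import math
import random

G = 9.81  # gravitational acceleration in m/s^2 (assumed constant)


def random_scene(rng):
    """Sample a hypothetical scene: a projectile launched from flat ground."""
    return {
        "speed": round(rng.uniform(5.0, 30.0), 1),       # launch speed, m/s
        "angle_deg": round(rng.uniform(20.0, 70.0), 1),  # angle above horizontal
    }


def simulate_range(scene, dt=1e-4):
    """Integrate the motion step by step (stand-in for a physics engine)."""
    theta = math.radians(scene["angle_deg"])
    vx = scene["speed"] * math.cos(theta)
    vy = scene["speed"] * math.sin(theta)
    x = y = 0.0
    while True:
        vy -= G * dt                 # update velocity first (semi-implicit Euler)
        x_new = x + vx * dt
        y_new = y + vy * dt
        if y_new < 0.0:
            # Linearly interpolate between the last two states to find landing.
            frac = y / (y - y_new)
            return x + frac * (x_new - x)
        x, y = x_new, y_new


def make_qa_pair(scene):
    """Turn a simulated scene into a text question and a numeric answer."""
    question = (
        f"A ball is launched from level ground at {scene['speed']} m/s, "
        f"{scene['angle_deg']} degrees above the horizontal. Ignoring air "
        f"resistance and taking g = {G} m/s^2, how far away does it land (in m)?"
    )
    answer = round(simulate_range(scene), 2)
    return question, answer


def reward(predicted, truth, rel_tol=0.05):
    """Binary reward: 1.0 if the prediction is within rel_tol of the truth."""
    return 1.0 if abs(predicted - truth) <= rel_tol * abs(truth) else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    question, answer = make_qa_pair(random_scene(rng))
    print(question)
    print("simulated answer:", answer)
```

In a reinforcement learning loop, the question would be given to the model and `reward` would compare its final numeric answer with the simulated value. Because the label comes from simulation rather than from human-written solutions, new training examples can be generated without limit.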
