R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning
May 27, 2025
Authors: Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang, Yang Zhang, Chuchu Fan
cs.AI
Abstract
Despite advances in the reasoning and planning of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, where textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research offers little guidance on how to align pre-trained LLMs to leverage code effectively and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder, owing to high task diversity and the expense of code execution, which highlights the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), while exhibiting emergent self-checking behavior via code generation. Datasets, code, and models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
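To make the approach concrete, below is a minimal, hypothetical sketch of the multi-turn generate-execute loop the abstract describes: the model emits a completion, any Python-fenced code query in it is executed by an interpreter, and the execution output is appended to the context for the next reasoning step. The `generate` callable, the `run_code` helper, the fence convention, and the turn budget are illustrative assumptions, not the paper's actual interface.

```python
import re
import subprocess
import sys
import tempfile

# Illustrative convention: code queries arrive as ```python ... ``` blocks.
CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_code(code: str, timeout: int = 10) -> str:
    """Execute a generated snippet in a subprocess and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return "Execution timed out."

def interpreter_loop(generate, prompt: str, max_turns: int = 5) -> str:
    """Alternate between model generation and code execution until a turn
    contains no code query or the turn budget is exhausted. `generate` is
    any callable mapping a conversation string to the next completion."""
    transcript = prompt
    for _ in range(max_turns):
        completion = generate(transcript)
        transcript += completion
        match = CODE_BLOCK.search(completion)
        if match is None:  # no code query: treat the completion as the answer
            return completion
        # Feed the execution result back for the next reasoning step.
        transcript += f"\nCode output:\n{run_code(match.group(1))}\n"
    return transcript
```

The abstract's "masked vs. unmasked code outputs" ablation plausibly refers to whether interpreter-returned tokens contribute to the training loss. Under that assumption, a sketch of such masking in a policy-gradient objective might look like the following; the tensor names are hypothetical.

```python
import torch

def masked_policy_loss(logprobs, advantages, interpreter_mask):
    """Per-token policy-gradient loss in which interpreter-returned tokens
    (interpreter_mask == 0) are excluded, so the model is never trained to
    imitate tool output it did not generate itself."""
    mask = interpreter_mask.float()
    return -(logprobs * advantages * mask).sum() / mask.sum().clamp(min=1.0)
```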