R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning
May 27, 2025
Authors: Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang, Yang Zhang, Chuchu Fan
cs.AI
Abstract
Despite advances in the reasoning and planning of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, where textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research offers little guidance on how to align pre-trained LLMs to leverage code effectively and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder, owing to high task diversity and the expense of code execution, which highlights the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), while exhibiting emergent self-checking behavior via code generation. Datasets, code, and models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
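To make the approach concrete, below is a minimal, hypothetical sketch of the multi-turn generate-execute loop the abstract describes: the model emits a completion, any Python-fenced code query in it is executed by an interpreter, and the execution output is appended to the context for the next reasoning step. The `generate` callable, the `run_code` helper, the fence convention, and the turn budget are illustrative assumptions, not the paper's actual interface.

```python
import re
import subprocess
import sys
import tempfile

# Illustrative convention: code queries arrive as ```python ... ``` blocks.
CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_code(code: str, timeout: int = 10) -> str:
    """Execute a generated snippet in a subprocess and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return "Execution timed out."

def interpreter_loop(generate, prompt: str, max_turns: int = 5) -> str:
    """Alternate between model generation and code execution until a turn
    contains no code query or the turn budget is exhausted. `generate` is
    any callable mapping a conversation string to the next completion."""
    transcript = prompt
    for _ in range(max_turns):
        completion = generate(transcript)
        transcript += completion
        match = CODE_BLOCK.search(completion)
        if match is None:  # no code query: treat the completion as the answer
            return completion
        # Feed the execution result back for the next reasoning step.
        transcript += f"\nCode output:\n{run_code(match.group(1))}\n"
    return transcript
```

The abstract's "masked vs. unmasked code outputs" ablation plausibly refers to whether interpreter-returned tokens contribute to the training loss. Under that assumption, a sketch of such masking in a policy-gradient objective might look like the following; the tensor names are hypothetical.

```python
import torch

def masked_policy_loss(logprobs, advantages, interpreter_mask):
    """Per-token policy-gradient loss in which interpreter-returned tokens
    (interpreter_mask == 0) are excluded, so the model is never trained to
    imitate tool output it did not generate itself."""
    mask = interpreter_mask.float()
    return -(logprobs * advantages * mask).sum() / mask.sum().clamp(min=1.0)
```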