

R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning

May 27, 2025
Authors: Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang, Yang Zhang, Chuchu Fan
cs.AI

Abstract

Despite advances in the reasoning and planning abilities of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, where textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to effectively leverage code and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder due to high task diversity and expensive code execution, highlighting the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), while exhibiting emergent self-checking behavior via code generation. Datasets, code, and models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
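To make the setup concrete, below is a minimal, hypothetical Python sketch of the multi-turn code-interpreter loop the abstract describes: the model interleaves textual reasoning with code queries, each query is executed in a sandbox, and the execution output is appended to the context before the next turn. The <code>...</code> answer format, the generate() interface, and the turn limit are illustrative assumptions, not the authors' actual implementation.

    import re
    import subprocess
    import tempfile

    MAX_TURNS = 5  # assumed cap on interleaved code queries per question

    def run_sandboxed(code: str, timeout: int = 10) -> str:
        # Execute a generated Python snippet in a subprocess and return its output.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True,
                                    text=True, timeout=timeout)
            return result.stdout + result.stderr
        except subprocess.TimeoutExpired:
            return "Execution timed out."

    def answer_with_code_interpreter(model, question: str) -> str:
        # Interleave step-by-step textual reasoning with code execution until the
        # model emits a final answer or the turn budget is exhausted.
        transcript = question
        for _ in range(MAX_TURNS):
            step = model.generate(transcript)  # assumed text-completion interface
            transcript += step
            code_blocks = re.findall(r"<code>(.*?)</code>", step, re.DOTALL)
            if code_blocks:
                # Feed the execution result back so the next turn can reason over
                # it (e.g., to self-check a previously derived answer).
                observation = run_sandboxed(code_blocks[-1])
                transcript += "\nExecution output:\n" + observation + "\n"
            elif "Final answer:" in step:
                return step.split("Final answer:")[-1].strip()
        return transcript  # no final answer emerged within the turn budget

During RL fine-tuning, the "masked vs. unmasked code outputs" ablation mentioned in the abstract presumably corresponds to excluding the interpreter's "Execution output" tokens from the policy loss, so that gradients flow only through tokens the model itself generated.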

