R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning

Abstract

Despite advances in the reasoning and planning abilities of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, domains in which textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning and when to use code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to leverage code effectively and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder because of high task diversity and expensive code execution, which highlights the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), and it exhibits emergent self-checking behavior via code generation. Datasets, code, and models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
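As an illustration of what "generating multiple code queries during step-by-step reasoning" can look like at inference time, the following is a minimal sketch and not the authors' implementation. It assumes the model marks code with a Markdown-style python fence and that execution output is appended to the transcript as plain text. The names `solve`, `run_code`, and `stub_model` are hypothetical, and `exec` is used only for brevity (it is not sandboxed).

```python
import contextlib
import io
import re
from typing import Callable, List

FENCE = "`" * 3  # built programmatically so this listing contains no literal fence
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_code(code: str) -> str:
    """Execute code in a fresh namespace and return what it printed.

    Assumption: a real system would use an isolated, time-limited sandbox.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as err:  # report the error back to the model as text
        return f"{type(err).__name__}: {err}\n"
    return buffer.getvalue()


def solve(question: str,
          generate: Callable[[List[str]], str],
          max_turns: int = 5) -> str:
    """Alternate between model turns and code execution.

    `generate` is a hypothetical callable mapping the transcript so far
    to the model's next message.
    """
    transcript = [question]
    for _ in range(max_turns):
        reply = generate(transcript)
        transcript.append(reply)
        match = CODE_RE.search(reply)
        if match is None:
            # No code query: treat the reply as the final textual answer.
            return reply
        transcript.append("Execution output:\n" + run_code(match.group(1)))
    return transcript[-1]


def stub_model(transcript: List[str]) -> str:
    """Toy stand-in for an LLM: one code query, then a final answer."""
    if len(transcript) == 1:
        return "Let me compute.\n" + FENCE + "python\nprint(17 * 23)\n" + FENCE
    return "Answer: " + transcript[-1].split("\n")[1]


if __name__ == "__main__":
    print(solve("What is 17 * 23?", stub_model))
```

The loop ends as soon as a reply contains no code, which is one simple way to let the model itself choose between textual reasoning and code on each turn.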

