SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Abstract

Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, which leads to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR blocks the harmful, high-magnitude gradients and thereby stabilizes the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably raising the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
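The void-turn filter described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trajectory` class, the function names, and the detection rules (code appears in a fenced block, the final answer is wrapped in `\boxed{...}`) are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed output conventions (hypothetical, not taken from the source):
# code is emitted in a fenced block and the final answer uses \boxed{...}.
FENCE = "`" * 3  # built programmatically so this example can itself be fenced
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\n.*?" + FENCE, re.DOTALL)
FINAL_ANSWER = re.compile(r"\\boxed\{")


@dataclass
class Trajectory:
    """Hypothetical container for one multi-turn rollout."""
    turns: List[str]  # model-generated text of each turn
    reward: float


def is_void_turn(turn: str) -> bool:
    """A turn is void if it has neither a code block nor a final answer."""
    has_code = CODE_BLOCK.search(turn) is not None
    has_answer = FINAL_ANSWER.search(turn) is not None
    return not (has_code or has_answer)


def has_void_turn(traj: Trajectory) -> bool:
    """True if any turn of the trajectory is void."""
    return any(is_void_turn(t) for t in traj.turns)


def filter_void_trajectories(batch: List[Trajectory]) -> List[Trajectory]:
    """Keep only trajectories with no void turn for the policy update."""
    return [traj for traj in batch if not has_void_turn(traj)]
```

In this sketch the whole trajectory is dropped, rather than only the offending turn, which matches the abstract's statement that trajectories containing void turns are removed from the policy update.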
