
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

September 2, 2025
Authors: Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, Bo An
cs.AI

Abstract

Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
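
To make the filtering rule concrete, below is a minimal, hypothetical Python sketch of how void-turn trajectories could be dropped from a rollout batch before the policy update. The string-based turn representation, the triple-backtick code-block marker, and the \boxed{} final-answer marker are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of SimpleTIR-style trajectory filtering (illustrative only).
# Assumptions (not from the paper): turns are plain strings, tool-call code
# blocks are delimited by triple-backtick fences, and final answers use \boxed{}.

CODE_FENCE = "`" * 3  # triple backtick, built programmatically for readability


def is_void_turn(turn_text: str) -> bool:
    """A turn is 'void' if it yields neither a code block nor a final answer."""
    has_code_block = turn_text.count(CODE_FENCE) >= 2    # an opening and a closing fence
    has_final_answer = "\\boxed{" in turn_text            # assumed final-answer marker
    return not (has_code_block or has_final_answer)


def filter_void_trajectories(trajectories: list[dict]) -> list[dict]:
    """Drop every trajectory containing at least one void turn, so its
    high-magnitude gradients never reach the policy update."""
    return [
        traj for traj in trajectories
        if not any(is_void_turn(turn) for turn in traj["turns"])
    ]


if __name__ == "__main__":
    batch = [
        # Kept: the model emits a code block in its turn.
        {"turns": ["Let me compute it.\n" + CODE_FENCE + "python\nprint(2 + 2)\n" + CODE_FENCE],
         "reward": 1.0},
        # Void: neither a code block nor a \boxed{} answer appears.
        {"turns": ["Hmm, I am not sure how to proceed..."], "reward": 0.0},
    ]
    clean_batch = filter_void_trajectories(batch)
    assert len(clean_batch) == 1  # only the well-formed trajectory enters the RL loss
```

In a full RL pipeline, the surviving trajectories would then feed advantage estimation and the policy-gradient step, while the filtered ones contribute nothing to the loss, consistent with the abstract's description of removing problematic trajectories from the policy update.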