Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

Abstract

Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for improving model capabilities. However, RL alone often leads to unstable training or limited gains on tool-use tasks. In our experiments, some models exhibit catastrophic collapse: performance drops abruptly and the tool-invocation structure breaks down. Our analysis shows that these failures stem from unexpected probability spikes on specific control tokens, which disrupt the structured execution flow. The underlying tool-use capability is not lost; it is only obscured by these format-specific failures. To address this, we systematically investigate a diverse set of supervisory signals, including off-policy supervision, hint-based guidance, and erroneous-example supervision, among others, applied under both synchronous and interleaved training schemes. We find that interleaving supervised fine-tuning (SFT) with RL substantially improves stability, but degrades performance in evaluation settings where both format and content are out of distribution (OOD). We also analyze the impact of the learning rate and how well the methods generalize across settings. These results highlight the importance of understanding the mechanisms behind RL failures and demonstrate how diverse supervisory signals can guide exploratory learning, enabling robust training of LLMs for complex, multi-step tool-use tasks. Our code is available at https://github.com/hypasd-art/Tool-RL-Box.
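As a rough illustration of the interleaved training scheme mentioned above, the following minimal sketch builds a step-by-step schedule that alternates blocks of RL updates with blocks of SFT updates. The abstract does not state the actual ratio or block lengths, so the function names (`interleaved_phases`, `build_schedule`) and the default of three RL steps per SFT step are hypothetical choices made purely for illustration, not the authors' configuration.

```python
from itertools import cycle, islice
from typing import Iterator, List


def interleaved_phases(rl_steps: int, sft_steps: int) -> Iterator[str]:
    """Return an endless iterator of phase labels: rl_steps x 'rl', then sft_steps x 'sft'."""
    if rl_steps <= 0 or sft_steps <= 0:
        raise ValueError("both phase lengths must be positive")
    # One cycle of the schedule; the ratio is an assumption, not taken from the paper.
    block = ["rl"] * rl_steps + ["sft"] * sft_steps
    return cycle(block)


def build_schedule(total_steps: int, rl_steps: int = 3, sft_steps: int = 1) -> List[str]:
    """Materialize the first total_steps phase labels of the interleaved schedule."""
    return list(islice(interleaved_phases(rl_steps, sft_steps), total_steps))


if __name__ == "__main__":
    for step, phase in enumerate(build_schedule(8)):
        # A real trainer would dispatch to its own RL or SFT update here.
        print(step, phase)
```

In a training loop, each label would select which update to run at that step; a synchronous scheme would instead apply both kinds of signal within the same update, which this sketch does not cover.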