### Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors
January 22, 2026
Authors: Zhiwei Zhang, Fei Zhao, Rui Wang, Zezhong Wang, Bin Liang, Jiakang Wang, Yao Hu, Shaosheng Cao, Kam-Fai Wong
cs.AI
Abstract
Large language models (LLMs) can call tools effectively, yet they remain brittle in multi-turn execution: after a tool call error, smaller models often degenerate into repetitive invalid re-invocations, failing to interpret the error feedback and self-correct. This brittleness hinders reliable real-world deployment, where execution errors are inevitable during tool interaction. We identify a key limitation of current approaches: standard reinforcement learning (RL) treats errors as sparse negative rewards, providing no guidance on how to recover, while pre-collected synthetic error-correction datasets suffer from a distribution mismatch with the model's on-policy error modes. To bridge this gap, we propose Fission-GRPO, a framework that converts execution errors into corrective supervision within the RL training loop. Our core mechanism fissions each failed trajectory into a new training instance by augmenting it with diagnostic feedback from a finetuned Error Simulator, then resampling recovery rollouts on-policy. This lets the model learn from the precise errors it makes during exploration rather than from static, pre-collected error cases. On the BFCL v4 Multi-Turn benchmark, Fission-GRPO improves the error recovery rate of Qwen3-8B by 5.7% absolute and, crucially, yields a 4% overall accuracy gain over GRPO (42.75% to 46.75%), outperforming specialized tool-use agents.
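To make the fission mechanism concrete, below is a minimal Python sketch of one training step as described in the abstract: failed rollouts are augmented with Error Simulator feedback and resampled on-policy before a GRPO-style update. All helper names (`policy.rollout`, `error_simulator.diagnose`, `task.with_context`, `is_execution_error`, `grpo_update`) are hypothetical placeholders for illustration, not the authors' actual API.

```python
# Hypothetical sketch of one Fission-GRPO training step.
# Helper functions and method names are illustrative assumptions,
# not the paper's implementation.

def fission_grpo_step(policy, error_simulator, tasks, group_size=8):
    training_groups = []
    for task in tasks:
        # Standard GRPO: sample a group of on-policy tool-use trajectories.
        rollouts = [policy.rollout(task) for _ in range(group_size)]
        training_groups.append((task, rollouts))

        # Fission: each failed trajectory spawns a new training instance.
        for traj in rollouts:
            if is_execution_error(traj):
                # The finetuned Error Simulator turns the raw tool error
                # into diagnostic feedback appended to the failed context.
                feedback = error_simulator.diagnose(traj)
                recovery_task = task.with_context(traj, feedback)

                # Resample recovery rollouts on-policy from the error state,
                # so the model learns to correct its own current mistakes.
                recovery_rollouts = [
                    policy.rollout(recovery_task) for _ in range(group_size)
                ]
                training_groups.append((recovery_task, recovery_rollouts))

    # One GRPO-style policy update over original and fissioned groups.
    grpo_update(policy, training_groups)
```

The key design point this sketch highlights is that the corrective supervision is generated inside the RL loop from the policy's own failures, rather than from a static error-correction dataset collected in advance.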