ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

January 15, 2026
Authors: Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, Jing Shao
cs.AI

Abstract

While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.
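The abstract describes, at a high level, a guardrail that judges each proposed tool call before execution and feeds its rationale back into the agent's reasoning loop. The sketch below is a minimal illustration of that control flow only, not the paper's implementation: the `agent`, `guard`, and `tools` interfaces, and the `Action` fields (`is_final`, `answer`, `tool`, `arguments`), are all hypothetical names invented for this example.

```python
# Minimal sketch of a pre-execution guardrail in a ReAct-style loop,
# in the spirit of the TS-Flow description above. Every interface here
# (agent, guard, tools, Action fields) is a hypothetical placeholder.
from dataclasses import dataclass, field


@dataclass
class Judgment:
    safe: bool      # pre-execution safety verdict
    feedback: str   # natural-language rationale for the verdict


@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.steps.append({"role": role, "content": content})


def run_with_guardrail(agent, guard, tools, task: str, max_steps: int = 10):
    """ReAct-style loop with a step-level guardrail: each proposed tool
    call is judged before execution; unsafe calls are blocked and the
    guard's feedback is appended to the history so the agent can re-plan."""
    history = Trajectory()
    history.add("user", task)
    for _ in range(max_steps):
        action = agent.propose_action(history)  # thought + tool call
        if action.is_final:
            return action.answer
        verdict: Judgment = guard.judge(history, action)  # pre-execution check
        if not verdict.safe:
            # Proactive intervention: block the call, surface the rationale.
            history.add("guardrail", f"Blocked {action.tool}: {verdict.feedback}")
            continue
        observation = tools[action.tool](**action.arguments)
        history.add("tool", str(observation))
    return None  # step budget exhausted
```

Feeding the guard's feedback back into the trajectory (rather than silently dropping the blocked call) is what lets the agent recover and continue the benign part of a task, which is consistent with the reported improvement in benign task completion under attack.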