

Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

October 10, 2025
作者: Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, Liubov Nedoshivina, Pin-Yu Chen, Prasanna Sattigeri, Xiangliang Zhang
cs.AI

Abstract

While LLM agents can plan multi-step tasks, intervening at the planning stage, before any action is executed, is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: a data gap, a model gap, and an evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large, reliable corpora for pre-execution safety. To close the guardian-model gap, we propose a foundational guardrail, Safiron, which combines a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains of the proposed guardrail over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.