
Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

October 10, 2025
Authors: Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, Liubov Nedoshivina, Pin-Yu Chen, Prasanna Sattigeri, Xiangliang Zhang
cs.AI

Abstract

LLM agents can plan multi-step tasks, and intervening at the planning stage, before any action is executed, is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: a data gap, a model gap, and an evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre-execution safety. To close the model gap, we propose Safiron, a foundational guardrail that combines a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while the guardian model flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains of the proposed guardrail over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.
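The three-stage data pipeline described for AuraGen (benign synthesis, labeled risk injection, reward-based filtering) can be illustrated with a toy sketch. This is not the authors' implementation: the `Trajectory` type, the function names, the tool and risk-category strings, and the rule-based `toy_score` standing in for a learned reward model are all hypothetical and chosen only to make the data flow concrete.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Trajectory:
    """A planned (not yet executed) sequence of agent actions. Hypothetical type."""
    steps: Tuple[str, ...]
    risky: bool = False
    risk_type: Optional[str] = None  # category label; None when benign


def inject_risk(benign: Trajectory, risk_type: str,
                risky_step: str, position: int) -> Trajectory:
    """Return a copy of `benign` with one step swapped for a risky one,
    carrying the risk category as its label."""
    steps = list(benign.steps)
    steps[position] = risky_step
    return Trajectory(steps=tuple(steps), risky=True, risk_type=risk_type)


def filter_by_reward(samples: List[Trajectory],
                     score: Callable[[Trajectory], float],
                     threshold: float) -> List[Trajectory]:
    """Keep only samples whose quality score reaches the threshold."""
    return [s for s in samples if score(s) >= threshold]


def toy_score(t: Trajectory) -> float:
    """Rule-based stand-in for a reward model (assumption, not the paper's):
    accept trajectories that have at least two steps and a label that is
    consistent with the risky flag."""
    labeled_consistently = t.risky == (t.risk_type is not None)
    return 1.0 if labeled_consistently and len(t.steps) >= 2 else 0.0


# (i) a benign trajectory, written by hand here instead of synthesized
benign = Trajectory(steps=("search_flights", "book_flight"))
# (ii) a risky variant with a hypothetical category label
risky = inject_risk(benign, "privacy_leak", "email_passport_scan", position=1)
# a malformed sample: flagged risky but missing its category label
malformed = Trajectory(steps=("noop",), risky=True)
# (iii) filtering drops the malformed sample
corpus = filter_by_reward([benign, risky, malformed], toy_score, threshold=0.5)
```

In this sketch each retained sample carries the fields a pre-execution guardrail would need for supervision: a risky/benign flag and, for risky cases, a category. The rationale that Safiron is described as generating is omitted for brevity.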