AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

January 26, 2026
Authors: Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, Binxin Hu, Ling Tang, Jilin Mei, Dadi Guo, Leitao Yuan, Junyao Yang, Guanxu Chen, Qihao Lin, Yi Yu, Bo Zhang, Jiaxuan Guo, Jie Zhang, Wenqi Shao, Huiqi Deng, Zhiheng Xi, Wenjie Wang, Wenxuan Wang, Wen Shen, Zhikai Chen, Haoyu Xie, Jialing Tao, Juntao Dai, Jiaming Ji, Zhongjie Ba, Linfeng Zhang, Yong Liu, Quanshi Zhang, Lei Zhu, Zhihua Wei, Hui Xue, Chaochao Lu, Jing Shao, Xia Hu
cs.AI

Abstract

The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained and contextual monitoring across agent trajectories. More crucially, AgentDoG can diagnose the root causes of unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation in diverse and complex interactive scenarios. All models and datasets are openly released.