AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

January 26, 2026
Authors: Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, Binxin Hu, Ling Tang, Jilin Mei, Dadi Guo, Leitao Yuan, Junyao Yang, Guanxu Chen, Qihao Lin, Yi Yu, Bo Zhang, Jiaxuan Guo, Jie Zhang, Wenqi Shao, Huiqi Deng, Zhiheng Xi, Wenjie Wang, Wenxuan Wang, Wen Shen, Zhikai Chen, Haoyu Xie, Jialing Tao, Juntao Dai, Jiaming Ji, Zhongjie Ba, Linfeng Zhang, Yong Liu, Quanshi Zhang, Lei Zhu, Zhihua Wei, Hui Xue, Chaochao Lu, Jing Shao, Xia Hu
cs.AI

Abstract

The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To build an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained and contextual monitoring across agent trajectories. More crucially, AgentDoG can diagnose the root causes of unsafe actions and of seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across the Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation in diverse and complex interactive scenarios. All models and datasets are openly released.
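To make the idea of a diagnosis "beyond binary labels" concrete, the following is a minimal sketch of how a verdict along the three taxonomy axes (source, failure mode, consequence) might be represented. It is not AgentDoG's actual output format: the `Diagnosis` class, the `summarize` helper, and the example category strings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Diagnosis:
    """Hypothetical diagnostic record for one agent step (not the paper's schema)."""
    unsafe: bool                        # the usual binary label
    source: Optional[str] = None        # where the risk originates
    failure_mode: Optional[str] = None  # how the agent fails
    consequence: Optional[str] = None   # what harm results

    def __post_init__(self):
        # An unsafe verdict is only diagnostic if all three axes are filled in.
        axes = (self.source, self.failure_mode, self.consequence)
        if self.unsafe and any(a is None for a in axes):
            raise ValueError("an unsafe verdict needs all three axes filled in")


def summarize(d: Diagnosis) -> str:
    """Render a one-line, human-readable summary of a diagnosis."""
    if not d.unsafe:
        return "safe"
    return (f"unsafe: source={d.source}; "
            f"failure_mode={d.failure_mode}; consequence={d.consequence}")


# Invented category values, used only to show the shape of a record.
example = Diagnosis(
    unsafe=True,
    source="tool output",
    failure_mode="followed injected instruction",
    consequence="data leak",
)
print(summarize(example))
```

Keeping the three axes as separate fields mirrors the abstract's claim that they are orthogonal: each can be filled in or aggregated independently of the other two.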