AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security
May 28, 2026
Authors: Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu
cs.AI
Abstract
Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities, yet they also introduce a broad range of new sources of safety risk. Meanwhile, the rapid advance of frontier AI models drastically lowers the barrier to attack, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to cover emergent risks in Codex and OpenClaw execution scenarios. On this basis, we build a taxonomy-guided data engine with influence-function purification and train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving performance comparable to leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient training environment for agentic safety supervised fine-tuning (SFT) and reinforcement learning (RL), which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.