AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security
May 28, 2026
Authors: Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu
cs.AI
Abstract
Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities, yet they also introduce a broad range of new safety risks. Meanwhile, advanced frontier AI models drastically lower the barrier to attack, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to cover emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving performance comparable to leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient training environment for agentic safety supervised fine-tuning (SFT) and reinforcement learning (RL), which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experiments indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.