AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Abstract

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities, yet they also introduce a broad range of new safety risks. Meanwhile, rapid progress in frontier AI models has drastically lowered the barrier to attack, leaving current agent alignment frameworks inadequate for real-world deployment. To address these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to cover emerging risks in Codex and OpenClaw execution scenarios. On top of this taxonomy, we build a taxonomy-guided data engine with influence-function purification, which lets us train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) on only about 1k samples while achieving performance comparable to leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient training environment for agentic safety supervised fine-tuning (SFT) and reinforcement learning (RL), which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experiments show that AgentDoG 1.5 achieves state-of-the-art performance across diverse and complex interactive agentic scenarios. All models and datasets are openly released.