AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Abstract

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities, yet they also introduce a broad range of new safety risks. Meanwhile, advanced frontier AI models drastically lower the barrier to attack, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to cover emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification and use it to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) on only about 1k samples, achieving performance comparable to leading closed-source models (e.g., GPT-5.4). Building on AgentDoG 1.5, we construct a highly efficient training environment for agentic safety supervised fine-tuning (SFT) and reinforcement learning (RL), which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experiments indicate that AgentDoG 1.5 achieves state-of-the-art performance across diverse and complex interactive agentic scenarios. All models and datasets are openly released.