AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Abstract

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities, yet they introduce broad new sources of safety risk. Meanwhile, advanced frontier AI models drastically lower the barrier to attack, leaving current agent alignment frameworks inadequate for real-world deployment. To address these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to cover emergent risks arising from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification and use it to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) on only around 1k samples, achieving performance comparable to leading closed-source models (e.g., GPT-5.4). Building on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.