AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Abstract

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities, yet they introduce a broad range of new safety risks. Meanwhile, frontier AI models drastically lower the barrier to attack, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to cover emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving performance comparable to leading closed-source models (e.g., GPT-5.4). Building on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experiments indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.