Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

Abstract

AI agents that interact with their environments through tools enable powerful applications, but in high-stakes business settings, unintended actions can cause unacceptable harm, such as privacy breaches and financial loss. Existing mitigations, such as training-based methods and neural guardrails, improve agent reliability but cannot provide guarantees. We study symbolic guardrails as a practical path toward strong safety and security guarantees for AI agents. Our three-part study includes a systematic review of 80 state-of-the-art agent safety and security benchmarks to identify the policies they evaluate, an analysis of which policy requirements can be guaranteed by symbolic guardrails, and an evaluation of how symbolic guardrails affect safety, security, and agent success on τ²-Bench, CAR-bench, and MedAgentBench. We find that 85% of benchmarks lack concrete policies, relying instead on underspecified high-level goals or common sense. Among the policies that are specified, 74% of policy requirements can be enforced by symbolic guardrails, often using simple, low-cost mechanisms. These guardrails improve safety and security without sacrificing agent utility. Overall, our results suggest that symbolic guardrails are a practical and effective way to guarantee some safety and security requirements, especially for domain-specific AI agents. We release all code and artifacts at https://github.com/hyn0027/agent-symbolic-guardrails.

