Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

Abstract

AI agents that interact with their environments through tools enable powerful applications, but in high-stakes business settings, unintended actions can cause unacceptable harm, such as privacy breaches and financial loss. Existing mitigations, such as training-based methods and neural guardrails, improve agent reliability but cannot provide guarantees. We study symbolic guardrails as a practical path toward strong safety and security guarantees for AI agents. Our three-part study (1) systematically reviews 80 state-of-the-art agent safety and security benchmarks to identify the policies they evaluate, (2) analyzes which policy requirements symbolic guardrails can guarantee, and (3) evaluates how symbolic guardrails affect safety, security, and agent success on τ²-Bench, CAR-bench, and MedAgentBench. We find that 85% of benchmarks lack concrete policies, relying instead on underspecified high-level goals or common sense. Among the policies that are specified, 74% of policy requirements can be enforced by symbolic guardrails, often using simple, low-cost mechanisms. These guardrails improve safety and security without sacrificing agent utility. Overall, our results suggest that symbolic guardrails are a practical and effective way to guarantee some safety and security requirements, especially for domain-specific AI agents. We release all code and artifacts at https://github.com/hyn0027/agent-symbolic-guardrails.
