

Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

April 16, 2026
作者: Yining Hong, Yining She, Eunsuk Kang, Christopher S. Timperley, Christian Kästner
cs.AI

Abstract

AI agents that interact with their environments through tools enable powerful applications, but in high-stakes business settings, unintended actions can cause unacceptable harm, such as privacy breaches and financial loss. Existing mitigations, including training-based methods and neural guardrails, improve agent reliability but cannot provide guarantees. We study symbolic guardrails as a practical path toward strong safety and security guarantees for AI agents. Our three-part study includes a systematic review of 80 state-of-the-art agent safety and security benchmarks to identify the policies they evaluate, an analysis of which policy requirements can be guaranteed by symbolic guardrails, and an evaluation of how symbolic guardrails affect safety, security, and agent success on τ²-Bench, CAR-bench, and MedAgentBench. We find that 85% of benchmarks lack concrete policies, relying instead on underspecified high-level goals or common sense. Among the specified policies, 74% of policy requirements can be enforced by symbolic guardrails, often using simple, low-cost mechanisms. These guardrails improve safety and security without sacrificing agent utility. Overall, our results suggest that symbolic guardrails are a practical and effective way to guarantee some safety and security requirements, especially for domain-specific AI agents. We release all code and artifacts at https://github.com/hyn0027/agent-symbolic-guardrails.
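To make the idea concrete, the sketch below shows one minimal form a symbolic guardrail could take: a deterministic rule check interposed between an agent and its tools, which blocks a proposed tool call that violates a hard policy limit regardless of what the underlying model outputs. The tool name, argument shape, and refund-limit policy are hypothetical illustrations, not taken from the paper's artifacts.

```python
def refund_guardrail(tool_name, args, policy_limit=100.0):
    """Deterministically decide whether a proposed tool call is allowed.

    Returns (allowed, reason). Unlike a neural guardrail, this check is
    symbolic: the same input always yields the same verdict, so the
    policy it encodes is guaranteed to hold.
    """
    if tool_name == "issue_refund" and args.get("amount", 0) > policy_limit:
        return False, f"refund {args['amount']} exceeds limit {policy_limit}"
    return True, "ok"

# The guardrail is consulted before the tool actually executes:
allowed, reason = refund_guardrail("issue_refund", {"amount": 250.0})
# Here the call is blocked (allowed is False), independent of model behavior.
```

A real deployment would enforce many such requirements (e.g., authorization checks or state-machine constraints on action order), but each can remain a simple, low-cost symbolic check of this shape.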