ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning

Abstract

Autonomous agents powered by foundation models have seen widespread adoption across various real-world applications. However, they remain highly vulnerable to malicious instructions and attacks, which can result in severe consequences such as privacy breaches and financial losses. More critically, existing guardrails for LLMs are not applicable due to the complex and dynamic nature of agents. To tackle these challenges, we propose ShieldAgent, the first guardrail agent designed to enforce explicit safety policy compliance for the action trajectory of other protected agents through logical reasoning. Specifically, ShieldAgent first constructs a safety policy model by extracting verifiable rules from policy documents and structuring them into a set of action-based probabilistic rule circuits. Given the action trajectory of the protected agent, ShieldAgent retrieves the relevant rule circuits and generates a shielding plan, leveraging its comprehensive tool library and executable code for formal verification. In addition, given the lack of guardrail benchmarks for agents, we introduce ShieldAgent-Bench, a dataset of 3K safety-related pairs of agent instructions and action trajectories, collected via state-of-the-art attacks across 6 web environments and 7 risk categories. Experiments show that ShieldAgent achieves state-of-the-art performance on ShieldAgent-Bench and three existing benchmarks, outperforming prior methods by 11.3% on average with a high recall of 90.1%. Additionally, ShieldAgent reduces API queries by 64.7% and inference time by 58.2%, demonstrating its high precision and efficiency in safeguarding agents.
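The idea of grouping rules by the action they govern and checking a trajectory against them can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Rule`, `Action`, `POLICY`, `shield`), the example rules, the rule weights, and the simple threshold check are hypothetical assumptions chosen for illustration. The probabilistic rule circuits and formal verification described in the abstract are considerably richer than this.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


# Hypothetical types: none of these names come from the ShieldAgent paper.
@dataclass
class Rule:
    description: str
    weight: float  # assumed confidence in [0, 1] attached to the rule
    is_violated: Callable[[dict], bool]  # predicate over the action's arguments


@dataclass
class Action:
    kind: str
    args: dict


# Toy "policy model": rules grouped by the action type they govern.
# Both rules are invented examples, not rules from any real policy document.
POLICY: Dict[str, List[Rule]] = {
    "send_message": [
        Rule("Do not share passwords", 0.9,
             lambda a: "password" in a.get("text", "").lower()),
    ],
    "purchase": [
        Rule("Purchases above 100 need confirmation", 0.8,
             lambda a: a.get("amount", 0) > 100 and not a.get("confirmed", False)),
    ],
}


def shield(trajectory: List[Action], threshold: float = 0.5) -> List[Tuple[int, str]]:
    """Return (step index, rule description) for every violated rule.

    Only the rules registered for an action's kind are retrieved, and
    rules whose weight falls below the assumed threshold are ignored.
    """
    violations = []
    for step, action in enumerate(trajectory):
        for rule in POLICY.get(action.kind, []):
            if rule.weight >= threshold and rule.is_violated(action.args):
                violations.append((step, rule.description))
    return violations
```

Calling `shield` on a list of `Action` objects returns the steps that break a retrieved rule. An action kind with no registered rules, such as a plain search, is passed through without any checks.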
