PolicyGuard: A Dialogue-Grounded Sub-Agent Verifier for Policy Adherence in LLM Agents

Abstract

LLM agents handle user requests on behalf of organizations through tool calls and must follow the company policies stated in their system prompts. Prior work treats this as a safeguarding problem, in which external checks block non-compliant agent actions. We argue that policy adherence is a broader problem: real workflows unfold across many turns, require explicit user confirmation and prerequisite reads, and hinge on the content of the dialogue rather than on any single argument value. Meeting this bar requires three capabilities that prior safeguard work has often underestimated: (i) full conversation context, (ii) self-reasoning over the policy and the current dialogue, and (iii) conversation-specific remediation that guides the agent's next turn. We introduce POLICYGUARD, a sub-agent verifier that shares the agent's view of the dialogue, reasons over the policy in context, and provides actionable feedback for the agent's next turn. On tau^2-BENCH airline, across models from three vendors (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro) with four trials per setting, POLICYGUARD improves PASS^4 by +12.0 / +6.0 / +12.0 percentage points, respectively. Per-call analyses show that POLICYGUARD achieves higher policy-violation recall while blocking roughly half as often as argument-level guards.
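The contrast between an argument-level guard and a dialogue-level verifier can be illustrated with a minimal sketch. This is not the POLICYGUARD implementation: the tool names, the `Turn`, `ToolCall`, and `Verdict` types, and the "user must say yes" rule are all hypothetical, and a keyword check stands in for the LLM-based reasoning a real sub-agent verifier would perform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    role: str      # "user", "assistant", or "tool"
    content: str


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Verdict:
    allow: bool
    feedback: Optional[str] = None  # remediation for the agent's next turn


# Hypothetical policy fragment: write actions need an explicit user "yes".
WRITE_TOOLS = {"cancel_reservation", "update_reservation"}


def argument_level_guard(call: ToolCall) -> Verdict:
    """Sees only the proposed call, so it cannot tell whether the user confirmed."""
    if call.name in WRITE_TOOLS and not call.args.get("reservation_id"):
        return Verdict(False, "Missing reservation_id.")
    return Verdict(True)


def dialogue_level_verifier(history: list[Turn], call: ToolCall) -> Verdict:
    """Toy stand-in for an LLM verifier: inspects the whole conversation."""
    if call.name not in WRITE_TOOLS:
        return Verdict(True)
    user_turns = [t for t in history if t.role == "user"]
    # Assumption: confirmation means the latest user turn starts with "yes".
    confirmed = bool(user_turns) and (
        user_turns[-1].content.strip().lower().startswith("yes")
    )
    if confirmed:
        return Verdict(True)
    return Verdict(
        False,
        f"Before calling {call.name}, summarize the change and ask the user "
        "to confirm with 'yes'.",
    )


history = [Turn("user", "Please cancel my booking ABC123.")]
call = ToolCall("cancel_reservation", {"reservation_id": "ABC123"})
print(argument_level_guard(call).allow)              # arguments look valid
print(dialogue_level_verifier(history, call).allow)  # no confirmation yet
```

In this toy setting the argument-level guard accepts the call because its arguments are well-formed, while the dialogue-level check blocks it and returns feedback that the agent can act on in its next turn, which is the kind of conversation-specific remediation the abstract describes.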