Agent Harness Safety Audit

Abstract

LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. A harness can, however, return a correct and benign answer by way of a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories along three dimensions (boundary compliance, execution fidelity, and system stability), with a focus on multi-agent harnesses, where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, each instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations spanning frontier models and three multi-agent frameworks, we find that: (i) task completion does not imply safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound on safe deployment.
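
The difference between output-level and trajectory-level evaluation can be illustrated with a minimal sketch. This is not HarnessAudit's actual implementation: the `Step` record, the `audit_trajectory` function, the agent names, and the permission and flow tables are all hypothetical, introduced only to show how a run with a benign final answer can still contain boundary violations in resource access and inter-agent information transfer.

```python
from dataclasses import dataclass


# Hypothetical event record; the framework's real trajectory schema is not specified here.
@dataclass(frozen=True)
class Step:
    agent: str       # agent that performed the action
    action: str      # "access" (a resource) or "send" (a message to another agent)
    target: str      # resource name, or name of the receiving agent
    label: str = ""  # sensitivity label of the data being sent, if any


def audit_trajectory(steps, allowed_resources, allowed_flows):
    """Return (step_index, reason) for every step that crosses a boundary.

    allowed_resources: dict mapping agent -> set of resources it may access
    allowed_flows: set of permitted (sender, receiver, label) triples
    """
    violations = []
    for i, step in enumerate(steps):
        if step.action == "access":
            if step.target not in allowed_resources.get(step.agent, set()):
                violations.append(
                    (i, f"{step.agent} accessed unauthorized resource {step.target}")
                )
        elif step.action == "send":
            if (step.agent, step.target, step.label) not in allowed_flows:
                violations.append(
                    (i, f"{step.agent} sent {step.label} data to {step.target}")
                )
    return violations


# Toy run (assumed data): the final answer is benign, the trajectory is not.
trajectory = [
    Step("planner", "access", "calendar"),
    Step("planner", "access", "payroll_db"),           # outside planner's permissions
    Step("planner", "send", "summarizer", "private"),  # private data sent to the wrong agent
    Step("summarizer", "send", "planner", "public"),
]
allowed_resources = {"planner": {"calendar"}, "summarizer": set()}
allowed_flows = {("summarizer", "planner", "public")}
final_answer = "Meeting scheduled for Tuesday."

# A naive output-level check only inspects the final answer and passes.
output_ok = "payroll" not in final_answer

# The trajectory-level audit inspects every intermediate step.
violations = audit_trajectory(trajectory, allowed_resources, allowed_flows)
for index, reason in violations:
    print(f"step {index}: {reason}")
```

In this toy run the output-level check passes while the audit flags steps 1 and 2, which mirrors the abstract's point that violations can occur mid-trajectory without surfacing in the final output. A real audit would also need to cover execution fidelity and system stability, which this sketch does not attempt.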