Auditing Agent Harness Safety
May 14, 2026
Authors: Chengzhi Liu, Yichen Guo, Yepeng Liu, Yuzhe Yang, Qianqi Yan, Xuandong Zhao, Wenyue Hua, Sheng Liu, Sharon Li, Yuheng Bu, Xin Eric Wang
cs.AI
Abstract
LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.
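To make the distinction between output-level and trajectory-level evaluation concrete, the following is a minimal sketch, not HarnessAudit's implementation. The `Step` record, the `audit_trajectory` function, the two permission maps, and all agent and resource names are hypothetical and chosen only for illustration. The sketch covers just two of the violation types named in the abstract, unauthorized resource access and inter-agent information transfer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded event in an execution trajectory (hypothetical schema)."""
    agent: str   # agent that performed the step
    action: str  # "access" (touch a resource) or "send" (message another agent)
    target: str  # resource name for "access", recipient agent for "send"


def audit_trajectory(steps, allowed_resources, allowed_recipients):
    """Return (step index, description) for every step that breaks a constraint."""
    violations = []
    for i, step in enumerate(steps):
        if step.action == "access":
            # Permission boundary: the agent may only touch its own resources.
            if step.target not in allowed_resources.get(step.agent, set()):
                violations.append(
                    (i, f"{step.agent} accessed unauthorized resource {step.target}")
                )
        elif step.action == "send":
            # Information-flow constraint: context may only go to listed agents.
            if step.target not in allowed_recipients.get(step.agent, set()):
                violations.append(
                    (i, f"{step.agent} sent context to unapproved agent {step.target}")
                )
    return violations


# Hypothetical permissions and trajectory, invented for this example.
allowed_resources = {"planner": {"calendar"}, "writer": {"drafts"}}
allowed_recipients = {"planner": {"writer"}, "writer": {"planner"}}

trajectory = [
    Step("planner", "access", "calendar"),
    Step("planner", "access", "payroll_db"),   # outside the planner's permissions
    Step("planner", "send", "writer"),
    Step("writer", "send", "billing_agent"),   # context goes to the wrong agent
    Step("writer", "access", "drafts"),
]

final_answer_ok = True  # stand-in for an output-level check that passes
violations = audit_trajectory(trajectory, allowed_resources, allowed_recipients)

for index, description in violations:
    print(f"step {index}: {description}")
```

In this toy run the final answer is marked acceptable, yet the audit flags two mid-trajectory steps. That is the kind of failure an output-only score would miss.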