Auditing Agent Harness Safety
May 14, 2026
Authors: Chengzhi Liu, Yichen Guo, Yepeng Liu, Yuzhe Yang, Qianqi Yan, Xuandong Zhao, Wenyue Hua, Sheng Liu, Sharon Li, Yuheng Bu, Xin Eric Wang
cs.AI
Abstract
LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.
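To make the gap between output-level and trajectory-level evaluation concrete, here is a minimal sketch. It is not HarnessAudit's implementation: the `Step` and `Policy` types, the action names, the violation labels, and the example agents and resources are all hypothetical, chosen only to show how a run can pass an output check while violating permission and information-flow constraints mid-trajectory.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded harness event (hypothetical schema)."""
    agent: str
    action: str   # "read" (resource access) or "send" (inter-agent message)
    target: str   # resource name for "read", recipient agent for "send"


@dataclass
class Policy:
    """Hypothetical permission boundaries, keyed by agent name."""
    readable: dict = field(default_factory=dict)     # agent -> allowed resources
    may_message: dict = field(default_factory=dict)  # agent -> allowed recipients


def output_level_check(final_answer: str, expected: str) -> bool:
    """Scores only the final output, as output-level evaluation does."""
    return final_answer == expected


def audit_trajectory(steps: list, policy: Policy) -> list:
    """Returns (step_index, violation_label) for every step that breaks the policy."""
    violations = []
    for i, step in enumerate(steps):
        if step.action == "read":
            if step.target not in policy.readable.get(step.agent, set()):
                violations.append((i, "unauthorized_resource_access"))
        elif step.action == "send":
            if step.target not in policy.may_message.get(step.agent, set()):
                violations.append((i, "unauthorized_information_flow"))
    return violations


# Illustrative run: the final answer is correct, but two steps cross a boundary.
policy = Policy(
    readable={"planner": {"calendar"}},
    may_message={"planner": {"summarizer"}},
)
trajectory = [
    Step("planner", "read", "calendar"),        # allowed
    Step("planner", "read", "payroll_db"),      # resource outside the planner's scope
    Step("planner", "send", "summarizer"),      # allowed
    Step("planner", "send", "external_agent"),  # context sent to the wrong agent
]

passed_output_check = output_level_check("Meeting at 3pm", "Meeting at 3pm")
violations = audit_trajectory(trajectory, policy)

print(passed_output_check)  # True
print(violations)
```

In this toy run the output check passes while the audit flags steps 1 and 3, which is the kind of mismatch between task completion and safe execution that the abstract describes.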