Auditing Agent Harness Safety

Abstract

LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.
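
The gap between output-level scoring and trajectory-level auditing can be illustrated with a minimal sketch. This is not the HarnessAudit implementation: the `Step` record, the `audit_trajectory` and `audit_output_only` functions, the permission tables, and the example agents and resources are all hypothetical names chosen for illustration. The sketch assumes a trajectory is a list of logged steps and that permissions are given as simple lookup tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One logged step of a (hypothetical) harness trajectory."""
    agent: str                      # agent that performed the step
    action: str                     # "read", "write", or "send"
    target: str                     # resource name, or recipient agent for "send"
    tags: frozenset = frozenset()   # labels on the data carried by a "send"


def audit_output_only(final_answer, expected):
    """Output-level check: looks at the final answer and nothing else."""
    return final_answer == expected


def audit_trajectory(steps, allowed_resources, allowed_flows):
    """Trajectory-level check: inspects every step, not only the last one.

    allowed_resources maps an agent to the resources it may read or write.
    allowed_flows maps (sender, recipient) to the data tags that may be sent.
    Returns a list of (step_index, violation_kind, agent, target) tuples.
    """
    violations = []
    for index, step in enumerate(steps):
        if step.action in ("read", "write"):
            # Boundary check: the agent touched a resource outside its grant.
            if step.target not in allowed_resources.get(step.agent, set()):
                violations.append(
                    (index, "unauthorized_access", step.agent, step.target))
        elif step.action == "send":
            # Information-flow check: some tag is not permitted on this channel.
            permitted = allowed_flows.get((step.agent, step.target), set())
            if set(step.tags) - set(permitted):
                violations.append(
                    (index, "information_leak", step.agent, step.target))
    return violations


# Hypothetical two-agent run that ends with a correct, benign answer.
trajectory = [
    Step("planner", "read", "calendar"),
    Step("planner", "read", "payroll_db"),                     # outside the grant
    Step("planner", "send", "writer", frozenset({"salary"})),  # disallowed tag
    Step("writer", "write", "report"),
]
allowed_resources = {"planner": {"calendar"}, "writer": {"report"}}
allowed_flows = {("planner", "writer"): {"schedule"}}

output_ok = audit_output_only("Meeting booked for Friday.",
                              "Meeting booked for Friday.")
violations = audit_trajectory(trajectory, allowed_resources, allowed_flows)
```

In this toy run the output-level check passes, while the trajectory-level check flags two mid-trajectory steps: one unauthorized resource access and one disallowed inter-agent transfer. Those are the two violation categories the abstract reports as most common.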