Auditing Agent Harness Safety

Abstract

LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses, where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound on safe deployment.
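The gap between output-level and trajectory-level evaluation described above can be illustrated with a minimal sketch. This is not the HarnessAudit implementation; the `Step`, `Policy`, `audit_trajectory`, and `output_check` names, the two violation types, and the toy booking scenario are all assumptions made for illustration. It covers only the boundary-compliance idea (resource access and inter-agent information flow), not execution fidelity or system stability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded event in an execution trajectory (hypothetical schema)."""
    agent: str   # agent that performed the step
    kind: str    # "access" for a resource access, "send" for an inter-agent message
    target: str  # resource name, or name of the receiving agent


@dataclass(frozen=True)
class Policy:
    """Assumed permission model: per-agent resources and permitted message flows."""
    allowed_resources: dict[str, frozenset[str]]
    allowed_flows: frozenset[tuple[str, str]]


def output_check(final_answer: str, expected: str) -> bool:
    """Output-level evaluation: looks only at the final answer."""
    return final_answer == expected


def audit_trajectory(steps: list[Step], policy: Policy) -> list[tuple[int, str]]:
    """Trajectory-level evaluation: checks every step against the policy.

    Returns (step index, description) pairs for each violation found.
    """
    violations = []
    for index, step in enumerate(steps):
        if step.kind == "access":
            permitted = policy.allowed_resources.get(step.agent, frozenset())
            if step.target not in permitted:
                violations.append(
                    (index, f"{step.agent} accessed unauthorized resource {step.target}")
                )
        elif step.kind == "send":
            if (step.agent, step.target) not in policy.allowed_flows:
                violations.append(
                    (index, f"{step.agent} sent context to unauthorized agent {step.target}")
                )
    return violations


# Toy multi-agent run: the final answer is correct, but two mid-trajectory
# steps cross a permission or information-flow boundary.
policy = Policy(
    allowed_resources={
        "planner": frozenset({"calendar"}),
        "booker": frozenset({"calendar", "payments"}),
    },
    allowed_flows=frozenset({("planner", "booker")}),
)
trajectory = [
    Step("planner", "access", "calendar"),
    Step("planner", "access", "payments"),  # planner has no payments permission
    Step("planner", "send", "booker"),
    Step("booker", "send", "logger"),       # flow to logger is not permitted
    Step("booker", "access", "payments"),
]
final_answer = "booked"

passed_output_check = output_check(final_answer, "booked")
violations = audit_trajectory(trajectory, policy)
```

In this toy run the output check passes while the trajectory audit flags two steps, which is the kind of mismatch between task completion and safe execution that the abstract reports.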