Auditeren van de veiligheid van agentharnassen

Samenvatting

LLM-agenten worden steeds vaker uitgevoerd binnen uitvoeringsharnassen die tools verzenden, resources toewijzen en berichten routeren tussen gespecialiseerde componenten. Een harnas kan echter een correct, goedaardig antwoord retourneren over een traject dat toegang krijgt tot onbevoegde bronnen of context lekt naar de verkeerde agent. Evaluatie op outputniveau kan deze fouten niet zien, maar de meeste veiligheidsbenchmarks scoren alleen eindoutputs of terminale toestanden, hoewel veel schendingen halverwege het traject plaatsvinden in plaats van aan het einde. De centrale vraag is of het harnas de gebruikersintentie, toestemmingsgrenzen en informatiestroombeperkingen gedurende de gehele uitvoering respecteert. Om deze kloof te overbruggen, stellen we HarnessAudit voor, een framework dat volledige uitvoeringstrajecten auditeert op naleving van grenzen, uitvoeringsgetrouwheid en systeemstabiliteit, met een focus op multi-agent harnassen waar deze risico's het meest uitgesproken zijn. We introduceren verder HarnessAudit-Bench, een benchmark van 210 taken uit acht domeinen uit de echte wereld, geïnstantieerd in zowel single-agent als multi-agent configuraties met ingebedde veiligheidsbeperkingen. Door tien harnasconfiguraties te evalueren over frontier-modellen en drie multi-agent frameworks, vinden we dat: (i) taakvoltooiing niet is afgestemd op veilige uitvoering, en schendingen nemen toe met de trajectlengte; (ii) veiligheidsrisico's variëren per domein, taaktype en agentrol; (iii) de meeste schendingen concentreren zich in toegang tot bronnen en inter-agent informatieoverdracht; en (iv) multi-agent samenwerking vergroot het veiligheidsrisicooppervlak, terwijl harnasontwerp de bovengrens van veilige inzet bepaalt.

English

LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.