Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Abstract

Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find that 323 (16%) are hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms that the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access and let patches transfer across tasks to broaden the exploits the loop discovers. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash's loop drives the attack success rate of the stronger Gemini 3.1 Pro and Claude Opus 4.7 from 76% and 61%, respectively, to 0% on KernelBench, and that of Gemini 3.1 Pro from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, along with our patched verifiers, the exploits the loop discovered, and our implementation, as a basis for future work.
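
The alternation of hacker, fixer, and solver described above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the `Verifier` class, the `hacker_fixer_loop` function, and the toy agents are hypothetical names introduced here, and the three LLM agents are replaced by plain Python callables operating on a string-valued toy task.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Submission = str
Check = Callable[[Submission], bool]


@dataclass
class Verifier:
    """Hypothetical outcome verifier: a submission passes if every check accepts it."""
    checks: List[Check] = field(default_factory=list)

    def passes(self, submission: Submission) -> bool:
        return all(check(submission) for check in self.checks)


def hacker_fixer_loop(
    verifier: Verifier,
    hacker: Callable[[Verifier], Optional[Submission]],
    fixer: Callable[[Verifier, Submission], Verifier],
    solver: Callable[[Verifier], Submission],
    max_rounds: int = 10,
) -> Tuple[Verifier, List[Submission]]:
    """Alternate hacker, fixer, and solver until the hacker finds no exploit."""
    exploits: List[Submission] = []
    for _ in range(max_rounds):
        exploit = hacker(verifier)  # try to pass without solving the task
        if exploit is None:
            break  # no exploit found against the current verifier
        exploits.append(exploit)
        candidate = fixer(verifier, exploit)  # patch the verifier
        rejects_exploit = not candidate.passes(exploit)
        admits_solution = candidate.passes(solver(candidate))
        if not (rejects_exploit and admits_solution):
            break  # assumption: this sketch simply stops on a bad patch
        verifier = candidate  # the patched verifier is the next round's target
    return verifier, exploits


# Toy stand-ins for the three agents (assumptions for illustration only).
REFERENCE = "sorted:1,2,3"
CANDIDATE_HACKS = ["sorted:", "sorted:3,2,1"]


def toy_hacker(verifier: Verifier) -> Optional[Submission]:
    # Return the first known shortcut that the current verifier still accepts.
    for hack in CANDIDATE_HACKS:
        if verifier.passes(hack):
            return hack
    return None


def toy_fixer(verifier: Verifier, exploit: Submission) -> Verifier:
    # Naive patch: add a check that rejects exactly this exploit.
    return Verifier(verifier.checks + [lambda s, bad=exploit: s != bad])


def toy_solver(verifier: Verifier) -> Submission:
    # A legitimate solution to the toy task.
    return REFERENCE


initial = Verifier([lambda s: s.startswith("sorted:")])
hardened, found = hacker_fixer_loop(initial, toy_hacker, toy_fixer, toy_solver)
```

In this toy run the first patch closes one shortcut and the hacker then falls back to the next one, which is a small-scale version of how each patch surfaces a further exploit. A real fixer would be expected to generalize beyond rejecting a single exploit string; the exact-match patch here is only the simplest thing that makes the sketch runnable.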