Hardening Agent Benchmarks with an Adversarial Hacker-Fixer Loop

Abstract

Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, which leaves them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find that 323 (16%) can be hacked by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response remains manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms that the patched verifier still admits legitimate solutions. The loop iterates because each patch reshapes what the verifier rewards, which surfaces the next exploit. We further add verifier access and let patches transfer across tasks, broadening the range of exploits the loop discovers. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: a loop run with Gemini 3 Flash drives the KernelBench attack success rates of the stronger Gemini 3.1 Pro and Claude Opus 4.7 from 76% and 61%, respectively, to 0%, and drives that of Gemini 3.1 Pro from 39% to 17% across 77 Terminal Bench tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, together with our patched verifiers, the exploits the loop discovered, and our implementation, as a basis for future work.
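
The control flow of the loop described above can be illustrated with a minimal sketch. It is not the released implementation: every name here (`hacker_fixer_loop`, `LoopResult`, the toy hacker, fixer, and solver) is hypothetical, the three roles are plain Python functions standing in for LLM agents, and the toy fixer simply blacklists the exact exploit string, which a real fixer would presumably not do. The sketch also assumes that a patch is kept only if it rejects the exploit and the solver still passes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical types: a submission is a string, a verifier maps it to pass/fail.
Submission = str
Verifier = Callable[[Submission], bool]
Hacker = Callable[[Verifier], Optional[Submission]]   # returns an exploit or None
Fixer = Callable[[Verifier, Submission], Verifier]    # returns a patched verifier
Solver = Callable[[Verifier], bool]                   # True if a legit solution passes


@dataclass
class LoopResult:
    verifier: Verifier
    exploits: List[Submission] = field(default_factory=list)


def hacker_fixer_loop(verifier: Verifier, hacker: Hacker, fixer: Fixer,
                      solver: Solver, max_rounds: int = 10) -> LoopResult:
    """Alternate hacker, fixer, and solver until no new exploit is found."""
    exploits: List[Submission] = []
    for _ in range(max_rounds):
        exploit = hacker(verifier)
        if exploit is None:
            break  # the hacker found nothing against the current verifier
        patched = fixer(verifier, exploit)
        # Assumption: discard a patch that misses the exploit or breaks
        # legitimate solutions, and stop rather than loop forever.
        if patched(exploit) or not solver(patched):
            break
        verifier = patched
        exploits.append(exploit)
    return LoopResult(verifier, exploits)


# Toy stand-ins for the three agents (illustrative only).
REFERENCE = "sorted:1,2,3"
CHEATS = ["sorted", "sorted:", "sorted:3,2,1"]


def weak_verifier(submission: Submission) -> bool:
    # Brittle check: only looks for a keyword, not for a correct answer.
    return "sorted" in submission


def toy_hacker(verifier: Verifier) -> Optional[Submission]:
    # Return the first known cheat the current verifier still accepts.
    return next((c for c in CHEATS if verifier(c)), None)


def toy_fixer(verifier: Verifier, exploit: Submission) -> Verifier:
    # Wrap the old verifier so that it also rejects this exact exploit.
    return lambda s: s != exploit and verifier(s)


def toy_solver(verifier: Verifier) -> bool:
    return verifier(REFERENCE)


result = hacker_fixer_loop(weak_verifier, toy_hacker, toy_fixer, toy_solver)
```

In this toy run each patch exposes the next cheat in the list, mirroring the claim that every patch surfaces a further exploit. The loop ends once the hacker returns `None`, and the reference solution still passes the final verifier.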