Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Abstract

Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find that frontier models can hack 323 of them (16%) given only the task description. Such hacks corrupt both leaderboard rankings and RL training signal, yet the standard response remains manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms that the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. To broaden the exploits the loop discovers, we further add verifier access and let patches transfer across tasks. On KernelBench, the loop drives the attack success rate on a held-out corpus of publicly reported exploits from 62% to 0%. We also find that weaker agents in the loop can defend against much stronger hackers: a loop run with Gemini 3 Flash drives the attack success rates of the stronger Gemini 3.1 Pro and Claude Opus 4.7 on KernelBench from 76% and 61%, respectively, to 0%, and that of Gemini 3.1 Pro on Terminal Bench from 39% to 17% across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, together with our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.
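
The control flow of the loop described above can be illustrated with a small toy sketch. This is not the released implementation: the three agents are LLMs in the paper, whereas here they are replaced by hand-written stand-ins, and every name (`verify`, `hacker`, `fixer`, `solver`, `hacker_fixer_loop`, `CANDIDATE_PATCHES`) as well as the toy task (sorting a list of integers) is a hypothetical chosen for illustration. The verifier is modeled as a list of checks, and a patch is assumed to be one extra check appended to that list.

```python
from typing import Callable, List, Optional, Tuple

# A check receives (task_input, submission) and returns True if it passes.
Check = Callable[[List[int], List[int]], bool]


def verify(checks: List[Check], task_input: List[int], submission: List[int]) -> bool:
    """The verifier accepts a submission only if every check passes."""
    return all(check(task_input, submission) for check in checks)


def solver(task_input: List[int]) -> List[int]:
    """Stand-in for the solver agent: produces a legitimate solution."""
    return sorted(task_input)


def hacker(checks: List[Check], task_input: List[int]) -> Optional[List[int]]:
    """Stand-in for the hacker agent: tries cheap outputs that do not solve the task."""
    candidates = [
        list(task_input),        # echo the input unchanged
        [0] * len(task_input),   # constant output of the right length
    ]
    for cand in candidates:
        is_real_solution = cand == sorted(task_input)
        if not is_real_solution and verify(checks, task_input, cand):
            return cand
    return None


# Hypothetical pool of patches the toy fixer can choose from.
CANDIDATE_PATCHES: List[Check] = [
    lambda inp, out: all(a <= b for a, b in zip(out, out[1:])),  # output is sorted
    lambda inp, out: sorted(inp) == sorted(out),                 # same elements as input
]


def fixer(checks: List[Check], task_input: List[int], exploit: List[int]) -> Optional[List[Check]]:
    """Stand-in for the fixer agent: adds the first check that rejects the exploit."""
    for patch in CANDIDATE_PATCHES:
        if not patch(task_input, exploit):
            return checks + [patch]
    return None


def hacker_fixer_loop(
    checks: List[Check], task_input: List[int], max_rounds: int = 10
) -> Tuple[List[Check], List[List[int]]]:
    """Alternate hacker, fixer, and solver until no exploit is found."""
    exploits: List[List[int]] = []
    for _ in range(max_rounds):
        exploit = hacker(checks, task_input)
        if exploit is None:
            break  # the hacker found nothing against the current verifier
        exploits.append(exploit)
        patched = fixer(checks, task_input, exploit)
        # Keep a patch only if the legitimate solution still passes.
        if patched is None or not verify(patched, task_input, solver(task_input)):
            break
        checks = patched
    return checks, exploits


task_input = [3, 1, 2]
initial_checks: List[Check] = [lambda inp, out: len(out) == len(inp)]  # brittle: length only
hardened, found = hacker_fixer_loop(initial_checks, task_input)
print(found)  # [[3, 1, 2], [0, 0, 0]]
```

In this toy run the second exploit only becomes relevant after the first patch is applied, which mirrors the point that each patch reshapes what the verifier rewards. The solver step guards against patches that would reject legitimate solutions.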