Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Abstract

Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, which leaves them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find that 323 (16%) can be hacked by frontier models given only the task description. Such exploits corrupt both leaderboard rankings and RL training signal, yet the standard response remains manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms that the patched verifier still admits legitimate solutions. The loop iterates because each patch reshapes what the verifier rewards, which surfaces the next exploit. To broaden the range of exploits the loop discovers, we further add verifier access and let patches transfer across tasks. On KernelBench, the loop drives the attack success rate on a held-out corpus of publicly reported exploits from 62% to 0%. We also find that weaker agents in the loop can defend against much stronger hackers. A loop run with Gemini 3 Flash drives the attack success rates of the stronger Gemini 3.1 Pro and Claude Opus 4.7 on KernelBench from 76% and 61%, respectively, to 0%, and drives that of Gemini 3.1 Pro on 77 Terminal Bench tasks from 39% to 17%. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, together with our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.
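
The control flow of the hacker-fixer loop can be illustrated with a small, self-contained toy. This is only a sketch under stated assumptions, not the released implementation: the three LLM agents are replaced by hand-written stand-ins, the task is "sort a list", and every name here (`CASES`, `EXPLOITS`, `PATCHES`, `hacker`, `fixer`, `solver`, `hacker_fixer_loop`) is hypothetical. The verifier is modeled as a list of checks, and a patch is accepted only if it rejects the discovered exploit while still admitting a legitimate solution.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# All names below are hypothetical stand-ins chosen for illustration.
Submission = Callable[[List[int]], List[int]]
Check = Callable[[List[int], List[int]], bool]  # (input, output) -> passes?

CASES: List[List[int]] = [[3, 1, 2], [5, 4, 4, 0]]  # toy task: sort each list


def passes(checks: List[Check], sub: Submission) -> bool:
    """A verifier is a list of checks; a submission must pass all of them."""
    return all(chk(x, sub(list(x))) for chk in checks for x in CASES)


def same_length(x: List[int], y: List[int]) -> bool:
    return len(x) == len(y)


def is_sorted(x: List[int], y: List[int]) -> bool:
    return all(a <= b for a, b in zip(y, y[1:]))


def same_items(x: List[int], y: List[int]) -> bool:
    return Counter(x) == Counter(y)


# Stand-in for the hacker's search space: submissions that do not sort.
EXPLOITS: List[Submission] = [lambda x: x, lambda x: [0] * len(x)]
# Stand-in for the patches the fixer is able to propose.
PATCHES: List[Check] = [is_sorted, same_items]


def hacker(checks: List[Check]) -> Optional[Submission]:
    """Return some non-solution that the current verifier accepts, if any."""
    return next((e for e in EXPLOITS if passes(checks, e)), None)


def fixer(checks: List[Check], exploit: Submission) -> List[Check]:
    """Add the first candidate check that makes the verifier reject the exploit."""
    for patch in PATCHES:
        if not passes(checks + [patch], exploit):
            return checks + [patch]
    return checks  # no candidate patch helps


def solver(checks: List[Check]) -> bool:
    """Confirm that a legitimate solution still passes the verifier."""
    return passes(checks, sorted)


def hacker_fixer_loop(
    checks: List[Check], max_rounds: int = 10
) -> Tuple[List[Check], List[Submission]]:
    found: List[Submission] = []
    for _ in range(max_rounds):
        exploit = hacker(checks)
        if exploit is None:
            break  # the hacker found nothing within its search space
        found.append(exploit)
        patched = fixer(checks, exploit)
        if passes(patched, exploit) or not solver(patched):
            break  # no acceptable patch: exploit survives or solutions break
        checks = patched
    return checks, found


hardened, found = hacker_fixer_loop([same_length])
```

Starting from a verifier that only checks output length, the first patch (a sortedness check) is what exposes the second exploit (an all-zeros output), which mirrors the abstract's point that each patch reshapes what the verifier rewards. In the toy, stopping means only that this fixed exploit pool is exhausted, not that the verifier is exploit-free.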