敵対的ハッカー・フィクサー・ループによるエージェントベンチマークの強化

要旨

エージェントベンチマークは、通常手動で作成され脆弱な成果検証器を用いてスコアを評価するため、報酬ハッキングの余地が残されている。我々は5つのターミナルエージェントベンチマークにわたる1,968のタスクを監査し、323件(16%)が最前線モデルに対してタスク説明のみでハッキング可能であることを発見した。これはリーダーボードの順位と強化学習の学習信号の両方を損なうが、標準的な対応は手動かつ事後的である。我々は、タスクごとの手動修正を必要としない、耐エクスプロイト検証器を構築する手法であるハッカー・フィクサーループを導入する。このループは3つのLLMエージェントを交互に動作させる。ハッカーはタスクを解かずに検証器を通過しようと試み、フィクサーは発見されたエクスプロイトを拒否するよう検証器にパッチを適用し、ソルバーはパッチ適用後の検証器が正当な解を依然として受理することを確認する。このループは反復される。各パッチは検証器が報酬を与える対象を再形成し、次のエクスプロイトを表面化させる。さらに、検証器へのアクセス権を追加し、パッチをタスク間で転送可能にすることで、ループが発見するエクスプロイトの範囲を拡大する。 KernelBenchでは、このループにより、公開報告されたエクスプロイトのホールドアウトコーパスにおいて攻撃成功率が62%から0%に低下した。また、ループ内でより弱いエージェントでも、はるかに強力なハッカーに対して防御可能であることが分かった。Gemini 3 Flashのループは、より強力なGemini 3.1 ProとClaude Opus 4.7の攻撃成功率をKernelBenchでそれぞれ76%と61%から0%に低下させ、Gemini 3.1 Proの攻撃成功率はTerminal Bench上の77タスクで39%から17%に低下した。我々は、現在の攻撃対象領域のスナップショットとしてTerminal Wrench（323のハッキング可能環境、3,632のハッキング軌跡）、パッチ適用済み検証器、ループが発見したエクスプロイト、および将来の研究の基盤としての実装を公開する。

English

Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access, and let patches transfer across tasks, to broaden the exploits the loop discovers. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash's loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7's attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro's from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.