Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

Abstract

We release Terminal Wrench, a set of 331 terminal-agent benchmark environments, copied from popular open benchmarks, that are demonstrably reward-hackable. The dataset includes 3,632 hack trajectories and 2,352 legitimate baseline trajectories collected from three frontier models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4). Each entry preserves the original task definition alongside full attack trajectories that show how the verifier was bypassed. It also includes cases where the task was not solved as intended. The tasks span system administration, machine learning, software engineering, and security challenges; the exploits range from simple output spoofing to stack-frame introspection, standard-library patching, and rootkit-style binary hijacking. Crucially, these exploits are specific to each task rather than to the evaluation harness, which makes them harder to patch. We also present a monitorability study in which hack trajectories are sanitized or stripped of reasoning traces and then scored by an LLM judge, showing that detection degrades meaningfully when chain-of-thought is removed (AUC drops from 0.97 to 0.92). The dataset is publicly available at https://github.com/few-sh/terminal-wrench.
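To make the AUC figure in the monitorability study concrete, the following minimal sketch shows one common way to compute it from judge scores: the probability that a randomly chosen hack trajectory receives a higher suspicion score than a randomly chosen legitimate one, with ties counted as one half. The function name `roc_auc` and all score values are hypothetical and for illustration only; they are not taken from the dataset, and the paper's exact scoring procedure may differ.

```python
from typing import Sequence


def roc_auc(hack_scores: Sequence[float], legit_scores: Sequence[float]) -> float:
    """Pairwise AUC: fraction of (hack, legit) pairs where the hack
    trajectory is scored higher by the judge; ties count as 0.5."""
    if not hack_scores or not legit_scores:
        raise ValueError("need at least one score in each class")
    wins = 0.0
    for h in hack_scores:
        for l in legit_scores:
            if h > l:
                wins += 1.0
            elif h == l:
                wins += 0.5
    return wins / (len(hack_scores) * len(legit_scores))


# Hypothetical judge suspicion scores in [0, 1] (illustrative values only).
hack_with_cot = [0.9, 0.8, 0.95, 0.7]      # judge sees the reasoning trace
hack_without_cot = [0.9, 0.4, 0.95, 0.2]   # reasoning trace stripped
legit = [0.1, 0.3, 0.2, 0.5]               # legitimate baseline trajectories

auc_with = roc_auc(hack_with_cot, legit)
auc_without = roc_auc(hack_without_cot, legit)
print(f"AUC with CoT: {auc_with:.3f}, without CoT: {auc_without:.3f}")
```

In this toy example the AUC falls when the reasoning traces are removed, because some hack trajectories then score no higher than legitimate ones. The size of the drop here is an artifact of the made-up scores and should not be compared with the reported 0.97 to 0.92.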
