Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

Abstract

We release Terminal Wrench, a subset of 331 terminal-agent benchmark environments, drawn from popular open benchmarks, that are demonstrably reward-hackable. The dataset includes 3,632 hack trajectories and 2,352 legitimate baseline trajectories across three frontier models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4). Each entry preserves the original task definition alongside full attack trajectories that show how the verifier was bypassed. It also includes cases where the task was not solved as intended. The tasks span system administration, machine learning, software engineering, and security challenges; the exploits range from simple output spoofing to stack-frame introspection, standard-library patching, and rootkit-style binary hijacking. Crucially, these exploits are specific to each task rather than to the evaluation harness, which makes them harder to patch. We also present a monitorability study in which hack trajectories are sanitized or stripped of reasoning traces and then scored by an LLM judge. It shows that detection degrades meaningfully when chain-of-thought is removed (AUC drops from 0.97 to 0.92). The dataset is publicly available at https://github.com/few-sh/terminal-wrench.
