TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

Abstract

We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.