TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
May 21, 2026
Authors: Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li, Chao Peng, Peter O'Hearn, Earl T. Barr, Mark Harman, Federica Sarro, He Ye
cs.AI
Abstract
We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.
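To make the two headline statistics concrete, the following is a minimal sketch of how a pass rate and a Pearson correlation coefficient can be computed. It is not taken from the TerminalWorld code base: the function names `pass_rate` and `pearson_r` are hypothetical, and the sketch assumes the pass rate is a simple fraction of passed tasks and that the correlation is taken over paired per-system scores on two benchmarks.

```python
from math import sqrt


def pass_rate(results):
    """Return the fraction of tasks passed, given one boolean per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two paired score lists.

    Assumption: xs[i] and ys[i] are the scores of the same system
    on two different benchmarks.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

Under the simple-fraction assumption, a 62.5% pass rate on the 200 Verified tasks corresponds to 125 passed tasks, since `pass_rate([True] * 125 + [False] * 75)` evaluates to `0.625`. A value of r near 0, such as the reported 0.20, indicates that a system's score on one benchmark says little about its score on the other.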