TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
May 21, 2026
Authors: Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li, Chao Peng, Peter O'Hearn, Earl T. Barr, Mark Harman, Federica Sarro, He Ye
cs.AI
Abstract
We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.
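The two summary statistics quoted in the abstract, a pass rate and a Pearson correlation between benchmark scores, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names `pass_rate` and `pearson_r` are hypothetical, and any inputs passed to them here are placeholders rather than data from the paper.

```python
from math import sqrt
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of tasks whose outcome is True (passed).

    Hypothetical helper: `outcomes` holds one boolean per evaluated task.
    """
    if not outcomes:
        raise ValueError("no task outcomes given")
    return 100.0 * sum(outcomes) / len(outcomes)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Return the Pearson correlation coefficient between two score lists.

    Hypothetical helper: xs[i] and ys[i] are the scores of the same system
    on two different benchmarks.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)
```

As a purely arithmetic illustration, 125 passes out of 200 tasks would give `pass_rate` a value of 62.5. A value of `pearson_r` near 0, such as the reported r=0.20, means that ranking systems by one benchmark says little about their ranking on the other.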