Workflow-GYM: Toward Long-Horizon Evaluation of Computer-Use Agent Tasks in Real Professional Domains

Abstract

In recent years, AI agents have evolved rapidly toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces (GUIs) to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still focus predominantly on general-purpose software, relatively simple applications, and short-horizon tasks. As a result, it remains largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work end to end. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve success rates only slightly above 30%, which shows that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain consistency over long-horizon workflows and frequently exhibit workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.