Workflow-GYM: Long-Horizon Evaluation of Computer-Use Agent Tasks in Real-World Professional Domains

Abstract

In recent years, AI agents have advanced rapidly toward handling increasingly complex real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still focus predominantly on general-purpose software, relatively simple applications, and short-horizon tasks. It therefore remains largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work end to end. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve success rates only slightly above 30%, which shows that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain consistency over long workflows, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.