Workflow-GYM: Toward Long-Horizon Evaluation of Computer-Use Agent Tasks in Real-World Professional Domains

Abstract

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still focus predominantly on general-purpose software, relatively simple applications, and short-horizon tasks. As a result, it remains largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work end to end. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve success rates only slightly above 30%, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain consistency across long-horizon workflows, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.