VeriGUI: Verifiable Long-Chain GUI Dataset

Abstract

Recent studies have delved into constructing autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, thereby limiting their scalability in real-world GUI applications that demand long-horizon task decomposition and execution. In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions: (1) long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point; and (2) subtask-level verifiability, which enables diverse exploration strategies within each subtask, while ensuring that each subtask-level goal remains verifiable and consistent. The dataset consists of GUI task trajectories across both desktop and web, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.