VeriGUI: Verifiable Long-Chain GUI Dataset

Abstract

Recent studies have delved into constructing autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, thereby limiting their scalability in real-world GUI applications that demand long-horizon task decomposition and execution. In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions: (1) long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point; and (2) subtask-level verifiability, which enables diverse exploration strategies within each subtask, while ensuring that each subtask-level goal remains verifiable and consistent. The dataset consists of GUI task trajectories across both desktop and web, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.
