VeriGUI: Verifiable Long-Chain GUI Dataset

Abstract

Recent studies have explored building autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, which limits their scalability to real-world GUI applications that demand long-horizon task decomposition and execution. In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions: (1) long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point; and (2) subtask-level verifiability, which enables diverse exploration strategies within each subtask while ensuring that each subtask-level goal remains verifiable and consistent. The dataset consists of GUI task trajectories across both desktop and web environments, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.

