VeriGUI: Verifiable Long-Chain GUI Dataset

Abstract

Recent studies have delved into constructing autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, thereby limiting their scalability in real-world GUI applications that demand long-horizon task decomposition and execution. In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions: (1) long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point; and (2) subtask-level verifiability, which enables diverse exploration strategies within each subtask, while ensuring that each subtask-level goal remains verifiable and consistent. The dataset consists of GUI task trajectories across both desktop and web, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.