GTA1: GUI Test-time Scaling Agent

Abstract

Graphical user interface (GUI) agents operate autonomously across platforms (e.g., Linux) and complete tasks by interacting with visual elements. Specifically, a user instruction is decomposed into a sequence of action proposals, each corresponding to one interaction with the GUI. After each action, the agent observes the updated GUI environment to plan the next step. Two main challenges arise: i) resolving ambiguity in task planning (i.e., the sequence of action proposals), where selecting an appropriate plan is non-trivial because many valid plans may exist; ii) accurately grounding actions in complex, high-resolution interfaces, i.e., interacting precisely with the intended visual targets. This paper investigates both challenges with our GUI Test-time Scaling Agent, GTA1. First, to select the most appropriate action proposal, we introduce a test-time scaling method. At each step, we sample multiple candidate action proposals and use a judge model to evaluate them and select the most suitable one. Through concurrent sampling, this trades additional computation for better decision quality, which shortens task execution and improves overall performance. Second, we propose a model that grounds the selected action proposal to its corresponding visual elements with improved accuracy. Our key insight is that reinforcement learning (RL) facilitates visual grounding through inherent objective alignment: it rewards successful clicks on interface elements. Experimentally, our method establishes state-of-the-art performance across diverse benchmarks. For example, GTA1-7B achieves accuracies of 50.1%, 92.4%, and 67.7% on Screenspot-Pro, Screenspot-V2, and OSWorld-G, respectively. When paired with a planner that applies our test-time scaling strategy, it exhibits state-of-the-art agentic performance (e.g., a 45.2% task success rate on OSWorld). We open-source our code and models.
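The two ideas in the abstract (per-step selection among sampled action proposals, and a click-based reward for grounding) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `select_action` and `click_reward`, the callable interfaces for the sampler and judge, the bounding-box format, and the binary reward value are all assumptions made for illustration. The sketch also samples sequentially for simplicity, whereas the abstract describes concurrent sampling.

```python
from typing import Callable, Sequence, Tuple


def select_action(
    sample: Callable[[], str],
    judge: Callable[[Sequence[str]], int],
    n_candidates: int = 4,
) -> str:
    """Sample several candidate action proposals and return the judge's pick.

    `sample` and `judge` are hypothetical stand-ins for a planner model and
    a judge model; `judge` returns the index of the preferred candidate.
    """
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    # Sequential sampling for simplicity; a real system could sample concurrently.
    candidates = [sample() for _ in range(n_candidates)]
    chosen = judge(candidates)
    return candidates[chosen]


def click_reward(
    click: Tuple[float, float],
    box: Tuple[float, float, float, float],
) -> float:
    """Return 1.0 if the click lands inside the target box, else 0.0.

    `box` is assumed to be (left, top, right, bottom) in pixel coordinates.
    """
    x, y = click
    left, top, right, bottom = box
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


if __name__ == "__main__":
    # Toy sampler that yields fixed proposals, and a toy judge that
    # prefers the proposal mentioning "Save".
    proposals = iter(["click 'File'", "press Ctrl+S", "click 'Save'"])

    def toy_judge(cands: Sequence[str]) -> int:
        for i, cand in enumerate(cands):
            if "Save" in cand:
                return i
        return 0

    print(select_action(lambda: next(proposals), toy_judge, n_candidates=3))
    print(click_reward((120.0, 48.0), (100.0, 40.0, 160.0, 60.0)))
```

Passing the sampler and judge as plain callables keeps the selection logic independent of any particular model API, so either component can be swapped without touching the loop.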
