GTA1: GUI Test-time Scaling Agent
July 8, 2025
Authors: Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Caiming Xiong, Junnan Li
cs.AI
Abstract
Graphical user interface (GUI) agents autonomously operate across platforms
(e.g., Linux) to complete tasks by interacting with visual elements.
Specifically, a user instruction is decomposed into a sequence of action
proposals, each corresponding to an interaction with the GUI. After each
action, the agent observes the updated GUI environment to plan the next step.
However, two main challenges arise: i) resolving ambiguity in task planning
(i.e., the action proposal sequence), where selecting an appropriate plan is
non-trivial, as many valid ones may exist; ii) accurately grounding actions in
complex and high-resolution interfaces, i.e., precisely interacting with visual
targets.
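
To make the loop described above concrete, the following Python sketch shows one way such an observe-propose-act cycle could be structured. The `Action` type and the `capture_screenshot`, `propose_action`, and `execute` helpers are hypothetical placeholders for illustration, not part of the GTA1 release.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # e.g. "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def capture_screenshot() -> bytes:
    """Placeholder: grab the current GUI state (e.g. an OS-level screenshot)."""
    return b""

def propose_action(instruction: str, screenshot: bytes) -> Action:
    """Placeholder: query a planner model for the next action proposal."""
    return Action(kind="done")

def execute(action: Action) -> None:
    """Placeholder: dispatch the action to the GUI as mouse/keyboard events."""

def run_agent(instruction: str, max_steps: int = 30) -> bool:
    """Observe-propose-act loop: after each action the agent re-observes the GUI
    and plans the next step, until the task is declared done or the budget runs out."""
    for _ in range(max_steps):
        screenshot = capture_screenshot()
        action = propose_action(instruction, screenshot)
        if action.kind == "done":
            return True
        execute(action)
    return False
```
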
This paper investigates the two aforementioned challenges with our GUI
Test-time Scaling Agent, namely GTA1. First, to select the most appropriate
action proposal, we introduce a test-time scaling method. At each step, we
sample multiple candidate action proposals and leverage a judge model to
evaluate and select the most suitable one. It trades additional computation for
better decision quality via concurrent sampling, shortening task execution and
improving overall performance. Second, we propose a model that achieves
improved accuracy when grounding the selected action proposal to its
corresponding visual elements. Our key insight is that reinforcement learning
(RL) facilitates visual grounding through inherent objective alignments,
rewarding successful clicks on interface elements.
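
The sketch below illustrates both ideas under assumed interfaces: `select_proposal`, `sample_proposals`, `judge_score`, `Proposal`, and `click_reward` are hypothetical names, and the exact shape of the grounding reward (1.0 when the predicted click lands inside the target element's bounding box, 0.0 otherwise) is an assumption consistent with "rewarding successful clicks," not the paper's verbatim formulation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Proposal:
    description: str     # natural-language action proposal, e.g. "click the Save button"

def select_proposal(
    sample_proposals: Callable[[str, bytes, int], List[Proposal]],
    judge_score: Callable[[str, bytes, Proposal], float],
    instruction: str,
    screenshot: bytes,
    num_candidates: int = 8,
) -> Proposal:
    """Test-time scaling: sample several candidate proposals (concurrently in practice)
    and keep the one the judge model scores highest."""
    candidates = sample_proposals(instruction, screenshot, num_candidates)
    return max(candidates, key=lambda p: judge_score(instruction, screenshot, p))

def click_reward(pred_xy: Tuple[float, float],
                 target_box: Tuple[float, float, float, float]) -> float:
    """Assumed form of the RL grounding reward: 1.0 if the predicted click falls inside
    the target element's bounding box (x1, y1, x2, y2), else 0.0."""
    x, y = pred_xy
    x1, y1, x2, y2 = target_box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

# Example: a click at (120, 45) inside the box (100, 30, 200, 60) earns reward 1.0.
```
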
Experimentally, our method establishes state-of-the-art performance across
diverse benchmarks. For example, GTA1-7B achieves 50.1%, 92.4%, and 67.7%
accuracies on Screenspot-Pro, Screenspot-V2, and OSWorld-G, respectively. When
paired with a planner applying our test-time scaling strategy, it exhibits
state-of-the-art agentic performance (e.g., 45.2% task success rate on
OSWorld). We open-source our code and models.