GTA1: GUI Test-time Scaling Agent
July 8, 2025
Authors: Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Caiming Xiong, Junnan Li
cs.AI
Abstract
Graphical user interface (GUI) agents autonomously operate across platforms (e.g., Linux) to complete tasks by interacting with visual elements. Specifically, a user instruction is decomposed into a sequence of action proposals, each corresponding to an interaction with the GUI. After each action, the agent observes the updated GUI environment to plan the next step. However, two main challenges arise: i) resolving ambiguity in task planning (i.e., the action proposal sequence), where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, i.e., precisely interacting with visual targets.
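To make this loop concrete, below is a minimal sketch of the observe-plan-act cycle described above. The `env`, `planner`, and `grounder` objects and their methods are hypothetical placeholders introduced for illustration, not interfaces from the GTA1 release.

```python
# Minimal sketch of a GUI agent's observe-plan-act loop.
# `env`, `planner`, and `grounder` are assumed interfaces, not GTA1's actual API.

def run_episode(instruction, env, planner, grounder, max_steps=30):
    history = []
    for _ in range(max_steps):
        screenshot = env.observe()                                     # current GUI state
        proposal = planner.propose(instruction, screenshot, history)   # next action proposal (text)
        if proposal.is_done:                                           # planner signals task completion
            break
        x, y = grounder.locate(proposal.text, screenshot)              # ground the proposal to pixel coordinates
        env.execute(proposal.action_type, x, y)                        # e.g., click or type at (x, y)
        history.append(proposal)
    return history
```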
This paper addresses the two aforementioned challenges with our GUI Test-time Scaling Agent, GTA1. First, to select the most appropriate action proposal, we introduce a test-time scaling method: at each step, we sample multiple candidate action proposals and use a judge model to evaluate and select the most suitable one. By sampling candidates concurrently, it trades additional computation for better decision quality, shortening task execution and improving overall performance. Second, we propose a model that grounds the selected action proposal to its corresponding visual elements with improved accuracy. Our key insight is that reinforcement learning (RL) facilitates visual grounding through inherent objective alignment, rewarding successful clicks on interface elements.
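The sketch below illustrates these two ideas in isolation: concurrent sampling of candidate proposals with judge-based selection, and a click-inside-the-target reward of the kind the RL objective aligns with. All names and signatures are illustrative assumptions, not the paper's actual code.

```python
# Sketch of one planning step with test-time scaling, plus a click-based reward
# for RL grounding. `planner` and `judge` are assumed model wrappers.
from concurrent.futures import ThreadPoolExecutor

def scaled_step(instruction, screenshot, history, planner, judge, n_candidates=8):
    # Sample several candidate action proposals concurrently.
    with ThreadPoolExecutor(max_workers=n_candidates) as pool:
        futures = [pool.submit(planner.propose, instruction, screenshot, history)
                   for _ in range(n_candidates)]
        candidates = [f.result() for f in futures]
    # A judge model scores each candidate; the highest-scoring one is executed.
    scores = [judge.score(instruction, screenshot, history, c) for c in candidates]
    return max(zip(scores, candidates), key=lambda sc: sc[0])[1]

def click_reward(predicted_xy, target_bbox):
    # Reward 1.0 if the predicted click lands inside the target element's
    # bounding box (x1, y1, x2, y2), else 0.0.
    x, y = predicted_xy
    x1, y1, x2, y2 = target_bbox
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0
```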
Experimentally, our method establishes state-of-the-art performance across diverse benchmarks. For example, GTA1-7B achieves 50.1%, 92.4%, and 67.7% accuracy on Screenspot-Pro, Screenspot-V2, and OSWorld-G, respectively. When paired with a planner applying our test-time scaling strategy, it exhibits state-of-the-art agentic performance (e.g., a 45.2% task success rate on OSWorld). We open-source our code and models here.