VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
April 23, 2026
Authors: Qijun Han, Haoqin Tu, Zijun Wang, Haoyue Dai, Yiyang Zhou, Nancy Lau, Alvaro A. Cardenas, Yuhui Xu, Ran Xu, Caiming Xiong, Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, Cihang Xie
cs.AI
Abstract
Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand when required. We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, 4.6 and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
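The first two Loop Breaker tiers described above (switching interaction mode after repeated failures, and forcing a strategy change when the same screen state keeps recurring) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `LoopBreaker`, the method `observe`, the hint strings, the use of a screen hash to identify screen states, and both thresholds are hypothetical choices made for illustration. The third tier, binding reflection signals to strategy shifts, is not modeled here.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoopBreaker:
    """Hypothetical two-tier loop detector for a GUI agent (illustrative only)."""

    # Both thresholds are assumptions, not values reported in the paper.
    max_repeated_failures: int = 2
    max_state_recurrences: int = 3
    _last_failed_action: Optional[str] = None
    _failure_streak: int = 0
    _state_counts: Counter = field(default_factory=Counter)

    def observe(self, action: str, succeeded: bool, screen_hash: str) -> Optional[str]:
        """Record one agent step and return an intervention hint, or None."""
        # Count how often this screen state has been seen so far.
        self._state_counts[screen_hash] += 1

        # Track consecutive failures of the same action.
        if succeeded:
            self._last_failed_action = None
            self._failure_streak = 0
        elif action == self._last_failed_action:
            self._failure_streak += 1
        else:
            self._last_failed_action = action
            self._failure_streak = 1

        # Persistent screen-state recurrence: ask for a different strategy.
        # Checked first because it is the stronger intervention.
        if self._state_counts[screen_hash] >= self.max_state_recurrences:
            return "change_strategy"
        # Repeated failure of one action: ask for another interaction mode,
        # for example keyboard shortcuts instead of mouse clicks.
        if self._failure_streak >= self.max_repeated_failures:
            return "switch_interaction_mode"
        return None
```

In this sketch the caller would invoke `observe` once per agent step and pass any returned hint back into the agent's next prompt; how VLAA-GUI actually detects recurrence and applies the intervention is not specified in the abstract.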