VLAA-GUI: A Modular GUI Automation Framework That Knows When to Stop, Recover, and Search

Abstract

Autonomous GUI agents face two fundamental challenges: premature stopping, where an agent declares success without verifiable evidence, and repetitive loops, where an agent cycles through the same failing actions without recovering. We present VLAA-GUI, a modular GUI agent framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and runs verification at every finish step: an agent-level verifier cross-examines each completion claim against decision rules and rejects claims that lack direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: it switches the interaction mode after repeated failures, forces a strategy change after persistent screen-state recurrence, and binds reflection signals to strategy shifts. Third, an on-demand Search Agent looks up unfamiliar workflows online by directly querying a search-capable LLM and returns the results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand. We evaluate VLAA-GUI with five top-tier backbones, including Opus 4.5, Opus 4.6, and Gemini 3.1 Pro, on two benchmarks covering Linux and Windows tasks, and achieve top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
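
To make the Loop Breaker's tiered behavior concrete, the following is a minimal Python sketch of the first two tiers described above (mode switching after repeated failures, and a forced strategy change after screen-state recurrence). It is an illustration under stated assumptions, not the authors' implementation: the class name, the method `observe`, the threshold values, the returned signal strings, and the use of an exact screenshot hash to detect recurring screen states are all hypothetical. The third tier, binding reflection signals to strategy shifts, is not modeled.

```python
from collections import Counter
from dataclasses import dataclass, field
import hashlib


@dataclass
class LoopBreaker:
    # Hypothetical thresholds; the abstract does not state the actual values.
    max_repeated_failures: int = 2
    max_state_recurrences: int = 3
    _failures: Counter = field(default_factory=Counter)
    _states: Counter = field(default_factory=Counter)

    def observe(self, action: str, succeeded: bool, screenshot: bytes) -> str:
        """Record one step and return 'continue', 'switch_mode', or 'change_strategy'."""
        # Assumption: identical screenshot bytes mean an identical screen state.
        # A real system would more likely use a perceptual similarity measure.
        state = hashlib.sha256(screenshot).hexdigest()
        self._states[state] += 1

        if succeeded:
            self._failures.pop(action, None)
        else:
            self._failures[action] += 1

        # Higher tier first: the same screen keeps coming back.
        if self._states[state] >= self.max_state_recurrences:
            self._states.clear()
            self._failures.clear()
            return "change_strategy"

        # Lower tier: the same action keeps failing.
        if self._failures[action] >= self.max_repeated_failures:
            self._failures[action] = 0
            return "switch_mode"

        return "continue"
```

In this sketch the caller would feed every executed step into `observe` and act on the returned signal, for example by moving from mouse clicks to keyboard shortcuts on `switch_mode`, or by asking the planner for a new plan on `change_strategy`. Checking the recurrence tier before the failure tier is a design choice of the sketch: a recurring screen suggests that the whole approach is stuck, not only one action.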