VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
April 23, 2026
Authors: Qijun Han, Haoqin Tu, Zijun Wang, Haoyue Dai, Yiyang Zhou, Nancy Lau, Alvaro A. Cardenas, Yuhui Xu, Ran Xu, Caiming Xiong, Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, Cihang Xie
cs.AI
Abstract
Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agent framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step, with an agent-level verifier that cross-examines completion claims using decision rules and rejects those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent looks up unfamiliar workflows online by directly querying a search-capable LLM and returns the results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand. We evaluate VLAA-GUI with five top-tier backbones, including Opus 4.5, Opus 4.6, and Gemini 3.1 Pro, on two benchmarks covering Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
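To make the Loop Breaker's three tiers concrete, the following is a minimal Python sketch of how such a filter might be organized. It is not the authors' implementation: the class name, the `observe` method, the directive strings, the thresholds (2 failures, 3 identical screens), and the idea of comparing screens by a hash are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoopBreaker:
    """Illustrative multi-tier loop detector (hypothetical names and thresholds)."""

    max_action_failures: int = 2  # assumed: failures of one action before switching mode
    max_state_repeats: int = 3    # assumed: identical screens before forcing a new strategy
    failures: Counter = field(default_factory=Counter)
    state_seen: Counter = field(default_factory=Counter)

    def observe(
        self,
        action: str,
        succeeded: bool,
        screen_hash: str,
        reflection_flags_loop: bool = False,
    ) -> Optional[str]:
        """Record one step and return a directive, or None to continue as planned."""
        if not succeeded:
            self.failures[action] += 1
        self.state_seen[screen_hash] += 1

        # Reflection signal bound to a strategy shift: if the agent's own
        # reflection reports a loop, a strategy change is required.
        if reflection_flags_loop:
            self.state_seen.clear()
            return "change_strategy"

        # Persistent screen-state recurrence: the same screen keeps coming back.
        if self.state_seen[screen_hash] >= self.max_state_repeats:
            self.state_seen.clear()
            return "change_strategy"

        # Repeated failures of the same action: switch interaction mode
        # (for example, from mouse clicks to keyboard shortcuts).
        if self.failures[action] >= self.max_action_failures:
            del self.failures[action]
            return "switch_interaction_mode"

        return None


# Usage: the same click fails twice, so the sketch asks for a mode switch.
breaker = LoopBreaker()
first = breaker.observe("click(120, 40)", succeeded=False, screen_hash="s1")
second = breaker.observe("click(120, 40)", succeeded=False, screen_hash="s2")
print(first, second)
```

In this sketch the counters are reset whenever a directive is issued, so a single detected loop does not trigger the same directive on every subsequent step; whether VLAA-GUI resets state this way is not stated in the abstract.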