WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
February 12, 2025
Authors: Henry Hengyuan Zhao, Difei Gao, Mike Zheng Shou
cs.AI
Abstract
Current GUI agents have achieved outstanding performance in GUI element
grounding. However, planning remains highly challenging, especially due to
sensitivity to the initial state of the environment. Specifically, slight
differences in the initial state, such as the target software not being open or
the interface not being in its default state, often lead to planning errors.
This issue is widespread in real user scenarios, but existing benchmarks fail
to evaluate it. In this paper, we present WorldGUI, a novel GUI benchmark that
designs GUI tasks with various initial states to simulate real computer-user
interactions. The benchmark spans a wide range of tasks across 10 popular
software applications, including PowerPoint, VSCode, and Adobe Acrobat. In
addition, to address the challenges of dynamic GUI automation tasks, we propose
GUI-Thinker, a holistic framework that leverages a critique mechanism to
manage the unpredictability and complexity of GUI interactions effectively.
Experimental results demonstrate that GUI-Thinker significantly outperforms
Claude-3.5 (Computer Use) by 14.9% in success rate on WorldGUI tasks. This
improvement underscores the effectiveness of our critical-thinking-based
framework in enhancing GUI automation.