WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

Abstract

Current GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, largely because it is sensitive to the initial state of the environment. Specifically, slight differences in the initial state, such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in real user scenarios, yet existing benchmarks fail to evaluate it. In this paper, we present WorldGUI, a novel GUI benchmark that designs GUI tasks with various initial states to simulate real computer-user interactions. The benchmark spans a wide range of tasks across 10 popular software applications, including PowerPoint, VSCode, and Adobe Acrobat. In addition, to address the challenges of dynamic GUI automation tasks, we propose GUI-Thinker, a holistic framework that leverages a critique mechanism to effectively manage the unpredictability and complexity of GUI interactions. Experimental results demonstrate that GUI-Thinker significantly outperforms Claude-3.5 (Computer Use) by 14.9% in success rate on WorldGUI tasks. This improvement underscores the effectiveness of our critical-thinking-based framework in enhancing GUI automation.
