Computer-Use Agents as Judges for Generative User Interface

Abstract

Computer-Use Agents (CUAs) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUIs). Yet most GUIs are still designed primarily for humans, prioritizing aesthetics and usability, which forces agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coders) have transformed automatic GUI design. This raises a fundamental question: can CUAs act as judges that assist Coders in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward moving agents from passive use to active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
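
The designer/judge loop described in the abstract can be pictured with a minimal Python sketch. Every name here (`design_loop`, `coder_generate`, `coder_revise`, `cua_attempt`, the `Attempt` record, the round limit) is a hypothetical stand-in chosen for illustration and is not taken from the AUI codebase; the actual framework may organize the loop, the feedback, and the metrics differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """Outcome of one CUA attempt at one task (hypothetical record)."""
    task: str
    solved: bool
    feedback: str  # stand-in for a condensed summary of the navigation history


def success_rate(attempts: List[Attempt]) -> float:
    """Fraction of tasks the CUA completed; 0.0 when there are no tasks."""
    if not attempts:
        return 0.0
    return sum(a.solved for a in attempts) / len(attempts)


def design_loop(
    spec: str,
    tasks: List[str],
    coder_generate: Callable[[str], str],
    coder_revise: Callable[[str, List[str]], str],
    cua_attempt: Callable[[str, str], Attempt],
    max_rounds: int = 3,  # assumed budget, not a value from the paper
) -> Tuple[str, float]:
    """Alternate between the Coder (designer) and the CUA (judge).

    Returns the best-scoring site seen and its CUA success rate.
    """
    site = coder_generate(spec)
    best_site, best_rate = site, -1.0
    for _ in range(max_rounds):
        # The CUA tries every task on the current design.
        attempts = [cua_attempt(site, task) for task in tasks]
        rate = success_rate(attempts)
        if rate > best_rate:
            best_site, best_rate = site, rate
        # Feedback from failed tasks drives the next revision.
        failures = [a.feedback for a in attempts if not a.solved]
        if not failures:
            break
        site = coder_revise(site, failures)
    return best_site, best_rate


# Toy stand-ins so the sketch runs end to end without any model.
def toy_generate(spec: str) -> str:
    return "v0"


def toy_revise(site: str, failures: List[str]) -> str:
    return site + "+fix"


def toy_attempt(site: str, task: str) -> Attempt:
    solved = "fix" in site or task == "easy"
    return Attempt(task, solved, "" if solved else f"could not finish {task}")


site, rate = design_loop(
    "todo app", ["easy", "hard"], toy_generate, toy_revise, toy_attempt
)
print(site, rate)
```

In this toy run the first design fails one of two tasks, the failure message is handed back to the revise step, and the second round solves both tasks. Passing the Coder and CUA in as plain callables keeps the sketch independent of any particular model or browser harness.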
