Computer-Use Agents as Judges for Generative User Interface

Abstract

Computer-Use Agents (CUAs) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUIs). Yet most GUIs remain designed primarily for humans, prioritizing aesthetics and usability, which forces agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coders) have transformed automatic GUI design. This raises a fundamental question: can CUAs act as judges that assist Coders in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward moving agents from passive use to active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
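
The Designer/Judge loop described above can be illustrated with a minimal sketch. It is not taken from the AUI repository: the names `JudgeReport`, `rate`, and `collaborate`, the callables `generate`, `revise`, and `judge`, and the choice to compute both metrics as simple per-task fractions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class JudgeReport:
    """Hypothetical per-task outcome of one judging pass over a site."""
    solvable: Sequence[bool]   # task is executable on the site (verifier check)
    navigated: Sequence[bool]  # CUA completed the task by navigating the site


def rate(flags: Sequence[bool]) -> float:
    """Fraction of tasks flagged True; 0.0 for an empty task list."""
    return sum(flags) / len(flags) if flags else 0.0


def collaborate(
    generate: Callable[[Sequence[Any]], Any],        # Coder: tasks -> initial site
    revise: Callable[[Any, JudgeReport], Any],       # Coder: site + feedback -> new site
    judge: Callable[[Any, Sequence[Any]], JudgeReport],  # CUA: site + tasks -> report
    tasks: Sequence[Any],
    max_rounds: int = 3,
) -> Tuple[Any, List[Tuple[float, float]]]:
    """Alternate between judging and revising for at most max_rounds rounds.

    Returns the final site and, per round, the pair
    (task solvability, CUA navigation success rate).
    """
    site = generate(tasks)
    history: List[Tuple[float, float]] = []
    for _ in range(max_rounds):
        report = judge(site, tasks)
        history.append((rate(report.solvable), rate(report.navigated)))
        if all(report.navigated):
            break  # every task navigated successfully; stop revising
        site = revise(site, report)
    return site, history
```

In this sketch the Judge's report is passed to `revise` unchanged; in the framework described above, that feedback would instead be condensed by the CUA Dashboard before it reaches the Coder.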
