ShowUI-Aloha: Human-Taught GUI Agent
January 12, 2026
Authors: Yichun Zhang, Xiangwu Guo, Yauhong Goh, Jessica Hu, Zhiheng Chen, Xin Wang, Difei Gao, Mike Zheng Shou
cs.AI
Abstract
Graphical User Interfaces (GUIs) are central to human-computer interaction, yet automating complex GUI tasks remains a major challenge for autonomous agents, largely due to a lack of scalable, high-quality training data. While recordings of human demonstrations offer a rich data source, they are typically long, unstructured, and unannotated, making them difficult for agents to learn from. To address this, we introduce ShowUI-Aloha, a comprehensive pipeline that transforms unstructured, in-the-wild human screen recordings from desktop environments into structured, actionable tasks. Our framework comprises four key components: a recorder that captures screen video along with precise user interactions such as mouse clicks, keystrokes, and scrolls; a learner that semantically interprets these raw interactions and the surrounding visual context, translating them into descriptive natural-language captions; a planner that reads the parsed demonstrations, maintains task state, and dynamically formulates the next high-level action plan based on contextual reasoning; and an executor that faithfully carries out these plans at the OS level, performing precise clicks, drags, text inputs, and window operations with safety checks and real-time feedback. Together, these components provide a scalable solution for collecting and parsing real-world human data, demonstrating a viable path toward building general-purpose GUI agents that learn effectively from simply observing humans.
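To make the recorder stage concrete, the sketch below logs mouse clicks, scrolls, and keystrokes alongside a screenshot taken at the moment of each click, then dumps the trace as JSON. This is only a minimal illustration of the kind of event-plus-frame data such a recorder would capture, assuming the pynput library for input hooks and mss for screen capture; the paper does not state which libraries or file formats ShowUI-Aloha's recorder actually uses, and names like `snapshot` and `trace.json` are hypothetical.

```python
# Minimal recorder sketch (assumed libraries: pynput, mss).
# Logs raw interactions with timestamps and, for clicks, a screen frame,
# producing the kind of unstructured trace the learner would later caption.
import json
import time

import mss
import mss.tools
from pynput import keyboard, mouse

events = []  # chronological log of raw interactions


def snapshot(tag: str) -> str:
    """Save a full-screen capture and return its file name."""
    with mss.mss() as sct:
        img = sct.grab(sct.monitors[1])          # primary monitor
        path = f"frame_{tag}_{int(time.time() * 1000)}.png"
        mss.tools.to_png(img.rgb, img.size, output=path)
        return path


def on_click(x, y, button, pressed):
    if pressed:  # record the press, not the release
        events.append({"t": time.time(), "kind": "click",
                       "pos": [x, y], "button": str(button),
                       "frame": snapshot("click")})


def on_scroll(x, y, dx, dy):
    events.append({"t": time.time(), "kind": "scroll",
                   "pos": [x, y], "delta": [dx, dy]})


def on_press(key):
    events.append({"t": time.time(), "kind": "keystroke", "key": str(key)})


# Listen to mouse and keyboard for a fixed window, then dump the trace.
with mouse.Listener(on_click=on_click, on_scroll=on_scroll), \
     keyboard.Listener(on_press=on_press):
    time.sleep(30)

with open("trace.json", "w") as f:
    json.dump(events, f, indent=2)
```

In a pipeline of the kind the abstract describes, a learner stage would then pair each logged event with its captured frame and emit a natural-language caption (e.g. "click the Save button in the toolbar"), which is the structured annotation a planner and executor could consume.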