

ShowUI-Aloha: Human-Taught GUI Agent

January 12, 2026
Authors: Yichun Zhang, Xiangwu Guo, Yauhong Goh, Jessica Hu, Zhiheng Chen, Xin Wang, Difei Gao, Mike Zheng Shou
cs.AI

Abstract

Graphical User Interfaces (GUIs) are central to human-computer interaction, yet automating complex GUI tasks remains a major challenge for autonomous agents, largely due to a lack of scalable, high-quality training data. While recordings of human demonstrations offer a rich data source, they are typically long, unstructured, and unannotated, making them difficult for agents to learn from. To address this, we introduce ShowUI-Aloha, a comprehensive pipeline that transforms unstructured, in-the-wild human screen recordings from desktop environments into structured, actionable tasks. Our framework comprises four key components: a recorder that captures screen video along with precise user interactions such as mouse clicks, keystrokes, and scrolls; a learner that semantically interprets these raw interactions and the surrounding visual context, translating them into descriptive natural-language captions; a planner that reads the parsed demonstrations, maintains task state, and dynamically formulates the next high-level action plan based on contextual reasoning; and an executor that faithfully carries out these action plans at the OS level, performing precise clicks, drags, text inputs, and window operations with safety checks and real-time feedback. Together, these components provide a scalable solution for collecting and parsing real-world human data, demonstrating a viable path toward building general-purpose GUI agents that learn effectively from simply observing humans.
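To make the recorder-to-learner step concrete, the sketch below shows one way raw interaction events could be grouped into natural-language action captions. This is a minimal illustration, not the paper's actual pipeline: the `RawEvent` schema, the grouping heuristic (merging consecutive single-character keystrokes into one "type" caption), and the caption wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RawEvent:
    """A hypothetical low-level event as a recorder might log it."""
    kind: str       # "click", "key", or "scroll"
    detail: str     # button name, key name, or scroll direction
    x: int = 0      # screen coordinates (clicks only)
    y: int = 0

def parse_events(events: List[RawEvent]) -> List[str]:
    """Group raw events into descriptive captions, merging keystroke runs."""
    captions: List[str] = []
    buffer: List[str] = []  # consecutive single-character keystrokes

    def flush() -> None:
        if buffer:
            captions.append(f'type "{"".join(buffer)}"')
            buffer.clear()

    for ev in events:
        if ev.kind == "key" and len(ev.detail) == 1:
            buffer.append(ev.detail)  # accumulate printable keystrokes
        else:
            flush()  # a non-typing event ends the current keystroke run
            if ev.kind == "click":
                captions.append(f"{ev.detail}-click at ({ev.x}, {ev.y})")
            elif ev.kind == "scroll":
                captions.append(f"scroll {ev.detail}")
            elif ev.kind == "key":
                captions.append(f'press "{ev.detail}"')  # special key
    flush()
    return captions
```

For example, a left click followed by typing "hi" and pressing Enter would yield three captions: a click with coordinates, a merged `type "hi"`, and a `press "Enter"`. A real learner would additionally ground each caption in the surrounding visual context, which this sketch omits.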
PDF · January 14, 2026