ShowUI-Aloha: Human-Taught GUI Agent

Abstract

Graphical User Interfaces (GUIs) are central to human-computer interaction, yet automating complex GUI tasks remains a major challenge for autonomous agents, largely due to a lack of scalable, high-quality training data. While recordings of human demonstrations offer a rich data source, they are typically long, unstructured, and unannotated, making them difficult for agents to learn from. To address this, we introduce ShowUI-Aloha, a comprehensive pipeline that transforms unstructured, in-the-wild human screen recordings from desktop environments into structured, actionable tasks. Our framework includes four key components: a recorder that captures screen video along with precise user interactions such as mouse clicks, keystrokes, and scrolls; a learner that semantically interprets these raw interactions and the surrounding visual context, translating them into descriptive natural language captions; a planner that reads the parsed demonstrations, maintains task states, and dynamically formulates the next high-level action plan based on contextual reasoning; and an executor that faithfully carries out these action plans at the OS level, performing precise clicks, drags, text inputs, and window operations with safety checks and real-time feedback. Together, these components provide a scalable solution for collecting and parsing real-world human data, demonstrating a viable path toward building general-purpose GUI agents that can learn effectively simply by observing humans.
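To make the four-stage data flow concrete, here is a minimal Python sketch of how recorded events might pass from a learner to a planner to an executor. It is an illustration under stated assumptions, not the ShowUI-Aloha implementation: every name in it (RawEvent, Step, learn, Planner, execute, run_pipeline) is hypothetical, the learner uses fixed caption templates where the paper describes semantic interpretation of visual context, the planner simply replays steps in order instead of reasoning about context, and the executor only performs a bounds check without sending any real OS input.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RawEvent:
    """One recorded interaction (hypothetical schema, not from the paper)."""
    kind: str                  # "click", "key", or "scroll"
    timestamp: float           # seconds since the recording started
    x: Optional[int] = None    # pointer position, for clicks
    y: Optional[int] = None
    key: Optional[str] = None  # key name, for keystrokes


@dataclass
class Step:
    """Learner output: a raw event paired with a natural-language caption."""
    event: RawEvent
    caption: str


def learn(events: List[RawEvent]) -> List[Step]:
    """Stand-in learner: template captions instead of visual interpretation."""
    steps = []
    for e in events:
        if e.kind == "click":
            caption = f"Click at ({e.x}, {e.y})"
        elif e.kind == "key":
            caption = f"Press the '{e.key}' key"
        else:
            caption = f"Perform a {e.kind} action"
        steps.append(Step(event=e, caption=caption))
    return steps


@dataclass
class Planner:
    """Stand-in planner: keeps a cursor as task state and replays steps in order."""
    steps: List[Step]
    cursor: int = 0

    def next_action(self) -> Optional[Step]:
        if self.cursor >= len(self.steps):
            return None  # demonstration exhausted, task finished
        step = self.steps[self.cursor]
        self.cursor += 1
        return step


def execute(step: Step, screen_w: int, screen_h: int) -> bool:
    """Stand-in executor: runs a safety check only, sends no real input."""
    e = step.event
    if e.kind == "click":
        # Safety check: refuse clicks with missing or off-screen coordinates.
        return (
            e.x is not None
            and e.y is not None
            and 0 <= e.x < screen_w
            and 0 <= e.y < screen_h
        )
    return True


def run_pipeline(events: List[RawEvent], screen_w: int = 1920, screen_h: int = 1080) -> List[str]:
    """Chain learner, planner, and executor, returning a feedback log."""
    planner = Planner(learn(events))
    log = []
    while (step := planner.next_action()) is not None:
        ok = execute(step, screen_w, screen_h)
        log.append(f"{'ok' if ok else 'blocked'}: {step.caption}")
    return log
```

Calling run_pipeline on a short list of RawEvent objects returns one log line per step, marking a click outside the assumed 1920x1080 screen as blocked. Keeping each stage behind a small interface like this would let a real recorder or a model-based learner replace the stand-ins without changing the loop.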
