ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

Abstract

GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, and thereby reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present ClawGUI, an open-source framework that addresses these three gaps within a single harness. ClawGUI-RL provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. ClawGUI-Eval enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8% reproduction against official baselines. ClawGUI-Agent brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, ClawGUI-2B achieves a 17.1% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.
