ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

Abstract

GUI agents drive applications through their visual interfaces instead of programmatic APIs. By interacting with arbitrary software via taps, swipes, and keystrokes, they reach a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present ClawGUI, an open-source framework addressing these three gaps within a single harness. ClawGUI-RL provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. ClawGUI-Eval enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8% reproduction against official baselines. ClawGUI-Agent brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, ClawGUI-2B achieves a 17.1% success rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.
