UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

Abstract

Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and long-horizon planning. Small GUI agents that can be deployed directly on mobile devices are more attractive for practical use, offering lower inference cost and better protection of sensitive on-device information. However, due to limited model capacity, such lightweight agents remain unreliable when planning and executing GUI tasks end-to-end from screenshots alone. We propose Knowledge-Oriented Behavior Exploration (UI-KOBE), a framework that improves lightweight mobile GUI agents with reusable app-specific graph knowledge. UI-KOBE first autonomously explores a mobile application and constructs an app knowledge graph, where nodes represent distinct UI states and edges represent executable transitions. At runtime, a lightweight GUI agent uses the graph as external guidance: given a user task and the current screenshot, it identifies the current graph node and selects among self-loop actions, neighboring transitions, task completion, or fallback free actions associated with that node. By supporting runtime decisions with app-specific graph guidance, UI-KOBE reduces the burden of end-to-end GUI planning and helps lightweight models perform mobile GUI tasks more effectively, offering a practical step toward efficient, interpretable, and privacy-conscious on-device GUI agents.
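
The runtime decision space described above can be illustrated with a minimal sketch. The abstract does not specify data structures, so every name here (`AppGraph`, `Candidate`, `ActionKind`, the example node ids and action strings) is a hypothetical choice made for illustration, not the authors' implementation. The sketch also assumes the current graph node has already been identified from the screenshot, and it only enumerates the options a lightweight agent would choose among: self-loop actions, neighboring transitions, task completion, and a fallback free action.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ActionKind(Enum):
    SELF_LOOP = "self_loop"    # acts on the screen but stays in the same UI state
    TRANSITION = "transition"  # follows an edge to a neighboring UI state
    COMPLETE = "complete"      # declares the user task finished
    FREE = "free"              # fallback: an action not recorded in the graph


@dataclass(frozen=True)
class Candidate:
    kind: ActionKind
    description: str
    target: Optional[str] = None  # destination node id, set only for transitions


@dataclass
class AppGraph:
    # Maps a node id to its outgoing (action description, destination node id) pairs.
    edges: dict = field(default_factory=dict)

    def add_edge(self, src: str, action: str, dst: str) -> None:
        """Record an executable transition discovered during exploration."""
        self.edges.setdefault(src, []).append((action, dst))
        self.edges.setdefault(dst, [])  # make sure the destination node exists

    def candidates(self, node: str) -> list:
        """List the options available to the agent at the given node."""
        options = []
        for action, dst in self.edges.get(node, []):
            if dst == node:
                options.append(Candidate(ActionKind.SELF_LOOP, action))
            else:
                options.append(Candidate(ActionKind.TRANSITION, action, dst))
        # Completion and the free-action fallback are always offered.
        options.append(Candidate(ActionKind.COMPLETE, "report task complete"))
        options.append(Candidate(ActionKind.FREE, "free-form action"))
        return options


# Hypothetical three-edge graph for an imaginary app.
graph = AppGraph()
graph.add_edge("home", "tap Settings icon", "settings")
graph.add_edge("home", "scroll feed", "home")
graph.add_edge("settings", "tap Back", "home")

home_options = graph.candidates("home")
```

In this sketch the agent's job at each step shrinks from open-ended planning to picking one entry from `home_options`, which is one plausible reading of how graph guidance could reduce the burden on a small model. An unknown node yields only the completion and free-action options, so the fallback path remains available when node identification fails.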