UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

Abstract

Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and long-horizon planning. Small GUI agents that can be deployed directly on mobile devices are attractive in practice, offering lower inference cost and better protection of sensitive on-device information. However, because of their limited model capacity, such lightweight agents remain unreliable when planning and executing GUI tasks end-to-end from screenshots alone. We propose Knowledge-Oriented Behavior Exploration (UI-KOBE), a framework that improves lightweight mobile GUI agents with reusable app-specific graph knowledge. UI-KOBE first autonomously explores a mobile application and constructs an app knowledge graph in which nodes represent distinct UI states and edges represent executable transitions. At runtime, a lightweight GUI agent uses the graph as external guidance: given a user task and the current screenshot, it identifies the current graph node and then chooses among the self-loop actions and neighboring transitions associated with that node, task completion, and a fallback free action. By supporting runtime decisions with app-specific graph guidance, UI-KOBE reduces the burden of end-to-end GUI planning and helps lightweight models perform mobile GUI tasks more effectively, offering a practical step toward efficient, interpretable, and privacy-conscious on-device GUI agents.
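To make the graph-guided decision step concrete, the following is a minimal sketch of an app knowledge graph and of the option set exposed at a given node. It is an illustration under stated assumptions, not the UI-KOBE implementation: the names `UINode`, `AppGraph`, `candidates`, and `build_toy_graph`, the action strings, and the toy two-screen app are all hypothetical, and the step that maps a screenshot to a node (done by the lightweight model) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class UINode:
    """One distinct UI state (hypothetical structure, not from the paper)."""
    node_id: str
    description: str
    # Actions that keep the app in the same UI state, e.g. scrolling.
    self_loops: list[str] = field(default_factory=list)


@dataclass
class AppGraph:
    """App knowledge graph: nodes are UI states, edges are executable transitions."""
    nodes: dict[str, UINode] = field(default_factory=dict)
    # edges[src][action] = dst
    edges: dict[str, dict[str, str]] = field(default_factory=dict)

    def add_node(self, node_id: str, description: str, self_loops=()) -> None:
        self.nodes[node_id] = UINode(node_id, description, list(self_loops))
        self.edges.setdefault(node_id, {})

    def add_edge(self, src: str, action: str, dst: str) -> None:
        self.edges.setdefault(src, {})[action] = dst

    def candidates(self, node_id: str) -> list[tuple[str, str]]:
        """List the options available at a node as (kind, action) pairs:
        self-loop actions, neighboring transitions, task completion,
        and a fallback free action."""
        node = self.nodes[node_id]
        options = [("self_loop", a) for a in node.self_loops]
        options += [("transition", a) for a in self.edges.get(node_id, {})]
        options.append(("complete", "finish_task"))
        options.append(("fallback", "free_action"))
        return options


def build_toy_graph() -> AppGraph:
    """Build a two-screen toy app; in the paper the graph comes from autonomous exploration."""
    g = AppGraph()
    g.add_node("home", "Home screen")
    g.add_node("settings", "Settings list", self_loops=["scroll down"])
    g.add_edge("home", "tap Settings", "settings")
    g.add_edge("settings", "tap Back", "home")
    return g
```

In this sketch, once the current node is identified, the agent only has to pick one entry from `candidates(node_id)` rather than plan the whole task from raw pixels, which is the reduction in planning burden the abstract describes.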