UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

May 28, 2026
Authors: Yuxiang Chai, Han Xiao, Xinyu Fu, Jinpeng Chen, Rui Liu, Hongsheng Li
cs.AI

Abstract

Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and long-horizon planning. Small GUI agents that can be deployed directly on mobile devices are more attractive for practical use, offering lower inference cost and better protection of sensitive on-device information. However, due to limited model capacity, such lightweight agents remain unreliable when planning and executing GUI tasks end-to-end from screenshots alone. We propose Knowledge-Oriented Behavior Exploration (UI-KOBE), a framework that improves lightweight mobile GUI agents with reusable app-specific graph knowledge. UI-KOBE first autonomously explores a mobile application and constructs an app knowledge graph, where nodes represent distinct UI states and edges represent executable transitions. At runtime, a lightweight GUI agent uses the graph as external guidance: given a user task and the current screenshot, it identifies the current graph node and selects among self-loop actions, neighboring transitions, task completion, or fallback free actions associated with that node. By supporting runtime decisions with app-specific graph guidance, UI-KOBE reduces the burden of end-to-end GUI planning and helps lightweight models perform mobile GUI tasks more effectively, offering a practical step toward efficient, interpretable, and privacy-conscious on-device GUI agents.
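
The graph structure and the runtime choice set described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names (`Node`, `AppGraph`), the method `candidates`, and the example screens and actions are all hypothetical, chosen only to show how nodes as UI states, edges as executable transitions, and the four option types (self-loop actions, neighboring transitions, task completion, fallback free actions) might fit together.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One distinct UI state; self_loops are actions that stay on this state."""
    node_id: str
    description: str
    self_loops: list[str] = field(default_factory=list)


@dataclass
class AppGraph:
    """Hypothetical app knowledge graph: UI states plus executable transitions."""
    nodes: dict[str, Node] = field(default_factory=dict)
    # edges[src] holds (action, destination node id) pairs
    edges: dict[str, list[tuple[str, str]]] = field(default_factory=dict)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, [])

    def add_edge(self, src: str, action: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[src].append((action, dst))

    def candidates(self, node_id: str) -> list[tuple[str, str]]:
        """List the options an agent could choose from at the given node."""
        node = self.nodes[node_id]
        options = [("self_loop", action) for action in node.self_loops]
        options += [
            ("transition", f"{action} -> {dst}")
            for action, dst in self.edges[node_id]
        ]
        # Assumed to be always available, regardless of the node.
        options.append(("complete", "task finished"))
        options.append(("free_action", "fallback"))
        return options


# Tiny made-up app with two screens.
graph = AppGraph()
graph.add_node(Node("home", "Home screen", self_loops=["scroll down"]))
graph.add_node(Node("settings", "Settings screen"))
graph.add_edge("home", "tap settings icon", "settings")

options = graph.candidates("home")
for kind, detail in options:
    print(kind, detail)
```

In a setup like this, the lightweight model would only need to localize the current screenshot to a node and pick one entry from a short candidate list, rather than plan the whole task from pixels. How UI-KOBE actually matches screenshots to nodes and ranks the options is not specified in the abstract.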