UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

May 28, 2026
Authors: Yuxiang Chai, Han Xiao, Xinyu Fu, Jinpeng Chen, Rui Liu, Hongsheng Li
cs.AI

Abstract

Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and long-horizon planning. Small GUI agents that can be deployed directly on mobile devices are more attractive for practical use, offering lower inference cost and better protection of sensitive on-device information. However, due to limited model capacity, such lightweight agents remain unreliable when planning and executing GUI tasks end-to-end from screenshots alone. We propose Knowledge-Oriented Behavior Exploration (UI-KOBE), a framework that improves lightweight mobile GUI agents with reusable app-specific graph knowledge. UI-KOBE first autonomously explores a mobile application and constructs an app knowledge graph, where nodes represent distinct UI states and edges represent executable transitions. At runtime, a lightweight GUI agent uses the graph as external guidance: given a user task and the current screenshot, it identifies the current graph node and selects among self-loop actions, neighboring transitions, task completion, or fallback free actions associated with that node. By supporting runtime decisions with app-specific graph guidance, UI-KOBE reduces the burden of end-to-end GUI planning and helps lightweight models perform mobile GUI tasks more effectively, offering a practical step toward efficient, interpretable, and privacy-conscious on-device GUI agents.
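
The runtime step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class names (`Node`, `AppGraph`), the field names, the action strings, and the two-screen demo app are all hypothetical and chosen only for illustration. The sketch assumes the current node has already been identified from the screenshot, and it shows only how the four option types named in the abstract (self-loop actions, neighboring transitions, task completion, fallback free action) might be enumerated for a lightweight model to choose from.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One UI state of the app (hypothetical schema, not from the paper)."""
    node_id: str
    description: str
    # Actions that keep the agent on the same screen, e.g. scrolling.
    self_loops: list[str] = field(default_factory=list)
    # Actions that lead to another UI state: action name -> destination node id.
    transitions: dict[str, str] = field(default_factory=dict)


@dataclass
class AppGraph:
    """App knowledge graph: nodes are UI states, edges are executable transitions."""
    nodes: dict[str, Node] = field(default_factory=dict)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def candidates(self, node_id: str) -> list[tuple[str, str]]:
        """Return (kind, action) options available at the given node."""
        node = self.nodes[node_id]
        options = [("self_loop", action) for action in node.self_loops]
        options += [("transition", action) for action in node.transitions]
        # Task completion and a free-form fallback are always offered.
        options += [("complete", "finish_task"), ("fallback", "free_action")]
        return options


# Hypothetical two-screen app used only to exercise the sketch.
graph = AppGraph()
graph.add_node(Node("home", "Main feed",
                    self_loops=["scroll_feed"],
                    transitions={"tap_settings": "settings"}))
graph.add_node(Node("settings", "Settings page",
                    transitions={"tap_back": "home"}))

for kind, action in graph.candidates("home"):
    print(kind, action)
```

In a sketch like this, the model's job at each step shrinks from open-ended planning to choosing one entry from a short, node-specific list, which is the kind of burden reduction the abstract describes.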