
Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills

June 12, 2025
Authors: Yuquan Xie, Zaijing Li, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Dongmei Jiang, Liqiang Nie
cs.AI

Abstract

Recent efforts to use Multimodal Large Language Models (MLLMs) as GUI agents have yielded promising results. However, these agents still struggle with long-horizon tasks in online environments, primarily because of insufficient knowledge and the inherent gap between offline and online domains. In this paper, inspired by how humans generalize knowledge in open-ended environments, we propose a Hierarchical Multimodal Skills (HMS) module to address the problem of insufficient knowledge. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which efficiently leverages skills acquired in offline environments to reduce the action search space during online tree exploration. Building on HMS, we propose Mirage-1, a multimodal, cross-platform, plug-and-play GUI agent. To validate the performance of Mirage-1 in real-world long-horizon scenarios, we constructed a new benchmark, AndroidLH. Experimental results show that Mirage-1 outperforms previous agents by 32%, 19%, 15%, and 79% on AndroidWorld, MobileMiniWob++, Mind2Web-Live, and AndroidLH, respectively. Project page: https://cybertronagent.github.io/Mirage-1.github.io/
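
The abstract's idea of using previously acquired skills to shrink the action space during tree search can be illustrated with a toy sketch. This is not the authors' SA-MCTS implementation: the action names, the `Skill` class, the `candidate_actions` helper, and the flat (non-hierarchical) skill store are all assumptions made for illustration. A plain UCT-style search is run on a made-up three-step task, and stored skills only restrict which actions get expanded.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical toy task: reach GOAL by choosing one action per step.
ACTIONS = ["open_app", "tap_search", "type_query", "scroll", "go_back", "tap_home"]
GOAL = ("open_app", "tap_search", "type_query")

@dataclass
class Skill:
    """A stored action sequence (a flat stand-in for an execution skill)."""
    name: str
    steps: tuple

def candidate_actions(prefix, skills):
    """Actions that some skill proposes after `prefix`; all actions if none match."""
    proposed = []
    for s in skills:
        if s.steps[:len(prefix)] == tuple(prefix) and len(s.steps) > len(prefix):
            nxt = s.steps[len(prefix)]
            if nxt not in proposed:
                proposed.append(nxt)
    return proposed or list(ACTIONS)

@dataclass
class Node:
    prefix: tuple
    parent: Optional["Node"] = None
    children: dict = field(default_factory=dict)
    visits: int = 0
    value: float = 0.0

def uct_child(node, c=1.4):
    """Pick the child with the highest UCT score (all children have visits >= 1)."""
    return max(
        node.children.values(),
        key=lambda ch: ch.value / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )

def search(skills, iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(prefix=())
    for _ in range(iterations):
        node = root
        # Selection and expansion, restricted to skill-proposed actions.
        while len(node.prefix) < len(GOAL):
            options = candidate_actions(node.prefix, skills)
            untried = [a for a in options if a not in node.children]
            if untried:
                a = rng.choice(untried)
                node.children[a] = Node(prefix=node.prefix + (a,), parent=node)
                node = node.children[a]
                break
            node = uct_child(node)
        # Rollout: complete the episode with random (skill-filtered) actions.
        prefix = node.prefix
        while len(prefix) < len(GOAL):
            prefix = prefix + (rng.choice(candidate_actions(prefix, skills)),)
        r = 1.0 if prefix == GOAL else 0.0
        # Backpropagation.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # Read off the most-visited path from the root.
    path, node = [], root
    while node.children:
        a, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        path.append(a)
    return tuple(path)

skills = [Skill("search_in_app", GOAL)]
best_path = search(skills)
```

With the single stored skill, each tree level expands one action instead of six, so the search follows the skill's sequence directly; with an empty skill list the same code falls back to exploring all actions at every step.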