Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills
June 12, 2025
Authors: Yuquan Xie, Zaijing Li, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Dongmei Jiang, Liqiang Nie
cs.AI
Abstract
Recent efforts to use Multimodal Large Language Models (MLLMs) as GUI
agents have yielded promising results. However, these agents still struggle
with long-horizon tasks in online environments, primarily due to insufficient
knowledge and the inherent gap between offline and online domains. In this
paper, inspired by how humans generalize knowledge in open-ended environments,
we propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of
insufficient knowledge. It progressively abstracts trajectories into execution
skills, core skills, and ultimately meta-skills, providing a hierarchical
knowledge structure for long-horizon task planning. To bridge the domain gap,
we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm,
which efficiently leverages skills acquired in offline environments to reduce
the action search space during online tree exploration. Building on HMS, we
propose Mirage-1, a multimodal, cross-platform, plug-and-play GUI agent. To
validate the performance of Mirage-1 in real-world long-horizon scenarios, we
constructed a new benchmark, AndroidLH. Experimental results show that Mirage-1
outperforms previous agents by 32%, 19%, 15%, and 79% on AndroidWorld,
MobileMiniWob++, Mind2Web-Live, and AndroidLH, respectively. Project page:
https://cybertronagent.github.io/Mirage-1.github.io/
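
The abstract names two mechanisms: a skill hierarchy (execution skills, core skills, meta-skills) and the use of those skills to shrink the action search space during online tree search. The following is a minimal Python sketch of that idea only, not the authors' implementation. The `Skill` class, the `suggested_actions` and `prune_action_space` functions, the action strings, and the fallback rule are all hypothetical names and choices made for illustration. The tree search itself (selection, expansion, rollout, backup) is omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Skill:
    """One node in a hypothetical three-level skill hierarchy."""
    name: str
    level: str  # "execution", "core", or "meta" (the levels named in the abstract)
    actions: list[str] = field(default_factory=list)    # concrete GUI actions, if any
    children: list[Skill] = field(default_factory=list)  # lower-level skills it abstracts


def suggested_actions(skill: Skill) -> list[str]:
    """Collect the actions under a skill, depth-first, without duplicates."""
    ordered = list(skill.actions)
    for child in skill.children:
        ordered.extend(suggested_actions(child))
    return list(dict.fromkeys(ordered))  # dict preserves first-seen order


def prune_action_space(available: list[str], skill: Skill | None) -> list[str]:
    """Keep only the available actions that the skill suggests.

    Assumption: if no skill applies, or none of its actions is available,
    fall back to the full action space so the search can still proceed.
    """
    if skill is None:
        return list(available)
    suggested = set(suggested_actions(skill))
    kept = [a for a in available if a in suggested]
    return kept or list(available)


# Hypothetical skills abstracted from offline trajectories.
open_search = Skill("open_search", "execution", actions=["tap:search_icon", "type:query"])
submit_query = Skill("submit_query", "execution", actions=["tap:enter"])
search_item = Skill("search_item", "core", children=[open_search, submit_query])
find_and_act = Skill("find_and_act", "meta", children=[search_item])

available = ["tap:settings", "tap:search_icon", "swipe:up", "tap:enter"]
candidates = prune_action_space(available, find_and_act)
```

In this sketch, a tree-search expansion step would branch only over `candidates` rather than over every action in `available`, which is one plausible reading of how offline skills could reduce the online search space.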