Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills

Abstract

Recent efforts to use Multimodal Large Language Models (MLLMs) as GUI agents have yielded promising results. However, these agents still struggle with long-horizon tasks in online environments, primarily because of insufficient knowledge and the inherent gap between offline and online domains. In this paper, inspired by how humans generalize knowledge in open-ended environments, we propose a Hierarchical Multimodal Skills (HMS) module to address the knowledge gap. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which uses skills acquired in offline environments to reduce the action search space during online tree exploration. Building on HMS, we propose Mirage-1, a multimodal, cross-platform, plug-and-play GUI agent. To validate the performance of Mirage-1 in real-world long-horizon scenarios, we constructed a new benchmark, AndroidLH. Experimental results show that Mirage-1 outperforms previous agents by 32%, 19%, 15%, and 79% on AndroidWorld, MobileMiniWob++, Mind2Web-Live, and AndroidLH, respectively. Project page: https://cybertronagent.github.io/Mirage-1.github.io/
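
The abstract does not specify how skills are stored or how they constrain the search, so the following is only a minimal illustrative sketch, not the authors' implementation. It assumes a three-level skill tree (execution, core, meta) and a simple filter that keeps only the candidate actions a skill suggests, which is one way the action search space could be narrowed before tree expansion. The names `Skill`, `leaf_actions`, `prune_actions`, and the action strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """One node in a hypothetical three-level skill hierarchy."""
    name: str
    level: str  # assumed labels: "execution", "core", or "meta"
    actions: list[str] = field(default_factory=list)        # concrete GUI actions
    children: list["Skill"] = field(default_factory=list)   # lower-level skills


def leaf_actions(skill: Skill) -> set[str]:
    """Collect every concrete action reachable from a skill."""
    found = set(skill.actions)
    for child in skill.children:
        found |= leaf_actions(child)
    return found


def prune_actions(candidates: list[str], skill: Skill) -> list[str]:
    """Keep candidates the skill suggests; fall back to all if none match."""
    allowed = leaf_actions(skill)
    kept = [a for a in candidates if a in allowed]
    return kept or candidates


# Hypothetical skills abstracted from offline trajectories.
open_app = Skill("open_contacts", "execution", actions=["tap:Contacts"])
fill_form = Skill("fill_form", "execution", actions=["type:name", "tap:Save"])
add_contact = Skill("add_contact", "core", children=[open_app, fill_form])
meta = Skill("manage_contacts", "meta", children=[add_contact])

candidates = ["tap:Contacts", "tap:Settings", "swipe:up", "tap:Save"]
print(prune_actions(candidates, meta))
```

In this sketch the fallback to the full candidate list is a design assumption: it keeps the search from stalling when no stored skill applies to the current screen.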
