Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills

Abstract

Recent efforts to use Multimodal Large Language Models (MLLMs) as GUI agents have yielded promising results. However, these agents still struggle with long-horizon tasks in online environments, primarily because of insufficient knowledge and the inherent gap between offline and online domains. In this paper, inspired by how humans generalize knowledge in open-ended environments, we propose a Hierarchical Multimodal Skills (HMS) module to address the problem of insufficient knowledge. HMS progressively abstracts trajectories into execution skills, then core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which efficiently leverages skills acquired in offline environments to reduce the action search space during online tree exploration. Building on HMS, we propose Mirage-1, a multimodal, cross-platform, plug-and-play GUI agent. To validate the performance of Mirage-1 in real-world long-horizon scenarios, we constructed a new benchmark, AndroidLH. Experimental results show that Mirage-1 outperforms previous agents by 32%, 19%, 15%, and 79% on AndroidWorld, MobileMiniWob++, Mind2Web-Live, and AndroidLH, respectively. Project page: https://cybertronagent.github.io/Mirage-1.github.io/
