Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills

Abstract

Recent efforts to use multimodal large language models (MLLMs) as GUI agents have yielded promising results. However, these agents still struggle with long-horizon tasks in online environments, primarily because of insufficient knowledge and the inherent gap between offline and online domains. In this paper, inspired by how humans generalize knowledge in open-ended environments, we propose a Hierarchical Multimodal Skills (HMS) module to address the problem of insufficient knowledge. It progressively abstracts trajectories into execution skills, then core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which efficiently leverages skills acquired in offline environments to reduce the action search space during online tree exploration. Building on HMS, we propose Mirage-1, a multimodal, cross-platform, plug-and-play GUI agent. To validate the performance of Mirage-1 in real-world long-horizon scenarios, we constructed a new benchmark, AndroidLH. Experimental results show that Mirage-1 outperforms previous agents by 32%, 19%, 15%, and 79% on AndroidWorld, MobileMiniWob++, Mind2Web-Live, and AndroidLH, respectively. Project page: https://cybertronagent.github.io/Mirage-1.github.io/
