MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Abstract

MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step records, leading to prompt explosion and dilution of critical cross-app facts. To address this, we introduce MemGUI-Agent, an end-to-end long-horizon mobile GUI agent with proactive context management. MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. To make proactive context management learnable across model scales, we construct MemGUI-3K, a 2,956-trajectory dataset with full ConAct annotations for supervised training and offline analysis. Training an 8B model on MemGUI-3K produces MemGUI-8B-SFT, an 8B MemGUI-Agent that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark. Code, data, and trained models will be released at https://memgui-agent.github.io/.
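
To make the three context fields concrete, the following is a minimal sketch of how a ConAct-style context might be represented and updated. It is an illustration under stated assumptions, not the authors' implementation: the names `ConActContext`, `PolicyOutput`, `apply_step`, and `render_prompt` are hypothetical, and the abstract does not specify field formats or how the policy's output is structured.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConActContext:
    """Hypothetical container for the three context fields named in the abstract."""
    folded_action_history: str = ""  # compact summary of earlier actions
    folded_ui_state: str = ""        # UI facts worth keeping across steps and apps
    recent_step_record: str = ""     # record of the most recent step only


@dataclass(frozen=True)
class PolicyOutput:
    """Hypothetical policy output: one UI action plus the context updates."""
    ui_action: str
    new_action_history: str
    new_ui_state: str
    step_record: str


def apply_step(ctx: ConActContext, out: PolicyOutput) -> ConActContext:
    # Assumption: the policy rewrites each field itself, so the context is
    # overwritten rather than appended to (unlike a ReAct-style log).
    return replace(
        ctx,
        folded_action_history=out.new_action_history,
        folded_ui_state=out.new_ui_state,
        recent_step_record=out.step_record,
    )


def render_prompt(task: str, ctx: ConActContext) -> str:
    # The prompt holds three fields, so it does not gain one entry per step.
    return "\n".join([
        f"Task: {task}",
        f"Folded action history: {ctx.folded_action_history}",
        f"Folded UI state: {ctx.folded_ui_state}",
        f"Recent step: {ctx.recent_step_record}",
    ])


if __name__ == "__main__":
    ctx = ConActContext()
    ctx = apply_step(ctx, PolicyOutput(
        "tap(search)", "Opened shop app", "", "tapped search box"))
    ctx = apply_step(ctx, PolicyOutput(
        "open(notes)", "Opened shop app; found item price",
        "price=$12.99", "switched to notes app"))
    print(render_prompt("Copy the price into a note", ctx))
```

In this sketch the price observed in the first app survives the app switch because the policy chose to write it into the folded UI state, while the verbatim record of the earlier step is dropped.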