MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Abstract

MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet they remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step records; the prompt grows with every step, and critical cross-app facts are diluted. To address this, we introduce MemGUI-Agent, an end-to-end long-horizon mobile GUI agent with proactive context management. MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: a folded action history, a folded UI state, and the most recent step record. Together these preserve critical UI facts while keeping the context compact. To make proactive context management learnable across model scales, we construct MemGUI-3K, a dataset of 2,956 trajectories with full ConAct annotations for supervised training and offline analysis. Training an 8B model on MemGUI-3K produces MemGUI-8B-SFT, an 8B MemGUI-Agent that achieves the best performance among 8B models trained on open data on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark. Code, data, and trained models will be released at https://memgui-agent.github.io/.
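The contrast between passive history accumulation and the three ConAct context fields can be illustrated with a minimal Python sketch. It is not the released implementation: the class names, field names, prompt layout, and the `scripted_output` stand-in for a model call are all hypothetical, chosen only to mirror the three fields named in the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepRecord:
    """One step as emitted by the policy (hypothetical field names)."""
    thought: str
    ui_action: str      # e.g. "tap(search_box)"
    observation: str    # short text description of the resulting screen


@dataclass
class ConActContext:
    """The three structured context fields named in the abstract."""
    folded_action_history: str = ""
    folded_ui_state: str = ""
    recent_step: Optional[StepRecord] = None


@dataclass
class PolicyOutput:
    """Assumed output shape: a UI action plus context-management actions."""
    step: StepRecord
    new_action_history: str   # policy-written summary of actions so far
    new_ui_state: str         # policy-written summary of facts worth keeping


def apply_output(ctx: ConActContext, out: PolicyOutput) -> ConActContext:
    """Overwrite the context with what the policy chose to keep."""
    return ConActContext(
        folded_action_history=out.new_action_history,
        folded_ui_state=out.new_ui_state,
        recent_step=out.step,
    )


def render_conact_prompt(task: str, ctx: ConActContext) -> str:
    """Build a prompt from the three fields only; older steps are not replayed."""
    recent = ctx.recent_step
    recent_text = (
        f"{recent.thought} | {recent.ui_action} | {recent.observation}"
        if recent is not None else "(none)"
    )
    return (
        f"Task: {task}\n"
        f"Action history (folded): {ctx.folded_action_history}\n"
        f"UI state (folded): {ctx.folded_ui_state}\n"
        f"Recent step: {recent_text}\n"
    )


def render_react_prompt(task: str, steps: List[StepRecord]) -> str:
    """ReAct-style baseline: every past step record is appended verbatim."""
    lines = [f"Task: {task}"]
    for s in steps:
        lines.append(f"{s.thought} | {s.ui_action} | {s.observation}")
    return "\n".join(lines) + "\n"


def scripted_output(i: int) -> PolicyOutput:
    """Stand-in for a model call; a real policy would generate these fields."""
    return PolicyOutput(
        step=StepRecord(f"thought {i}", f"tap(item_{i})", f"screen {i}"),
        new_action_history=f"{i} steps done; opened Shop, searched for mug",
        new_ui_state="Shop: mug price = $12.99",
    )
```

In this sketch the ReAct-style prompt grows with the number of steps, while the ConAct-style prompt stays roughly constant in size and still carries the cross-app fact (here, the price) that the policy chose to fold into the UI state.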