MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Abstract

MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step records, leading to prompt explosion and dilution of critical cross-app facts. To address this, we introduce MemGUI-Agent, an end-to-end long-horizon mobile GUI agent with proactive context management. MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. To make proactive context management learnable across model scales, we construct MemGUI-3K, a 2,956-trajectory dataset with full ConAct annotations for supervised training and offline analysis. Training an 8B model on MemGUI-3K produces MemGUI-8B-SFT, an 8B MemGUI-Agent that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark. Code, data, and trained models will be released at https://memgui-agent.github.io/.
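
To make the Context-as-Action idea concrete, the following is a minimal sketch of how three structured context fields and a context-management action might sit alongside UI actions in one action space. It is not the authors' implementation: every name here (`Context`, `StepRecord`, `UIAction`, `FoldContext`, `apply_action`, `render_prompt`) is hypothetical, and the sketch assumes that a fold action carries summaries written by the policy and overwrites the folded fields wholesale, a detail the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class StepRecord:
    """One recent step: the UI action taken and what was observed (hypothetical)."""
    action: str
    observation: str


@dataclass
class Context:
    """The three structured context fields named in the abstract."""
    folded_action_history: str = ""
    folded_ui_state: str = ""
    recent_steps: List[StepRecord] = field(default_factory=list)


@dataclass
class UIAction:
    """An ordinary UI action such as a tap or app switch (hypothetical shape)."""
    name: str
    target: str


@dataclass
class FoldContext:
    """A context action emitted by the same policy (assumed shape).

    The summaries are assumed to be written by the policy itself.
    """
    action_summary: str
    ui_state_summary: str
    keep_last: int = 1  # how many recent step records survive the fold


Action = Union[UIAction, FoldContext]


def apply_action(ctx: Context, action: Action, observation: str = "") -> Context:
    """Return the next context after a UI action or a context action."""
    if isinstance(action, FoldContext):
        # Assumption: folding replaces both folded fields and trims recent steps.
        kept = ctx.recent_steps[-action.keep_last:] if action.keep_last > 0 else []
        return Context(action.action_summary, action.ui_state_summary, list(kept))
    record = StepRecord(f"{action.name}({action.target})", observation)
    return Context(
        ctx.folded_action_history,
        ctx.folded_ui_state,
        ctx.recent_steps + [record],
    )


def render_prompt(ctx: Context) -> str:
    """Serialize the context into the text a policy would be prompted with."""
    lines = [
        f"Folded action history: {ctx.folded_action_history or '(empty)'}",
        f"Folded UI state: {ctx.folded_ui_state or '(empty)'}",
    ]
    for r in ctx.recent_steps:
        lines.append(f"Recent step: {r.action} -> {r.observation}")
    return "\n".join(lines)


if __name__ == "__main__":
    ctx = Context()
    ctx = apply_action(ctx, UIAction("tap", "Weather app"), "Forecast: 18C in Paris")
    ctx = apply_action(ctx, UIAction("open", "Notes app"), "Empty note")
    ctx = apply_action(
        ctx,
        FoldContext("Opened Weather, then Notes", "Paris forecast is 18C", keep_last=1),
    )
    print(render_prompt(ctx))
```

In this sketch the cross-app fact (the forecast) survives in the folded UI state after the Weather step record is dropped, so the prompt stays bounded instead of growing with every step as it would under passive history appending.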