MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Abstract

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.
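
The division of labor among the four operators named above (step processing, memory compression, episodic writing, episodic selection) can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the class and method names are hypothetical, and the hand-written rules (a relevance flag, merging the two oldest entries, word overlap) merely stand in for the learned MementoCore controller, whose actual interfaces the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MemoryEntry:
    """One working-memory item: a text summary plus optional ROI evidence."""
    step: int
    summary: str
    roi: Optional[Tuple[int, int, int, int]] = None  # hypothetical crop box (x0, y0, x1, y1)


@dataclass
class Episode:
    """A finished trajectory stored in episodic memory."""
    task: str
    summaries: List[str]


class ToyMemoryController:
    """Hypothetical stand-in for a learned memory controller (not the paper's API)."""

    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity
        self.working: List[MemoryEntry] = []
        self.episodes: List[Episode] = []

    def process_step(self, step: int, summary: str,
                     roi: Optional[Tuple[int, int, int, int]] = None,
                     relevant: bool = True) -> None:
        # Step processing: keep only task-relevant interface events.
        # The `relevant` flag is an assumption replacing a learned decision.
        if not relevant:
            return
        self.working.append(MemoryEntry(step, summary, roi))
        if len(self.working) > self.capacity:
            self.compress()

    def compress(self) -> None:
        # Memory compression: merge the two oldest entries into one.
        # A learned compressor would decide what to keep; this rule is a placeholder.
        first, second = self.working[0], self.working[1]
        merged = MemoryEntry(second.step, f"{first.summary}; {second.summary}", second.roi)
        self.working[:2] = [merged]

    def write_episode(self, task: str) -> None:
        # Episodic writing: store the finished trajectory and reset working memory.
        self.episodes.append(Episode(task, [e.summary for e in self.working]))
        self.working.clear()

    def select_episode(self, task: str) -> Optional[Episode]:
        # Episodic selection: word overlap stands in for learned relevance selection.
        query = set(task.lower().split())
        best, best_score = None, 0
        for episode in self.episodes:
            score = len(query & set(episode.task.lower().split()))
            if score > best_score:
                best, best_score = episode, score
        return best
```

In this sketch the GUI agent itself is never modified: it would only receive the contents of `working` and any selected episode as extra context, which mirrors the plug-in design described above.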