MementoGUI: Learning Autonomous Multimodal Memory Control for Long-Horizon GUI Agents

Abstract

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.
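
To make the division of labor between the two memories concrete, the following is a minimal illustrative sketch and not the MementoCore implementation. Every name in it (`MemoryEntry`, `WorkingMemory`, `EpisodicMemory`, `capacity`, the `relevant` flag) is hypothetical. The learned operators are replaced by simple stand-ins: step processing takes a precomputed relevance flag, compression merges the two oldest entries, and episodic selection uses word-overlap (Jaccard) similarity in place of learned relevance selection.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # hypothetical ROI format: (x0, y0, x1, y1)


@dataclass
class MemoryEntry:
    step: int
    summary: str                # textual summary of the interface event
    roi: Optional[Box] = None   # localized visual evidence, if any was kept


@dataclass
class WorkingMemory:
    capacity: int = 4           # assumed budget; not specified in the abstract
    entries: List[MemoryEntry] = field(default_factory=list)

    def process_step(self, step: int, summary: str,
                     roi: Optional[Box], relevant: bool) -> None:
        """Step processing: keep only task-relevant events.

        `relevant` stands in for the decision a learned controller would make.
        """
        if relevant:
            self.entries.append(MemoryEntry(step, summary, roi))
        if len(self.entries) > self.capacity:
            self.compress()

    def compress(self) -> None:
        """Memory compression stand-in: merge the two oldest entries,
        keeping the ROI of the more recent one."""
        first, second = self.entries[0], self.entries[1]
        merged = MemoryEntry(second.step,
                             f"{first.summary}; {second.summary}",
                             second.roi)
        self.entries = [merged] + self.entries[2:]


@dataclass
class EpisodicMemory:
    episodes: List[Tuple[str, str]] = field(default_factory=list)

    def write(self, task: str, memory: WorkingMemory) -> None:
        """Episodic writing: store the task with a trajectory summary."""
        trajectory = " -> ".join(e.summary for e in memory.entries)
        self.episodes.append((task, trajectory))

    def select(self, task: str, k: int = 1) -> List[Tuple[str, str]]:
        """Episodic selection stand-in: rank stored episodes by Jaccard
        word overlap between task descriptions."""
        query = set(task.lower().split())

        def score(episode: Tuple[str, str]) -> float:
            words = set(episode[0].lower().split())
            return len(query & words) / max(len(query | words), 1)

        return sorted(self.episodes, key=score, reverse=True)[:k]
```

In this sketch a GUI agent backbone would only read from `WorkingMemory.entries` and the episodes returned by `EpisodicMemory.select` when forming its prompt, which mirrors the plug-in property described above: the memory controller changes, while the backbone is left untouched.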