MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Abstract

Recent graphical user interface (GUI) agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips GUI agents built on multimodal large language models (MLLMs) with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and region-of-interest (ROI) level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.
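
To make the four memory operators named in the abstract more concrete, the following is a minimal Python sketch of how step processing, memory compression, episodic writing, and episodic selection might fit together. It is an illustration under stated assumptions, not the authors' implementation: every class, method, and parameter name here (`ToyMemoryController`, `MemoryEntry`, `capacity`, and so on) is hypothetical, and the learned relevance model is replaced by a simple word-overlap heuristic.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# All names below are hypothetical; none are taken from the paper.
BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class MemoryEntry:
    step: int
    summary: str                # textual summary of an interface event
    roi: Optional[BBox] = None  # screenshot region kept as visual evidence


@dataclass
class Episode:
    task: str
    entries: List[MemoryEntry]


def _overlap(a: str, b: str) -> float:
    """Jaccard word overlap; a stand-in for a learned relevance score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


@dataclass
class ToyMemoryController:
    capacity: int = 4  # assumed working-memory budget, in entries
    working: List[MemoryEntry] = field(default_factory=list)
    episodic: List[Episode] = field(default_factory=list)

    def process_step(self, step: int, task: str, event: str,
                     roi: Optional[BBox] = None) -> bool:
        """Step processing: keep the event only if it looks task-relevant."""
        if _overlap(task, event) == 0.0:
            return False
        self.working.append(MemoryEntry(step, event, roi))
        if len(self.working) > self.capacity:
            self.compress()
        return True

    def compress(self) -> None:
        """Memory compression: fold the two oldest entries into one,
        keeping the later ROI if present, otherwise the earlier one."""
        first, second = self.working[0], self.working[1]
        merged = MemoryEntry(second.step,
                             f"{first.summary}; {second.summary}",
                             second.roi or first.roi)
        self.working[:2] = [merged]

    def write_episode(self, task: str) -> None:
        """Episodic writing: store working memory as a reusable trajectory."""
        self.episodic.append(Episode(task, list(self.working)))
        self.working.clear()

    def select_episode(self, task: str) -> Optional[Episode]:
        """Episodic selection: return the most similar past episode, if any."""
        scored = [(_overlap(task, ep.task), ep) for ep in self.episodic]
        scored = [s for s in scored if s[0] > 0.0]
        return max(scored, key=lambda s: s[0])[1] if scored else None
```

In this sketch the controller sits beside the agent and only decides what to keep, merge, store, and retrieve, so the agent backbone itself is never modified. That mirrors the plug-in design the abstract describes, although the real MementoCore operators are learned rather than rule-based.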