MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Abstract

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle on long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without fine-tuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.
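To make the four memory operators named in the abstract more concrete, the following is a minimal illustrative sketch, not the authors' implementation. Every class, method, and parameter name here (`MemoryController`, `process_step`, `compress`, `write_episode`, `select_episodes`, `capacity`) is a hypothetical choice for illustration. The learned MementoCore components are replaced by simple hand-written rules: a boolean relevance flag stands in for learned step selection, string concatenation for learned compression, and token overlap for learned relevance selection.

```python
from dataclasses import dataclass, field
from typing import Optional

# All names are hypothetical; hand-written rules stand in for the learned
# MementoCore operators so that the sketch runs on its own.
BBox = tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class MemoryEntry:
    step: int
    summary: str                # textual summary of an interface event
    roi: Optional[BBox] = None  # screenshot region kept as visual evidence


@dataclass
class Episode:
    task: str
    entries: list[MemoryEntry]


@dataclass
class MemoryController:
    capacity: int = 4
    working: list[MemoryEntry] = field(default_factory=list)
    episodic: list[Episode] = field(default_factory=list)

    def process_step(self, step: int, summary: str,
                     roi: Optional[BBox] = None, relevant: bool = True) -> None:
        """Step processing: store only events flagged as task-relevant."""
        if relevant:
            self.working.append(MemoryEntry(step, summary, roi))
        if len(self.working) > self.capacity:
            self.compress()

    def compress(self) -> None:
        """Memory compression: merge the oldest entries until within capacity."""
        while len(self.working) > self.capacity and len(self.working) >= 2:
            a, b = self.working[0], self.working[1]
            merged = MemoryEntry(b.step, f"{a.summary}; {b.summary}", b.roi or a.roi)
            self.working[:2] = [merged]

    def write_episode(self, task: str) -> None:
        """Episodic writing: archive working memory as a reusable trajectory."""
        self.episodic.append(Episode(task, list(self.working)))
        self.working.clear()

    def select_episodes(self, task: str, k: int = 1) -> list[Episode]:
        """Episodic selection: rank stored episodes by token overlap with the task."""
        query = set(task.lower().split())
        scored = []
        for ep in self.episodic:
            tokens = set(ep.task.lower().split())
            union = query | tokens
            score = len(query & tokens) / len(union) if union else 0.0
            if score > 0:
                scored.append((score, ep))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [ep for _, ep in scored[:k]]


if __name__ == "__main__":
    mc = MemoryController(capacity=2)
    mc.process_step(0, "opened settings", roi=(0, 0, 100, 40))
    mc.process_step(1, "ad banner appeared", relevant=False)  # dropped
    mc.process_step(2, "toggled dark mode", roi=(10, 50, 90, 80))
    mc.process_step(3, "confirmed dialog")  # exceeds capacity, triggers compress
    mc.write_episode("enable dark mode in settings")
    for ep in mc.select_episodes("turn on dark mode"):
        print(ep.task, [e.summary for e in ep.entries])
```

In this sketch the GUI agent backbone is never touched: the controller only decides what to keep, merge, archive, and retrieve, which mirrors the plug-in design the abstract describes. In the actual framework each of these decisions would be made by a learned operator, not by the fixed rules used here.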