AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents

Abstract

Long-horizon GUI agents are a key step toward real-world deployment, yet effective interaction memory under prevailing paradigms remains under-explored. Replaying the full interaction sequence is redundant and amplifies noise, while summarization often erases dependency-critical information and traceability. We present AndroTMem, a diagnostic framework for anchored memory in long-horizon Android GUI agents. Its core benchmark, AndroTMem-Bench, comprises 1,069 tasks with 34,473 interaction steps (avg. 32.1 per task, max. 65). We evaluate agents with TCR (Task Complete Rate), focusing on tasks whose completion requires carrying forward critical intermediate state. AndroTMem-Bench is designed to enforce strong step-to-step causal dependencies, so that sparse yet essential intermediate states become decisive for downstream actions and interaction memory sits at the center of the evaluation. Across open- and closed-source GUI agents, we observe a consistent pattern: as interaction sequences grow longer, performance drops are driven mainly by within-task memory failures rather than by isolated perception errors or local action mistakes. Guided by this diagnosis, we propose Anchored State Memory (ASM), which represents an interaction sequence as a compact set of causally linked intermediate-state anchors to enable subgoal-targeted retrieval and attribution-aware decision making. Across multiple settings and 12 evaluated GUI agents, ASM consistently outperforms full-sequence replay and summary-based baselines, improving TCR by 5% to 30.16% and AMS by 4.93% to 24.66%. These results indicate that anchored, structured memory effectively mitigates the interaction-memory bottleneck in long-horizon GUI tasks. The code, benchmark, and related resources are publicly available at [https://github.com/CVC2233/AndroTMem](https://github.com/CVC2233/AndroTMem).
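To make the idea of anchored memory concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes an anchor can be modeled as a record holding a step index, a subgoal label, the carried state, and the steps it causally depends on. Every name here (`Anchor`, `AnchoredMemory`, `retrieve`, `task_complete_rate`) and the example task are hypothetical. Retrieval returns the anchors tagged with the current subgoal together with their causal ancestors, which is one plausible reading of "subgoal-targeted retrieval"; TCR is computed as the fraction of completed tasks, which is an assumed formula that the abstract does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Anchor:
    """One intermediate state worth carrying forward (hypothetical schema)."""
    step: int                      # interaction step that produced the state
    subgoal: str                   # subgoal the state belongs to
    content: str                   # the state itself, e.g. a value read on screen
    depends_on: list[int] = field(default_factory=list)  # steps of causal parents


class AnchoredMemory:
    """Stores anchors keyed by step and retrieves them per subgoal."""

    def __init__(self) -> None:
        self._anchors: dict[int, Anchor] = {}

    def add(self, anchor: Anchor) -> None:
        self._anchors[anchor.step] = anchor

    def retrieve(self, subgoal: str) -> list[Anchor]:
        """Return anchors for `subgoal` plus their causal ancestors, in step order."""
        stack = [a.step for a in self._anchors.values() if a.subgoal == subgoal]
        seen: set[int] = set()
        while stack:
            step = stack.pop()
            if step in seen or step not in self._anchors:
                continue
            seen.add(step)
            stack.extend(self._anchors[step].depends_on)
        return [self._anchors[s] for s in sorted(seen)]


def task_complete_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks marked complete (assumed definition of TCR)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical three-subgoal task: only the flight anchor matters for sharing.
memory = AnchoredMemory()
memory.add(Anchor(step=3, subgoal="find_flight", content="flight=KE081"))
memory.add(Anchor(step=9, subgoal="find_hotel", content="hotel=Grand"))
memory.add(Anchor(step=17, subgoal="share_itinerary", content="draft ready",
                  depends_on=[3]))

relevant_steps = [a.step for a in memory.retrieve("share_itinerary")]
print(relevant_steps)
print(task_complete_rate([True, False, True, True]))
```

In this toy example the hotel anchor at step 9 is left out of the retrieved context because nothing in the `share_itinerary` subgoal depends on it. Full-sequence replay would instead pass every step to the agent, and a free-form summary might drop the flight number.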

