**AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents**

Abstract

Long-horizon GUI agents are a key step toward real-world deployment, yet effective interaction memory under prevailing paradigms remains under-explored. Replaying full interaction sequences is redundant and amplifies noise, while summaries often erase dependency-critical information and traceability. We present AndroTMem, a diagnostic framework for anchored memory in long-horizon Android GUI agents. Its core benchmark, AndroTMem-Bench, comprises 1,069 tasks with 34,473 interaction steps (avg. 32.1 per task, max. 65). We evaluate agents with TCR (Task Complete Rate), focusing on tasks whose completion requires carrying forward critical intermediate state; AndroTMem-Bench is designed to enforce strong step-to-step causal dependencies, making sparse yet essential intermediate states decisive for downstream actions and centering interaction memory in evaluation. Across open- and closed-source GUI agents, we observe a consistent pattern: as interaction sequences grow longer, performance drops are driven mainly by within-task memory failures, not isolated perception errors or local action mistakes. Guided by this diagnosis, we propose Anchored State Memory (ASM), which represents interaction sequences as a compact set of causally linked intermediate-state anchors to enable subgoal-targeted retrieval and attribution-aware decision making. Across multiple settings and 12 evaluated GUI agents, ASM consistently outperforms full-sequence replay and summary-based baselines, improving TCR by 5%-30.16% and AMS by 4.93%-24.66%, indicating that anchored, structured memory effectively mitigates the interaction-memory bottleneck in long-horizon GUI tasks. The code, benchmark, and related resources are publicly available at [https://github.com/CVC2233/AndroTMem](https://github.com/CVC2233/AndroTMem).
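The abstract describes ASM only at a high level: anchors are intermediate states, they are causally linked, and retrieval is targeted at a subgoal. The sketch below is one possible reading of that description, not the authors' implementation. The `Anchor` fields, the `AnchoredMemory` class, and the example subgoal names are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Anchor:
    """One intermediate state worth carrying forward (hypothetical schema)."""
    anchor_id: int
    step: int        # interaction step at which the state was observed
    subgoal: str     # subgoal this state belongs to
    content: str     # the state itself, e.g. a value read from the screen
    depends_on: list[int] = field(default_factory=list)  # causal parent anchors


class AnchoredMemory:
    """Stores anchors and retrieves them by subgoal together with their causal ancestors."""

    def __init__(self) -> None:
        self._anchors: dict[int, Anchor] = {}

    def add(self, step: int, subgoal: str, content: str, depends_on=()) -> int:
        anchor_id = len(self._anchors)
        self._anchors[anchor_id] = Anchor(
            anchor_id, step, subgoal, content, list(depends_on)
        )
        return anchor_id

    def retrieve(self, subgoal: str) -> list[Anchor]:
        """Return the anchors for `subgoal` plus every anchor they depend on."""
        pending = [a.anchor_id for a in self._anchors.values() if a.subgoal == subgoal]
        seen: set[int] = set()
        while pending:
            anchor_id = pending.pop()
            if anchor_id in seen:
                continue
            seen.add(anchor_id)
            # Follow causal links so the origin of each state stays traceable.
            pending.extend(self._anchors[anchor_id].depends_on)
        return sorted((self._anchors[i] for i in seen), key=lambda a: a.step)


# Illustrative trajectory: only a few of many steps produce anchors.
memory = AnchoredMemory()
price = memory.add(step=4, subgoal="find_price", content="price=19.99")
memory.add(step=9, subgoal="check_weather", content="rain tomorrow")
memory.add(step=17, subgoal="send_message", content="recipient=Alex", depends_on=[price])

context = memory.retrieve("send_message")
print([a.content for a in context])  # ['price=19.99', 'recipient=Alex']
```

In this toy example, the agent working on the `send_message` subgoal receives the recipient and the earlier price it depends on, while the unrelated weather anchor is left out. That contrasts with full-sequence replay, which would pass every step, and with a free-text summary, which might drop the exact price or lose track of where it came from.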