AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents

Abstract

Long-horizon GUI agents are a key step toward real-world deployment, yet effective interaction memory under prevailing paradigms remains under-explored. Replaying full interaction sequences introduces redundancy and amplifies noise, while summaries often erase dependency-critical information and traceability. We present AndroTMem, a diagnostic framework for anchored memory in long-horizon Android GUI agents. Its core benchmark, AndroTMem-Bench, comprises 1,069 tasks with 34,473 interaction steps (average 32.1 per task, maximum 65). We evaluate agents with TCR (Task Complete Rate), focusing on tasks whose completion requires carrying forward critical intermediate state. AndroTMem-Bench is designed to enforce strong step-to-step causal dependencies, which makes sparse yet essential intermediate states decisive for downstream actions and places interaction memory at the center of evaluation. Across open- and closed-source GUI agents, we observe a consistent pattern: as interaction sequences grow longer, performance drops are driven mainly by within-task memory failures rather than by isolated perception errors or local action mistakes. Guided by this diagnosis, we propose Anchored State Memory (ASM), which represents an interaction sequence as a compact set of causally linked intermediate-state anchors to enable subgoal-targeted retrieval and attribution-aware decision making. Across multiple settings and 12 evaluated GUI agents, ASM consistently outperforms full-sequence replay and summary-based baselines, improving TCR by 5% to 30.16% and the Anchored Memory Score (AMS) by 4.93% to 24.66%. These results indicate that anchored, structured memory effectively mitigates the interaction-memory bottleneck in long-horizon GUI tasks. The code, benchmark, and related resources are publicly available at [https://github.com/CVC2233/AndroTMem](https://github.com/CVC2233/AndroTMem).
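To make the idea of causally linked anchors more concrete, the following is a minimal illustrative sketch, not the authors' implementation. The `Anchor` schema, the `AnchoredMemory` class, its method names, and the sample task are all hypothetical; the abstract does not specify how ASM stores or retrieves anchors. The sketch only shows one plausible way to keep a sparse set of intermediate states, retrieve them by subgoal, and trace which earlier states a later one depends on.

```python
from dataclasses import dataclass, field


@dataclass
class Anchor:
    """One intermediate state worth carrying forward (hypothetical schema)."""
    step: int        # index of the interaction step that produced the state
    subgoal: str     # subgoal the state belongs to
    content: str     # the state itself, e.g. a value read from the screen
    depends_on: list[int] = field(default_factory=list)  # steps of causal parents


class AnchoredMemory:
    """Keeps a sparse set of anchors instead of the full interaction sequence."""

    def __init__(self) -> None:
        self._anchors: dict[int, Anchor] = {}

    def add(self, anchor: Anchor) -> None:
        self._anchors[anchor.step] = anchor

    def retrieve(self, subgoal: str) -> list[Anchor]:
        """Return the anchors recorded for one subgoal, ordered by step."""
        hits = [a for a in self._anchors.values() if a.subgoal == subgoal]
        return sorted(hits, key=lambda a: a.step)

    def trace(self, step: int) -> list[Anchor]:
        """Follow causal links back from one anchor; ancestors come first."""
        seen: set[int] = set()
        order: list[Anchor] = []

        def visit(s: int) -> None:
            if s in seen or s not in self._anchors:
                return
            seen.add(s)
            for parent in self._anchors[s].depends_on:
                visit(parent)
            order.append(self._anchors[s])

        visit(step)
        return order


# Hypothetical three-subgoal task: each later anchor depends on earlier ones.
memory = AnchoredMemory()
memory.add(Anchor(step=3, subgoal="find_flight", content="flight CA123"))
memory.add(Anchor(step=17, subgoal="book_hotel", content="check-in 2 May", depends_on=[3]))
memory.add(Anchor(step=29, subgoal="send_summary", content="draft email", depends_on=[3, 17]))

print([a.step for a in memory.trace(29)])  # [3, 17, 29]
```

In this sketch, `retrieve` stands in for subgoal-targeted retrieval and `trace` for attributing a decision to the earlier states it relies on; a real system would also need a policy for deciding which steps become anchors, which is not modeled here.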

