AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents

Abstract

Long-horizon GUI agents are a key step toward real-world deployment, yet effective interaction memory under prevailing paradigms remains under-explored. Replaying full interaction sequences is redundant and amplifies noise, while summaries often erase dependency-critical information and traceability. We present AndroTMem, a diagnostic framework for anchored memory in long-horizon Android GUI agents. Its core benchmark, AndroTMem-Bench, comprises 1,069 tasks with 34,473 interaction steps (avg. 32.1 per task, max. 65). We evaluate agents with TCR (Task Complete Rate), focusing on tasks whose completion requires carrying forward critical intermediate state. AndroTMem-Bench is designed to enforce strong step-to-step causal dependencies, so that sparse yet essential intermediate states determine downstream actions and interaction memory becomes central to evaluation. Across open- and closed-source GUI agents, we observe a consistent pattern: as interaction sequences grow longer, performance drops are driven mainly by within-task memory failures rather than by isolated perception errors or local action mistakes. Guided by this diagnosis, we propose Anchored State Memory (ASM), which represents an interaction sequence as a compact set of causally linked intermediate-state anchors, enabling subgoal-targeted retrieval and attribution-aware decision making. Across multiple settings and 12 evaluated GUI agents, ASM consistently outperforms full-sequence replay and summary-based baselines, improving TCR by 5% to 30.16% and AMS by 4.93% to 24.66%. This indicates that anchored, structured memory effectively mitigates the interaction-memory bottleneck in long-horizon GUI tasks. The code, benchmark, and related resources are publicly available at [https://github.com/CVC2233/AndroTMem](https://github.com/CVC2233/AndroTMem).
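The abstract describes ASM only at a high level. The Python sketch below illustrates one possible way to organize "causally linked intermediate-state anchors" with "subgoal-targeted retrieval". The `Anchor` and `AnchoredStateMemory` classes, their fields, and the ancestor-walk retrieval policy are hypothetical assumptions made for illustration. They are not the authors' implementation, which is in the linked repository.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Anchor:
    """One intermediate-state anchor (hypothetical schema, not the paper's)."""
    anchor_id: int
    step: int       # interaction step at which the state was observed (for traceability)
    subgoal: str    # subgoal this state serves
    content: str    # the dependency-critical value, e.g. a code copied from a screen
    parents: list[int] = field(default_factory=list)  # anchors this one causally depends on


class AnchoredStateMemory:
    """Toy anchor store: keeps only anchors, never the full interaction trajectory."""

    def __init__(self) -> None:
        self._anchors: dict[int, Anchor] = {}

    def add(self, step: int, subgoal: str, content: str,
            parents: list[int] | None = None) -> int:
        parents = list(parents or [])
        # Causal links may only point at anchors that already exist.
        for p in parents:
            if p not in self._anchors:
                raise KeyError(f"unknown parent anchor {p}")
        anchor_id = len(self._anchors)
        self._anchors[anchor_id] = Anchor(anchor_id, step, subgoal, content, parents)
        return anchor_id

    def retrieve(self, subgoal: str) -> list[Anchor]:
        """Return anchors for `subgoal` plus all their causal ancestors, ordered by step."""
        stack = [a.anchor_id for a in self._anchors.values() if a.subgoal == subgoal]
        seen: set[int] = set()
        while stack:
            aid = stack.pop()
            if aid in seen:
                continue
            seen.add(aid)
            stack.extend(self._anchors[aid].parents)
        return sorted((self._anchors[i] for i in seen), key=lambda a: a.step)


# Usage: a late step depends on values produced many steps earlier.
mem = AnchoredStateMemory()
code = mem.add(step=4, subgoal="get_code", content="code=482913")
addr = mem.add(step=11, subgoal="copy_address", content="addr=12 Main St")
mem.add(step=20, subgoal="browse_settings", content="dark mode on")  # unrelated anchor
mem.add(step=27, subgoal="submit_form", content="form ready", parents=[code, addr])

for a in mem.retrieve("submit_form"):
    print(a.step, a.subgoal, a.content)
```

In this sketch, retrieval follows causal parent links instead of matching the subgoal alone. That is what lets step 27 recover the values from steps 4 and 11 without replaying the whole trajectory, while the unrelated anchor from step 20 is left out. The `step` stored on each returned anchor gives a simple form of attribution.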
