ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL
October 8, 2025
Authors: Egor Cherepanov, Alexey K. Kovalev, Aleksandr I. Panov
cs.AI
Abstract
Real-world robotic agents must act under partial observability and long
horizons, where key cues may appear long before they affect decision making.
However, most modern approaches rely solely on instantaneous information,
without incorporating insights from the past. Standard recurrent or transformer
models struggle with retaining and leveraging long-term dependencies: context
windows truncate history, while naive memory extensions fail under scale and
sparsity. We propose ELMUR (External Layer Memory with Update/Rewrite), a
transformer architecture with structured external memory. Each layer maintains
memory embeddings, interacts with them via bidirectional cross-attention, and
updates them through a Least Recently Used (LRU) memory module using
replacement or convex blending. ELMUR extends effective horizons up to 100,000
times beyond the attention window and achieves a 100% success rate on a
synthetic T-Maze task with corridors up to one million steps. In POPGym, it
outperforms baselines on more than half of the tasks. On MIKASA-Robo
sparse-reward manipulation tasks with visual observations, it nearly doubles
the performance of strong baselines. These results demonstrate that structured,
layer-local external memory offers a simple and scalable approach to decision
making under partial observability.
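
To make the described mechanism concrete, below is a minimal sketch, assuming a PyTorch implementation, of a layer-local memory bank that exchanges information with the token window through bidirectional cross-attention and refreshes slots with an LRU replace-or-blend rule. This is not the authors' code: the class and parameter names (ELMURLayerSketch, num_slots, blend_alpha) and the pooling and blending details are illustrative assumptions based only on the abstract.

```python
# Sketch (not the authors' implementation) of a layer-local external memory:
# a small bank of memory embeddings per layer, bidirectional cross-attention
# between tokens and memory, and an LRU update that overwrites the stalest
# slot while blending the others via a convex combination.
import torch
import torch.nn as nn


class ELMURLayerSketch(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4,
                 num_slots: int = 8, blend_alpha: float = 0.5):
        super().__init__()
        self.read = nn.MultiheadAttention(d_model, n_heads, batch_first=True)   # tokens <- memory
        self.write = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # memory <- tokens
        self.register_buffer("memory", torch.zeros(1, num_slots, d_model))
        self.register_buffer("last_used", torch.zeros(num_slots))
        self.blend_alpha = blend_alpha
        self.step = 0

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, window, d_model); the memory bank is shared across the batch.
        mem = self.memory.expand(tokens.size(0), -1, -1)

        # Read direction: tokens attend to memory slots.
        read_out, _ = self.read(query=tokens, key=mem, value=mem)
        tokens = tokens + read_out

        # Write direction: memory slots attend to tokens to form candidate contents.
        write_out, _ = self.write(query=mem, key=tokens, value=tokens)
        candidate = write_out.mean(dim=0, keepdim=True)  # pool over batch: (1, num_slots, d_model)

        # LRU update: hard-replace the least recently used slot, convex-blend the rest.
        self.step += 1
        lru_slot = int(torch.argmin(self.last_used).item())
        new_mem = self.blend_alpha * self.memory + (1 - self.blend_alpha) * candidate
        new_mem[:, lru_slot] = candidate[:, lru_slot]
        self.memory = new_mem.detach()  # persistent state; no gradient through time in this sketch
        self.last_used[lru_slot] = self.step
        return tokens


# Usage: process a stream of attention windows; memory persists across calls.
layer = ELMURLayerSketch()
for chunk in torch.randn(5, 2, 16, 64):  # 5 windows, batch 2, window 16, d_model 64
    out = layer(chunk)
print(out.shape)  # torch.Size([2, 16, 64])
```

Because the memory buffer persists across forward calls, information written during one window remains available to later windows, which is how the effective horizon can extend far beyond the attention window itself.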