Echo-Memory: A Controlled Study of Memory in Action World Models

Abstract

We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.