Echo-Memory: A Controlled Study of Memory in Action World Models

Abstract

We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, a text prompt, and a camera-action sequence, but their central failure is often one of memory rather than local image synthesis: after the camera leaves and returns, the scene or a salient object may silently change. Existing memory designs are hard to compare because their gains are entangled with differences in backbone, training, retrieval, and evaluation. Echo-Memory fixes the action-to-video interface and varies only how the generator stores and reads history. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The three branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. First, raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Second, compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action-conditioned world models beyond isolated replay metrics.
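
The idea behind a loop-revisit probe can be illustrated with a minimal sketch. It assumes, purely for illustration, that camera actions are per-step yaw deltas in degrees and that return consistency is scored as pixel mean squared error against the first frame. The function names (`loop_actions`, `revisit_indices`, `return_error`), the action encoding, and the metric are all hypothetical and are not taken from the Echo-Memory protocol.

```python
import numpy as np


def loop_actions(step_deg: float, n_out: int) -> np.ndarray:
    """Yaw deltas that pan away for n_out steps and then pan back (assumed encoding)."""
    out = np.full(n_out, step_deg)
    return np.concatenate([out, -out])


def revisit_indices(actions: np.ndarray, tol: float = 1e-6) -> list:
    """Frame indices (after frame 0) whose cumulative yaw matches the start pose."""
    yaw = np.cumsum(actions)
    # Frame i + 1 is the frame produced after applying action i.
    return [i + 1 for i, y in enumerate(yaw) if abs(y) < tol]


def return_error(frames: np.ndarray, actions: np.ndarray) -> float:
    """Mean squared error between frame 0 and every frame that revisits its pose.

    frames has shape (len(actions) + 1, H, W, C); lower values mean the
    generator reproduced the original view more faithfully on return.
    """
    idx = revisit_indices(actions)
    if not idx:
        raise ValueError("action sequence never returns to the start pose")
    ref = frames[0].astype(np.float64)
    return float(np.mean([(frames[i].astype(np.float64) - ref) ** 2 for i in idx]))
```

A score of this kind looks only at the frames where the camera is back at its starting pose, so it can disagree with a replay metric averaged over all frames, which is the kind of disagreement the three-branch protocol is designed to expose.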