Echo-Memory: A Controlled Study of Memory in Action World Models

Abstract

We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.
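The matched-matrix design described above can be sketched as a set of run configurations in which every shared component is held fixed and only the memory mechanism changes. This is a minimal illustration, not the study's actual configuration: the class `RunConfig`, the field names, the placeholder values, and the variant labels are all assumptions, since the abstract names the component families but gives no identifiers or settings. The spatial-summary family is shown as a single entry because the abstract does not list its read-out paths.

```python
from dataclasses import dataclass, replace, asdict

# Hypothetical names throughout: the abstract gives no identifiers or settings.
@dataclass(frozen=True)
class RunConfig:
    # Components the abstract says are shared across all runs (placeholder values).
    backbone: str = "shared_video_diffusion"
    optimizer: str = "shared_optimizer"
    action_repr: str = "shared_camera_actions"
    sampler: str = "shared_sampler"
    eval_pipeline: str = "shared_eval"
    # The only field that varies: how history is stored and read.
    memory: str = "raw_context"

# Memory families named in the abstract; the labels are illustrative.
MEMORY_VARIANTS = [
    "raw_context",
    "compression_memory",
    "spatial_summary",
    "state_space_recurrence",
]

def matched_matrix(base: RunConfig = RunConfig()) -> list[RunConfig]:
    """Return one config per memory variant, copying every other field from base."""
    return [replace(base, memory=m) for m in MEMORY_VARIANTS]

def differing_fields(a: RunConfig, b: RunConfig) -> set[str]:
    """Names of the fields on which two configs disagree."""
    da, db = asdict(a), asdict(b)
    return {k for k in da if da[k] != db[k]}

runs = matched_matrix()
```

Checking that `differing_fields` returns only `memory` for any pair of runs is one simple way to confirm that a comparison is matched in the sense the abstract describes.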