Echo-Memory: A Controlled Study of Memory in Action World Models

Abstract

We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.
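As a rough illustration of the matched design described above, the sketch below enumerates a matrix in which the shared components stay fixed and only the memory mechanism and evaluation branch vary. All identifiers (`SHARED`, `MEMORY_VARIANTS`, `EVAL_BRANCHES`, `Run`, `build_matrix`, `branches_disagree`) are hypothetical names chosen for this example, not the authors' code. It treats spatial summaries as a single family even though the study varies their read-out paths, and it contains no results.

```python
from dataclasses import dataclass
from itertools import product

# Components the abstract says are held fixed across all runs.
SHARED = (
    "video_diffusion_backbone",
    "optimizer",
    "camera_action_representation",
    "sampler",
    "evaluation_pipeline",
)

# Memory families named in the abstract (identifiers are hypothetical).
MEMORY_VARIANTS = (
    "raw_context",
    "compression_memory",
    "spatial_summary",
    "state_space_recurrence",
)

# The three evaluation branches named in the abstract.
EVAL_BRANCHES = (
    "replay_quality",
    "in_domain_loop_revisit",
    "open_domain_return",
)


@dataclass(frozen=True)
class Run:
    """One cell of the matrix: a memory variant scored on one branch."""
    memory: str
    branch: str
    shared: tuple = SHARED  # identical for every run by construction


def build_matrix():
    """Enumerate every (memory variant, evaluation branch) pair."""
    return [Run(m, b) for m, b in product(MEMORY_VARIANTS, EVAL_BRANCHES)]


def branches_disagree(scores):
    """Return True if the branches rank the memory variants differently.

    `scores` maps branch -> {memory: score}, with higher meaning better
    (an assumption of this sketch).
    """
    rankings = {
        tuple(sorted(per_memory, key=per_memory.get, reverse=True))
        for per_memory in scores.values()
    }
    return len(rankings) > 1
```

A check like `branches_disagree` is one simple way to test whether replay quality ranks the memory variants the same way as the revisit and return branches. The study's actual comparison procedure may differ.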