Latent Spatial Memory for Video World Models

Abstract

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is computationally expensive, because it requires repeated rendering and VAE encoding, and inherently lossy, because the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space and thereby avoids pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57× faster end-to-end video generation and a 55× reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
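To make the two memory operations named above concrete, the following is a minimal NumPy sketch of depth-guided back-projection and latent-space warping. It is an illustration under stated assumptions, not Mirage's implementation: the function names `backproject_latents` and `warp_latents` are hypothetical, the latent is assumed to be an (H, W, C) grid with one depth value per token, the intrinsics `K` are assumed to be a pinhole matrix already scaled to the latent resolution, and occlusion is resolved with a simple nearest-point rule that the abstract does not specify.

```python
import numpy as np

def backproject_latents(latent, depth, K, cam_to_world):
    """Lift an (H, W, C) latent grid to 3D world points using per-token depth."""
    H, W, C = latent.shape
    # Token-center coordinates: u runs along the width, v along the height.
    v, u = np.meshgrid(np.arange(H) + 0.5, np.arange(W) + 0.5, indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Camera-space points: X = d * K^{-1} [u, v, 1]^T.
    cam = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)
    # Rigid transform from camera to world coordinates.
    world = cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    return world, latent.reshape(-1, C)

def warp_latents(points, feats, K, world_to_cam, H, W):
    """Project cached points into a target view; the nearest point wins per cell."""
    C = feats.shape[1]
    cam = points @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    z = cam[:, 2]
    front = z > 1e-6  # discard points behind the target camera
    cam, z, feats = cam[front], z[front], feats[front]
    proj = cam @ K.T
    u = np.floor(proj[:, 0] / z).astype(int)
    v = np.floor(proj[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    cell, z, feats = (v * W + u)[inside], z[inside], feats[inside]
    # Sort by cell, then by depth, so the first entry of each cell is the nearest.
    order = np.lexsort((z, cell))
    cell, feats = cell[order], feats[order]
    _, first = np.unique(cell, return_index=True)
    out = np.zeros((H * W, C), dtype=feats.dtype)
    mask = np.zeros(H * W, dtype=bool)
    out[cell[first]] = feats[first]
    mask[cell[first]] = True
    return out.reshape(H, W, C), mask.reshape(H, W)

# Usage: cache one view, then query the cache from the same pose.
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 6, 8))
depth = np.full((4, 6), 2.0)
K = np.array([[5.0, 0.0, 3.0], [0.0, 5.0, 2.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
points, feats = backproject_latents(latent, depth, K, pose)
warped, mask = warp_latents(points, feats, K, pose, 4, 6)
```

In this sketch the cache holds C-dimensional latent features rather than RGB values, so a query returns a latent grid directly, with no render-then-encode step. The returned mask marks the cells that no cached point reaches; how such holes are filled by the generative model is outside the scope of this sketch.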