Latent Spatial Memory for Video World Models

Abstract

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is computationally expensive, because it requires repeated rendering and VAE encoding, and inherently lossy, because the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models: a persistent 3D cache that stores scene information directly in the diffusion latent space and avoids pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57× faster end-to-end video generation and a 55× reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
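
The two operations named above, lifting latent tokens into 3D by depth-guided back-projection and querying the memory by warping into a new view, can be illustrated with a minimal NumPy sketch. This is not the Mirage implementation, which the abstract does not specify. The function names `lift_latents` and `query_memory`, the pinhole intrinsics `K` expressed at latent-grid resolution, the availability of one depth value per latent token, and the nearest-point z-buffer used to resolve collisions are all assumptions made for illustration.

```python
import numpy as np

def lift_latents(latent, depth, K, cam_to_world):
    """Back-project an (H, W, C) latent grid to world-space points.

    Assumes `depth` holds one z-depth per latent token and `K` is a 3x3
    pinhole intrinsics matrix scaled to the latent grid (hypothetical setup).
    """
    H, W, C = latent.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Token centres in homogeneous pixel coordinates.
    pix = np.stack([u + 0.5, v + 0.5, np.ones((H, W))], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)    # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    pts_world = (pts_h @ cam_to_world.T)[:, :3]
    return pts_world, latent.reshape(-1, C)

def query_memory(points, feats, K, world_to_cam, H, W):
    """Project stored latent features into a target view.

    Returns an (H, W, C) latent grid and a boolean mask of filled cells.
    When several points land in one cell, the nearest one is kept.
    """
    C = feats.shape[1]
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]
    front = pts_cam[:, 2] > 1e-6             # drop points behind the camera
    pts_cam, f = pts_cam[front], feats[front]
    proj = pts_cam @ K.T
    u = np.floor(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.floor(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, z, f = u[inside], v[inside], pts_cam[inside, 2], f[inside]
    # Sort nearest first, then keep the first point that hits each cell.
    order = np.argsort(z)
    _, first = np.unique((v * W + u)[order], return_index=True)
    keep = order[first]
    out = np.zeros((H, W, C))
    mask = np.zeros((H, W), dtype=bool)
    out[v[keep], u[keep]] = f[keep]
    mask[v[keep], u[keep]] = True
    return out, mask
```

In this sketch the cache holds only 3D positions and latent feature vectors, so a query never decodes to RGB or re-encodes through the VAE. Cells where `mask` is false correspond to regions the memory has not observed, which a generative model would have to fill in.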