MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffolds

Abstract

We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.
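To make the "lifting" step more concrete, the following is a minimal sketch of how a gravity-aligned panorama with per-pixel depth could be unprojected into 3D points that might serve as initial Gaussian centers. It is an illustration only, not the method described above: it assumes an equirectangular layout, a +y up axis, and depth measured along the viewing ray, and the function names `panorama_pixel_to_ray` and `lift_panorama` are hypothetical. The geometry-aware residual prediction and the remaining Gaussian attributes (scale, rotation, opacity, color) are not modeled here.

```python
import math


def panorama_pixel_to_ray(u, v, width, height):
    """Map an equirectangular pixel to a unit ray direction.

    Assumption (not from the source): +y is the up axis (opposite to gravity),
    +z is the forward direction at the panorama center, and pixel centers sit
    at (u + 0.5, v + 0.5).
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi  # longitude in [-pi, pi)
    lat = (0.5 - (v + 0.5) / height) * math.pi       # latitude in (-pi/2, pi/2)
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def lift_panorama(depth):
    """Unproject a depth panorama (list of rows) into a list of 3D points.

    Each depth value is assumed to be the distance along the viewing ray.
    """
    height = len(depth)
    width = len(depth[0])
    points = []
    for v in range(height):
        for u in range(width):
            dx, dy, dz = panorama_pixel_to_ray(u, v, width, height)
            d = depth[v][u]
            points.append((d * dx, d * dy, d * dz))
    return points
```

Because every pixel of a 360° panorama maps to a ray, a single depth panorama yields points in all directions around the camera, which is one way to read the abstract's claim that closing the field of view first gives a dense spatial memory.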