VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory

Abstract

We propose a novel memory mechanism to build video generators that can explore environments interactively. Similar results have previously been achieved by out-painting 2D views of the scene while incrementally reconstructing its 3D geometry, which quickly accumulates errors, or by video generators with a short context window, which struggle to maintain scene coherence over the long term. To address these limitations, we introduce Surfel-Indexed View Memory (VMem), a mechanism that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed. VMem enables the efficient retrieval of the most relevant past views when generating new ones. By focusing only on these relevant views, our method produces consistent explorations of imagined environments at a fraction of the computational cost of using all past views as context. We evaluate our approach on challenging long-term scene synthesis benchmarks and demonstrate superior performance compared to existing methods in maintaining scene coherence and camera control.
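The retrieval idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the `Surfel` class, the cone-based `is_visible` test, and the vote-counting `retrieve_views` function are all hypothetical names and simplifications introduced here, assuming each surfel stores the indices of the past views that observed it and that relevance is scored by how many currently visible surfels a past view has seen.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Surfel:
    """Hypothetical surfel record: a surface point plus the views that saw it."""
    position: Vec3
    normal: Vec3
    view_ids: Set[int] = field(default_factory=set)


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _unit(v: Vec3) -> Vec3:
    n = math.sqrt(_dot(v, v))
    return (v[0] / n, v[1] / n, v[2] / n)


def is_visible(s: Surfel, cam_pos: Vec3, cam_dir: Vec3,
               half_fov_deg: float = 45.0) -> bool:
    """Assumed visibility test: inside a view cone and facing the camera.

    A real system would rasterize surfels with occlusion handling;
    this cone test is only a stand-in for illustration.
    """
    to_surfel = _sub(s.position, cam_pos)
    if _dot(to_surfel, to_surfel) == 0.0:
        return False
    ray = _unit(to_surfel)
    in_cone = _dot(ray, _unit(cam_dir)) >= math.cos(math.radians(half_fov_deg))
    facing_camera = _dot(s.normal, ray) < 0.0
    return in_cone and facing_camera


def retrieve_views(surfels: List[Surfel], cam_pos: Vec3, cam_dir: Vec3,
                   k: int = 2) -> List[int]:
    """Return up to k past view indices with the most visible-surfel votes."""
    votes: Counter = Counter()
    for s in surfels:
        if is_visible(s, cam_pos, cam_dir):
            votes.update(s.view_ids)
    # Rank by vote count, breaking ties by the lower view index.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [view_id for view_id, _ in ranked[:k]]


# Toy memory: three surfels in front of the origin, one behind it.
memory = [
    Surfel((0.0, 0.0, 5.0), (0.0, 0.0, -1.0), {0, 1}),
    Surfel((1.0, 0.0, 5.0), (0.0, 0.0, -1.0), {1}),
    Surfel((0.0, 1.0, 5.0), (0.0, 0.0, -1.0), {1, 3}),
    Surfel((0.0, 0.0, -5.0), (0.0, 0.0, 1.0), {2}),
]
print(retrieve_views(memory, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), k=2))
```

In this sketch only the retrieved views would be passed to the generator as context, which is where the cost saving over conditioning on every past view would come from.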

