Video World Models with Long-term Spatial Memory

Abstract

Emerging world models autoregressively generate video frames in response to control signals such as camera movements and text prompts. Because their temporal context windows are limited, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing the long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.
