Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Abstract

Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects move out of sight and later re-emerge, current methods often struggle, producing frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm that requires models to act simultaneously as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It comprises 59K high-fidelity clips with decoupled camera and subject trajectories, covering 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events for rigorously evaluating hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and uses a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
