Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
March 26, 2026
Authors: Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, Xiang Bai
cs.AI
Abstract
Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects temporarily move out of sight and later re-emerge, current methods often struggle, producing frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and employs a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
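The abstract describes retrieval over a bank of compressed memory tokens, where only the tokens most relevant to the current frame are attended to. The sketch below is a minimal, hypothetical illustration of that idea (it is not the authors' HyDRA implementation): relevance is scored by dot product between a query token and the memory tokens, the top-k tokens are selected, and a softmax-weighted sum yields the retrieved context. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def retrieve_memory(query, memory, top_k=4):
    """Toy relevance-driven retrieval over compressed memory tokens.

    query:  (d,) query token for the current frame
    memory: (n, d) bank of compressed memory tokens
    Returns a (d,) context vector: softmax-weighted sum of the
    top_k memory tokens with the highest dot-product relevance.
    """
    scores = memory @ query                      # (n,) relevance scores
    idx = np.argsort(scores)[-top_k:]            # indices of top-k tokens
    sel = memory[idx]                            # (top_k, d) selected tokens
    w = np.exp(scores[idx] - scores[idx].max())  # numerically stable softmax
    w /= w.sum()
    return w @ sel                               # (d,) retrieved context

# Toy usage with random tokens.
rng = np.random.default_rng(0)
mem = rng.normal(size=(16, 8))   # 16 memory tokens of dimension 8
q = rng.normal(size=8)           # current query token
ctx = retrieve_memory(q, mem)
print(ctx.shape)  # (8,)
```

In a full model, the query and memory tokens would be learned projections, the selection would feed a cross-attention layer, and spatiotemporal position information would enter the relevance score; this sketch only conveys the select-then-attend pattern.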