
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

March 26, 2026
Authors: Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, Xiang Bai
cs.AI

Abstract

Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
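The abstract describes HyDRA as compressing memory into tokens and retrieving them with a spatiotemporal relevance-driven mechanism. The sketch below is a minimal, hypothetical illustration of that idea (the function name, scoring formula, and all parameters are assumptions, not the paper's actual implementation): each memory token carries an (x, y, t) coordinate, and retrieval scores tokens by feature similarity plus a spatiotemporal proximity bonus before attending over the top-k hits.

```python
import numpy as np

def retrieve_memory_tokens(memory, mem_coords, query, query_coord,
                           top_k=4, tau=1.0):
    """Hypothetical sketch of relevance-driven memory retrieval.

    memory:      (N, d) array of compressed memory tokens
    mem_coords:  (N, 3) array of (x, y, t) coordinates per token
    query:       (d,)   feature vector for the current frame/subject
    query_coord: (3,)   (x, y, t) coordinate of the query
    tau:         weight trading off proximity against feature similarity
    """
    # Feature relevance: dot product between the query and each token.
    feat_score = memory @ query                       # (N,)
    # Spatiotemporal relevance: negative Euclidean distance in (x, y, t).
    st_score = -np.linalg.norm(mem_coords - query_coord, axis=1)
    score = feat_score + tau * st_score
    # Keep only the top-k most relevant tokens (selective attention).
    idx = np.argsort(score)[::-1][:top_k]
    # Softmax over the retrieved subset, then form a weighted readout.
    logits = score[idx]
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return idx, weights @ memory[idx]                 # fused memory readout
```

Restricting attention to the top-k tokens is what makes the retrieval selective: background tokens far from the subject's last known (x, y, t) position score low and are never attended, which is one plausible way to keep a hidden subject's identity and motion cues dominant in the readout.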