MagicWorld: Interactive Geometry-driven Video World Exploration

Abstract

Recent interactive video world models generate scene evolution conditioned on user instructions. Although they achieve impressive results, two key limitations remain. First, they do not fully exploit the correspondence between instruction-driven scene motion and the underlying 3D geometry, which leads to structural instability under viewpoint changes. Second, they tend to forget historical information during multi-step interaction, which causes error accumulation and progressive drift in scene semantics and structure. To address these issues, we propose MagicWorld, an interactive video world model that integrates 3D geometric priors with history retrieval. Starting from a single scene image, MagicWorld uses user actions to drive dynamic scene evolution and autoregressively synthesizes continuous scenes. We introduce the Action-Guided 3D Geometry Module (AG3D), which constructs a point cloud from the first frame of each interaction and the corresponding action, providing explicit geometric constraints for viewpoint transitions and thereby improving structural consistency. We further propose a History Cache Retrieval (HCR) mechanism, which retrieves relevant historical frames during generation and injects them as conditioning signals, helping the model use past scene information and mitigate error accumulation. Experimental results demonstrate that MagicWorld achieves notable improvements in scene stability and continuity across interaction iterations.
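To make the retrieval idea behind HCR concrete, the following is a minimal sketch and not the MagicWorld implementation. It assumes that each past frame is summarized by a feature vector and that relevance is measured by cosine similarity. Neither assumption is stated in the abstract, and the names `HistoryCache`, `add`, and `retrieve` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class HistoryCache:
    """Hypothetical cache of (frame_id, feature) pairs from earlier interaction steps."""

    def __init__(self) -> None:
        self._entries: List[Tuple[int, List[float]]] = []

    def add(self, frame_id: int, feature: Vector) -> None:
        """Store the feature vector of a generated frame (assumed representation)."""
        self._entries.append((frame_id, list(feature)))

    def retrieve(self, query: Vector, top_k: int = 2) -> List[int]:
        """Return the ids of the top_k cached frames most similar to the query."""
        scored = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(query, entry[1]),
            reverse=True,
        )
        return [frame_id for frame_id, _ in scored[:top_k]]
```

In an autoregressive loop, the frames returned by `retrieve` would be passed to the generator as additional conditioning for the next interaction step. How MagicWorld encodes frames, scores relevance, and injects the retrieved frames is not specified in the abstract.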