
VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

January 8, 2026
Authors: Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, Yanwei Fu
cs.AI

Abstract

Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, because video dynamics are inherently expressed in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. A further major challenge is the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.
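To make the control representation more concrete, the following is a minimal sketch of how a world state built from a static background point cloud and per-object 3D Gaussian trajectories might be stored and projected into a camera view. It is not the authors' implementation: the names `GaussianState`, `WorldState4D`, `project_point`, and `project_gaussian` are hypothetical, and the sketch assumes a pinhole camera, coordinates already expressed in the camera frame, and a first-order (Jacobian-based) projection of each 3D covariance to a 2D footprint. The abstract does not specify how the conditioning signals are actually rendered.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass
class GaussianState:
    """One object at one frame (hypothetical structure, camera coordinates)."""
    mean: Vec3  # object center
    cov: Mat3   # 3x3 covariance, standing in for probabilistic 3D occupancy


@dataclass
class WorldState4D:
    """Assumed container: static background plus per-object trajectories."""
    background_points: List[Vec3]
    object_trajectories: List[List[GaussianState]]  # indexed [object][frame]


def project_point(p: Vec3, focal: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = p
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal * x / z + cx, focal * y / z + cy)


def project_gaussian(g: GaussianState, focal: float, cx: float, cy: float):
    """Approximate the 2D footprint of a 3D Gaussian as J * cov * J^T,
    where J is the Jacobian of the pinhole projection at the mean."""
    x, y, z = g.mean
    mean2d = project_point(g.mean, focal, cx, cy)
    jac = (
        (focal / z, 0.0, -focal * x / (z * z)),
        (0.0, focal / z, -focal * y / (z * z)),
    )
    # jc = J @ cov, a 2x3 matrix
    jc = [[sum(jac[i][k] * g.cov[k][j] for k in range(3)) for j in range(3)]
          for i in range(2)]
    # cov2d = jc @ J^T, a 2x2 matrix
    cov2d = [[sum(jc[i][k] * jac[j][k] for k in range(3)) for j in range(2)]
             for i in range(2)]
    return mean2d, cov2d


def render_object_controls(world: WorldState4D, frame: int,
                           focal: float, cx: float, cy: float):
    """Project every object's Gaussian for one frame into the image plane."""
    return [project_gaussian(traj[frame], focal, cx, cy)
            for traj in world.object_trajectories]
```

In this sketch, a per-frame list of 2D means and covariances could be rasterized into a soft occupancy map and passed to a video model as a conditioning channel. That last step is an assumption about one plausible design, not a description of VerseCrafter's pipeline.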