VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

Abstract

Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, because video inherently represents dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach centers on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. A further major challenge is the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a large and diverse dataset.
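To make the idea of a per-object 3D Gaussian trajectory concrete, the following is a minimal sketch of how one such trajectory might be stored and projected into the image plane. It is an illustration under stated assumptions, not the rendering procedure used by VerseCrafter, which the abstract does not specify. The names `Gaussian3D`, `project_gaussian`, and `project_trajectory` are hypothetical, the Gaussians are assumed to be expressed in camera coordinates already, and the projection uses a pinhole camera with a first-order (Jacobian) approximation of the covariance.

```python
from dataclasses import dataclass
from typing import List, Tuple

Matrix = List[List[float]]


@dataclass
class Gaussian3D:
    """One object's state at one frame (hypothetical container)."""
    mean: Tuple[float, float, float]  # object centre, camera coordinates
    cov: Matrix                       # 3x3 covariance, i.e. soft 3D occupancy


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for 2x3 and 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def project_gaussian(g: Gaussian3D, fx: float, fy: float,
                     cx: float, cy: float):
    """Project a 3D Gaussian to a 2D Gaussian with a pinhole camera.

    Assumption: the covariance is propagated through the Jacobian J of
    the perspective projection, cov2d = J * cov * J^T.
    """
    x, y, z = g.mean
    if z <= 0:
        raise ValueError("Gaussian centre must be in front of the camera")
    mean2d = (fx * x / z + cx, fy * y / z + cy)
    jac = [[fx / z, 0.0, -fx * x / (z * z)],
           [0.0, fy / z, -fy * y / (z * z)]]
    cov2d = matmul(matmul(jac, g.cov), transpose(jac))
    return mean2d, cov2d


def project_trajectory(traj: List[Gaussian3D], fx: float, fy: float,
                       cx: float, cy: float):
    """Project every frame of one object's trajectory."""
    return [project_gaussian(g, fx, fy, cx, cy) for g in traj]
```

A trajectory is then simply a list with one `Gaussian3D` per frame. Projecting it yields a 2D mean and a 2x2 covariance per frame, which could be rasterized into a soft per-object map and stacked with a rendering of the background point cloud to form a conditioning signal. That final step is likewise an assumption rather than something stated in the abstract.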
