BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
December 4, 2025
作者: Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, Gordon Wetzstein
cs.AI
Abstract
Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: https://19reborn.github.io/Bullet4D/
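The abstract names two injection mechanisms for the conditioning signals: a 4D positional encoding applied in the attention layers and adaptive normalization for feature modulation, both driven by per-frame world time and camera pose. The sketch below illustrates how such a conditioning pathway could look in PyTorch; the module names (FourDConditioner, ConditionedBlock), the sinusoidal time embedding, and the flattened 3x4 extrinsics parameterization are assumptions made for illustration, not the paper's implementation.

```python
# Hypothetical sketch of the two conditioning pathways described in the abstract,
# assuming a DiT-style video backbone. Names, dimensions, and encodings are
# illustrative choices, not the authors' actual architecture.
import math
import torch
import torch.nn as nn


def sinusoidal_embedding(x: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of a scalar per frame (here: world time)."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000.0) * torch.arange(half, device=x.device, dtype=x.dtype) / half
    )
    angles = x[..., None] * freqs                              # (..., half)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)     # (..., 2*half)


class FourDConditioner(nn.Module):
    """Maps per-frame world time + camera pose to (a) an additive positional term
    for the attention tokens and (b) AdaLN scale/shift for feature modulation."""

    def __init__(self, hidden_dim: int, time_dim: int = 256):
        super().__init__()
        self.time_dim = time_dim
        self.time_proj = nn.Linear(time_dim, hidden_dim)
        # Camera pose taken as a flattened 3x4 world-to-camera extrinsic (12 values).
        self.pose_proj = nn.Linear(12, hidden_dim)
        # AdaLN head: produces per-frame scale and shift.
        self.adaln = nn.Sequential(nn.SiLU(), nn.Linear(hidden_dim, 2 * hidden_dim))

    def forward(self, world_time: torch.Tensor, extrinsics: torch.Tensor):
        # world_time: (B, F) continuous time per frame
        # extrinsics: (B, F, 3, 4) camera pose per frame
        t_emb = self.time_proj(sinusoidal_embedding(world_time, self.time_dim))
        c_emb = self.pose_proj(extrinsics.flatten(-2))
        cond = t_emb + c_emb                                   # fused 4D code, (B, F, H)
        pos_bias = cond                                        # added to tokens before attention
        scale, shift = self.adaln(cond).chunk(2, dim=-1)       # AdaLN parameters
        return pos_bias, scale, shift


class ConditionedBlock(nn.Module):
    """One transformer block showing where the two signals are injected."""

    def __init__(self, hidden_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, x, pos_bias, scale, shift):
        # x: (B, F, H) frame tokens (spatial tokens omitted for brevity)
        h = self.norm(x) * (1 + scale) + shift                 # AdaLN feature modulation
        h = h + pos_bias                                       # 4D positional term in attention input
        out, _ = self.attn(h, h, h)
        return x + out
```

Under this kind of decoupled conditioning, a bullet-time shot would hold world_time constant across frames while the extrinsics sweep an orbit, whereas a fixed camera with advancing world_time would play out scene dynamics alone.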