CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving
October 9, 2025
Authors: Tianrui Zhang, Yichen Liu, Zilin Guo, Yuxin Guo, Jingcheng Ni, Chenjing Ding, Dan Xu, Lewei Lu, Zehuan Wu
cs.AI
Abstract
Generative models have been widely applied to world modeling for environment simulation and future state prediction. With advancements in autonomous driving, there is a growing demand not only for high-fidelity video generation under various controls, but also for producing diverse and meaningful information such as depth estimation. To address this, we propose CVD-STORM, a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE) that generates long-term, multi-view videos with 4D reconstruction capabilities under various control inputs. Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Subsequently, we integrate this VAE into the video diffusion process to significantly improve generation quality. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics. Additionally, the jointly-trained Gaussian Splatting Decoder effectively reconstructs dynamic scenes, providing valuable geometric information for comprehensive scene understanding.
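The two-stage recipe in the abstract (fine-tune the VAE with an auxiliary 4D reconstruction loss, then train a diffusion model in its latent space) can be sketched as below. This is a minimal NumPy illustration only: every function, shape, and the loss weight `w_aux` is an assumption for exposition, not the paper's actual architecture, which uses neural encoders/decoders and a Gaussian Splatting head.

```python
import numpy as np

rng = np.random.default_rng(0)

# All functions below are hypothetical stand-ins for neural networks.

def encode(video):
    """VAE encoder stand-in: (T, H, W, 3) video -> (T, H, W) latent."""
    return video.mean(axis=-1)

def decode_rgb(latent):
    """Pixel decoder stand-in: latent -> reconstructed video."""
    return np.repeat(latent[..., None], 3, axis=-1)

def decode_geometry(latent):
    """Auxiliary 4D-reconstruction head stand-in (e.g. depth / Gaussians)."""
    return 0.5 * latent

def stage1_vae_loss(video, depth_target, w_aux=0.1):
    """Stage 1: fine-tune the VAE with an auxiliary 4D reconstruction task."""
    z = encode(video)
    l_rgb = np.mean((decode_rgb(z) - video) ** 2)              # pixel term
    l_geo = np.mean((decode_geometry(z) - depth_target) ** 2)  # geometry term
    return l_rgb + w_aux * l_geo

def stage2_denoise_loss(video):
    """Stage 2: train a diffusion model in the fine-tuned VAE's latent space."""
    z = encode(video)
    noise = rng.normal(size=z.shape)
    pred_noise = np.zeros_like(noise)  # placeholder for the denoising network
    return np.mean((pred_noise - noise) ** 2)

video = rng.random((4, 8, 8, 3))  # 4 frames of a toy 8x8 clip
depth = rng.random((4, 8, 8))     # toy geometry target for the auxiliary head
print(stage1_vae_loss(video, depth))
print(stage2_denoise_loss(video))
```

The point of the sketch is the structure of the objective: stage 1 optimizes a weighted sum of a pixel reconstruction term and an auxiliary geometry term so the latent carries 3D/temporal structure, and stage 2 runs standard denoising training on those latents.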