Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering
January 14, 2026
Authors: Jieying Chen, Jeffrey Hu, Joan Lasenby, Ayush Tewari
cs.AI
Abstract
Modern video generative models based on diffusion models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interactions, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex camera motion. This results in video generation that is more than 40 times faster than the diffusion-based baseline in generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.
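Below is a minimal, illustrative sketch (not the authors' implementation) of the pipeline the abstract describes: pick a keyframe budget from the complexity of the camera trajectory, generate only those keyframes with a diffusion model, lift them to a 3D representation, and render every frame of the trajectory from that reconstruction. All function names, the complexity heuristic, and the default parameters are hypothetical placeholders; the paper's learned keyframe-count predictor is replaced here by a simple hand-written heuristic.

```python
"""Illustrative sketch of an SRENDER-style pipeline (assumptions, not the paper's code)."""
from typing import Callable, List, Sequence
import numpy as np

# Assumption: a camera pose is a 4x4 world-to-camera matrix.
Pose = np.ndarray


def trajectory_complexity(poses: Sequence[Pose]) -> float:
    """Accumulated rotation (radians) plus translation (scene units) along the
    trajectory -- a hand-written stand-in for the learned keyframe-count predictor."""
    total = 0.0
    for a, b in zip(poses[:-1], poses[1:]):
        rel = np.linalg.inv(a) @ b
        # Rotation angle recovered from the trace of the relative rotation matrix.
        cos_angle = np.clip((np.trace(rel[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        total += float(np.arccos(cos_angle)) + float(np.linalg.norm(rel[:3, 3]))
    return total


def choose_num_keyframes(poses: Sequence[Pose],
                         min_k: int = 2, max_k: int = 16,
                         scale: float = 4.0) -> int:
    """Map trajectory complexity to a keyframe budget: very sparse for simple
    trajectories, denser for complex camera motion (thresholds are assumptions)."""
    k = int(round(scale * trajectory_complexity(poses)))
    return max(min_k, min(max_k, k))


def sparse_keyframe_pipeline(
    poses: Sequence[Pose],
    generate_keyframes: Callable[[Sequence[Pose]], List[np.ndarray]],
    reconstruct_3d: Callable[[List[np.ndarray], Sequence[Pose]], object],
    render_view: Callable[[object, Pose], np.ndarray],
) -> List[np.ndarray]:
    """Amortize the expensive diffusion step over the whole video: only k keyframes
    are generated; all other frames are rendered from the reconstructed 3D scene."""
    k = choose_num_keyframes(poses)
    key_idx = np.linspace(0, len(poses) - 1, k).round().astype(int)
    key_poses = [poses[i] for i in key_idx]

    keyframes = generate_keyframes(key_poses)     # expensive: diffusion sampling
    scene = reconstruct_3d(keyframes, key_poses)  # e.g. Gaussians, mesh, or point cloud
    return [render_view(scene, p) for p in poses]  # cheap: per-frame 3D rendering


if __name__ == "__main__":
    # Synthetic dolly trajectory: 100 frames translating 2 units along +z.
    poses = []
    for t in np.linspace(0.0, 2.0, 100):
        p = np.eye(4)
        p[2, 3] = t
        poses.append(p)
    print("keyframe budget for this trajectory:", choose_num_keyframes(poses))
```

In this sketch the diffusion sampler, 3D reconstructor, and renderer are passed in as callables, since the abstract does not specify which models or 3D representation SRENDER uses; the key point it illustrates is that the per-frame cost after reconstruction is only a render call, which is what yields the reported speedup over running diffusion for every frame.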