Lighting-grounded Video Generation with Renderer-based Agent Reasoning
April 9, 2026
Authors: Ziqi Cai, Taoyu Yang, Zheng Chang, Si Li, Han Jiang, Shuchen Weng, Boxin Shi
cs.AI
Abstract
Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production, where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundation video diffusion model, ensuring stable convergence and high fidelity. The framework enables a wide range of applications, including image-to-video and video-to-video synthesis in which the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.
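The disentanglement claim can be illustrated with a minimal sketch: a unified 3D scene representation from which each factor (layout, lighting, camera) is rendered into its own control signal, so that editing one factor leaves the others' signals unchanged. All class names, fields, and signal types below are hypothetical placeholders for illustration, not the paper's actual data schema or interfaces.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Scene3D:
    """Hypothetical unified 3D scene representation (illustrative only)."""
    layout: tuple    # e.g. object positions
    lighting: tuple  # e.g. light direction and intensity
    camera: tuple    # e.g. camera trajectory keyframes

def render_control_signals(scene: Scene3D) -> dict:
    """Render one control signal per scene factor from the shared representation.

    Because each factor maps to its own signal, editing one factor (e.g.
    relighting) leaves the other factors' signals untouched -- the kind of
    disentangled control the abstract describes.
    """
    return {
        "layout":   ("layout_map", scene.layout),
        "lighting": ("shading_map", scene.lighting),
        "camera":   ("camera_embedding", scene.camera),
    }

scene = Scene3D(layout=((0.0, 0.0, 1.0),), lighting=(1.0, 0.5), camera=((0, 0, -3),))
relit = replace(scene, lighting=(0.2, 0.9))  # edit only the lighting factor

before, after = render_control_signals(scene), render_control_signals(relit)
assert before["layout"] == after["layout"]      # layout signal unchanged
assert before["camera"] == after["camera"]      # camera signal unchanged
assert before["lighting"] != after["lighting"]  # only lighting signal differs
```

In the actual system these signals would be rendered images or embeddings fed to the video diffusion model through the conditioning module, rather than plain tuples; the sketch only shows the per-factor factorization.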