Lighting-grounded Video Generation with Renderer-based Agent Reasoning
April 9, 2026
Authors: Ziqi Cai, Taoyu Yang, Zheng Chang, Si Li, Han Jiang, Shuchen Weng, Boxin Shi
cs.AI
Abstract
Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis, where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.
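The abstract only sketches the scene agent at a high level; the paper holds the actual design. As a purely illustrative toy (all class names, fields, and rules below are hypothetical, not the paper's interface), an agent that turns a high-level instruction into explicit 3D control signals for lighting and camera might look like:

```python
from dataclasses import dataclass, field

# Hypothetical 3D scene state; LiVER's actual unified 3D representation
# (layout, lighting, camera parameters) is far richer than this.
@dataclass
class SceneState:
    layout: dict = field(default_factory=dict)  # object name -> (x, y, z)
    lighting: dict = field(
        default_factory=lambda: {"intensity": 1.0, "color": (1.0, 1.0, 1.0)}
    )
    camera: dict = field(
        default_factory=lambda: {"position": (0.0, 0.0, 5.0), "yaw_deg": 0.0}
    )

def scene_agent(instruction: str, state: SceneState) -> SceneState:
    """Toy rule-based stand-in for the paper's scene agent: map a
    high-level user instruction onto explicit 3D control signals
    that a renderer could then turn into conditioning frames."""
    cmd = instruction.lower()
    if "dim" in cmd:
        state.lighting["intensity"] *= 0.5
    if "brighten" in cmd:
        state.lighting["intensity"] *= 2.0
    if "orbit" in cmd:
        state.camera["yaw_deg"] += 90.0
    return state

state = scene_agent("dim the lights and orbit the camera", SceneState())
print(state.lighting["intensity"], state.camera["yaw_deg"])  # 0.5 90.0
```

The point of the pattern is the decoupling the abstract describes: the agent edits only the explicit scene state, and each factor (lighting, camera, layout) can be changed independently before the rendered signals condition the video model.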