Lighting-grounded Video Generation with Renderer-based Agent Reasoning

Abstract

Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, which restricts the applicability of these models in domains like filmmaking and virtual production, where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation that conditions video synthesis on explicit 3D scene properties and is supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. The framework enables a wide range of applications, including image-to-video and video-to-video synthesis in which the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.
