V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties

Abstract

Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.
