Dynamic View Synthesis as an Inverse Problem

Abstract

In this work, we address dynamic view synthesis from monocular videos as an inverse problem in a training-free setting. By redesigning the noise initialization phase of a pre-trained video diffusion model, we enable high-fidelity dynamic view synthesis without any weight updates or auxiliary modules. We begin by identifying a fundamental obstacle to deterministic inversion arising from zero-terminal signal-to-noise ratio (SNR) schedules, and resolve it by introducing a novel noise representation, termed K-order Recursive Noise Representation. We derive a closed-form expression for this representation, enabling precise and efficient alignment between the VAE-encoded and the DDIM-inverted latents. To synthesize newly visible regions resulting from camera motion, we introduce Stochastic Latent Modulation, which performs visibility-aware sampling over the latent space to complete occluded regions. Comprehensive experiments demonstrate that dynamic view synthesis can be effectively performed through structured latent manipulation in the noise initialization phase.
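One way to see why a zero-terminal SNR schedule is awkward for deterministic inversion is to look at the forward process at the final timestep. The sketch below is an illustration only, not the K-order Recursive Noise Representation from the paper. It assumes a linear beta schedule with illustrative values, scalar latents, and the commonly used shift-and-scale rescaling of sqrt(alpha_bar); all function names are hypothetical. Under these assumptions the terminal latent carries no signal from the clean latent, so two different inputs map to the same terminal value.

```python
import math

def linear_betas(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (values are illustrative assumptions)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def sqrt_alpha_bars(betas):
    """Cumulative signal coefficients sqrt(alpha_bar_t) for each timestep."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(math.sqrt(prod))
    return out

def rescale_zero_terminal_snr(sqrt_ab):
    """Shift and scale so the last coefficient is 0 and the first is preserved."""
    first, last = sqrt_ab[0], sqrt_ab[-1]
    return [(s - last) * first / (first - last) for s in sqrt_ab]

def snr(sqrt_ab_t):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar)."""
    ab = sqrt_ab_t ** 2
    return ab / (1.0 - ab)

def forward_latent(x0, eps, sqrt_ab_t):
    """Forward process x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps (scalar case)."""
    return sqrt_ab_t * x0 + math.sqrt(1.0 - sqrt_ab_t ** 2) * eps

base = sqrt_alpha_bars(linear_betas(1000))
zero = rescale_zero_terminal_snr(base)

terminal_snr_base = snr(base[-1])  # small but strictly positive
terminal_snr_zero = snr(zero[-1])  # exactly zero after rescaling

# Two different clean latents with the same noise sample.
eps = 0.3
x_T_a = forward_latent(1.0, eps, zero[-1])
x_T_b = forward_latent(-2.0, eps, zero[-1])

print(terminal_snr_base > 0.0, terminal_snr_zero == 0.0, x_T_a == x_T_b)
```

With the rescaled schedule both terminal latents equal the noise sample itself, which suggests why a noise initialization aligned with the VAE-encoded input needs a representation beyond a plain inversion of the final step.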
