Diffusion Priors for Dynamic View Synthesis from Monocular Videos

Abstract

Dynamic novel view synthesis aims to capture the temporal evolution of visual content within videos. Existing methods struggle to distinguish between motion and structure, particularly in scenarios where camera poses are either unknown or constrained compared to object motion. Furthermore, with information solely from reference images, it is extremely challenging to hallucinate unseen regions that are occluded or partially observed in the given videos. To address these issues, we first finetune a pretrained RGB-D diffusion model on the video frames using a customization technique. Subsequently, we distill the knowledge from the finetuned model to a 4D representation encompassing both dynamic and static Neural Radiance Fields (NeRF) components. The proposed pipeline achieves geometric consistency while preserving the scene identity. We perform thorough experiments to evaluate the efficacy of the proposed method qualitatively and quantitatively. Our results demonstrate the robustness and utility of our approach in challenging cases, further advancing dynamic novel view synthesis.
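
The abstract does not say how the static and dynamic NeRF components are combined at render time. As an illustration only, the sketch below shows one common way to composite two radiance fields along a single ray: the densities are summed, and the colors are mixed in proportion to density before standard volume rendering. The function name `composite_ray`, its arguments, and the blending rule are assumptions made for this example, not the authors' confirmed formulation.

```python
import math


def composite_ray(static_sigma, static_rgb, dynamic_sigma, dynamic_rgb, deltas):
    """Blend static and dynamic field samples along one ray.

    Hypothetical helper: the density-weighted blending used here is an
    assumed, commonly used scheme, not taken from the paper.
    Each argument is a list with one entry per sample along the ray;
    colors are (r, g, b) tuples and deltas are distances between samples.
    Returns the rendered color and the accumulated opacity.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    samples = zip(static_sigma, static_rgb, dynamic_sigma, dynamic_rgb, deltas)
    for s_sigma, s_rgb, d_sigma, d_rgb, dt in samples:
        sigma = s_sigma + d_sigma  # total density at this sample
        alpha = 1.0 - math.exp(-sigma * dt)  # opacity of this segment
        if sigma > 0.0:
            # Mix the two colors in proportion to each field's density.
            mixed = [(s_sigma * a + d_sigma * b) / sigma
                     for a, b in zip(s_rgb, d_rgb)]
        else:
            mixed = [0.0, 0.0, 0.0]
        weight = transmittance * alpha
        color = [acc + weight * m for acc, m in zip(color, mixed)]
        transmittance *= 1.0 - alpha
    return color, 1.0 - transmittance


if __name__ == "__main__":
    # One ray with two samples: a faint static background behind a moving object.
    rgb, opacity = composite_ray(
        static_sigma=[0.0, 2.0],
        static_rgb=[(0.2, 0.2, 0.2), (0.2, 0.2, 0.2)],
        dynamic_sigma=[1.5, 0.0],
        dynamic_rgb=[(0.9, 0.1, 0.1), (0.9, 0.1, 0.1)],
        deltas=[0.5, 0.5],
    )
    print(rgb, opacity)
```

With the dynamic density set to zero everywhere, this reduces to ordinary single-field volume rendering, which is why a split of this kind lets a static background and moving content share one renderer.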
