

Diffusion Priors for Dynamic View Synthesis from Monocular Videos

January 10, 2024
Authors: Chaoyang Wang, Peiye Zhuang, Aliaksandr Siarohin, Junli Cao, Guocheng Qian, Hsin-Ying Lee, Sergey Tulyakov
cs.AI

Abstract

Dynamic novel view synthesis aims to capture the temporal evolution of visual content within videos. Existing methods struggle to distinguish between motion and structure, particularly in scenarios where camera poses are either unknown or constrained relative to object motion. Furthermore, with information solely from reference images, it is extremely challenging to hallucinate unseen regions that are occluded or partially observed in the given videos. To address these issues, we first finetune a pretrained RGB-D diffusion model on the video frames using a customization technique. Subsequently, we distill the knowledge from the finetuned model into a 4D representation encompassing both dynamic and static Neural Radiance Field (NeRF) components. The proposed pipeline achieves geometric consistency while preserving the scene identity. We perform thorough experiments to evaluate the efficacy of the proposed method qualitatively and quantitatively. Our results demonstrate the robustness and utility of our approach in challenging cases, further advancing dynamic novel view synthesis.
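
To make the described pipeline concrete, below is a minimal, hypothetical sketch of its second stage: distilling a finetuned RGB-D diffusion prior into a 4D representation that combines a static and a dynamic NeRF component, using a score-distillation-style loss. All names here (TinyNeRF, render_rgbd, rgbd_diffusion_score) are illustrative placeholders, not the authors' code or any real library API; the actual method's rendering, noise schedule, and loss details are not specified in the abstract.

```python
# Hypothetical sketch: distill a (placeholder) RGB-D diffusion prior into a
# 4D representation with static + dynamic NeRF components.
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Toy radiance field: maps (x, y, z[, t]) to RGB + density."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 4),  # (r, g, b, sigma)
        )

    def forward(self, pts):
        out = self.net(pts)
        rgb = torch.sigmoid(out[..., :3])
        sigma = torch.relu(out[..., 3:])
        return rgb, sigma

static_nerf = TinyNeRF(in_dim=3)    # time-invariant scene component
dynamic_nerf = TinyNeRF(in_dim=4)   # time-varying component over (x, y, z, t)
params = list(static_nerf.parameters()) + list(dynamic_nerf.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

def render_rgbd(pts, t):
    """Composite static and dynamic components into RGB-D samples.

    A real implementation would do volumetric ray marching; this toy version
    just blends the two fields by density to keep the sketch short.
    """
    rgb_s, sig_s = static_nerf(pts)
    rgb_d, sig_d = dynamic_nerf(torch.cat([pts, t.expand(pts.shape[0], 1)], -1))
    w = sig_d / (sig_s + sig_d + 1e-6)          # soft blend weight
    rgb = (1 - w) * rgb_s + w * rgb_d
    depth = 1.0 / (sig_s + sig_d + 1e-6)        # stand-in for rendered depth
    return torch.cat([rgb, depth], dim=-1)      # (N, 4) RGB-D

def rgbd_diffusion_score(noisy, t_diff):
    """Placeholder for the finetuned RGB-D diffusion model's noise prediction."""
    return torch.randn_like(noisy)  # in practice: the customized diffusion model

for step in range(100):
    pts = torch.rand(1024, 3)                   # toy 3D sample points
    t_video = torch.rand(1)                     # sampled video timestamp
    rendered = render_rgbd(pts, t_video)

    # Score-distillation-style update: add noise, query the diffusion prior,
    # and nudge the rendering toward what the prior considers likely.
    t_diff = torch.rand(1)
    noise = torch.randn_like(rendered)
    noisy = rendered + t_diff * noise
    with torch.no_grad():
        pred_noise = rgbd_diffusion_score(noisy, t_diff)
    loss = ((pred_noise - noise) * rendered).sum()  # SDS-like surrogate loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The split into two fields mirrors the abstract's claim of separate dynamic and static NeRF components; how the paper actually composites them and weights the distillation against reconstruction of the input video is not detailed here.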