Dynamic View Synthesis as an Inverse Problem

Abstract

In this work, we address dynamic view synthesis from monocular videos as an inverse problem in a training-free setting. By redesigning the noise initialization phase of a pre-trained video diffusion model, we enable high-fidelity dynamic view synthesis without any weight updates or auxiliary modules. We begin by identifying a fundamental obstacle to deterministic inversion arising from zero-terminal signal-to-noise ratio (SNR) schedules and resolve it by introducing a novel noise representation, termed K-order Recursive Noise Representation. We derive a closed-form expression for this representation, enabling precise and efficient alignment between the VAE-encoded and DDIM-inverted latents. To synthesize newly visible regions resulting from camera motion, we introduce Stochastic Latent Modulation, which performs visibility-aware sampling over the latent space to complete occluded regions. Comprehensive experiments demonstrate that dynamic view synthesis can be effectively performed through structured latent manipulation in the noise initialization phase.
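
One way to see why a zero-terminal SNR schedule obstructs deterministic inversion is through the standard forward noising relation x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The sketch below is only an illustration under that assumption and is not the K-order Recursive Noise Representation itself. The helper names (`snr`, `forward_noise`) and the value 0.005 for a non-zero terminal abar_T are hypothetical choices made for the example.

```python
import math
import random

def snr(alpha_bar):
    """SNR of x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    return alpha_bar / (1.0 - alpha_bar)

def forward_noise(x0, eps, alpha_bar_t):
    """Standard forward noising of a flat latent (list of floats)."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

random.seed(0)
n = 16
x0_a = [random.gauss(0.0, 1.0) for _ in range(n)]  # stand-in for one encoded latent
x0_b = [random.gauss(0.0, 1.0) for _ in range(n)]  # stand-in for a different latent
eps = [random.gauss(0.0, 1.0) for _ in range(n)]   # shared noise sample

# Non-zero terminal SNR (assumed abar_T = 0.005): x_T still depends on x_0.
xT_a = forward_noise(x0_a, eps, 0.005)
xT_b = forward_noise(x0_b, eps, 0.005)

# Zero terminal SNR (abar_T = 0): x_T equals eps regardless of x_0.
zT_a = forward_noise(x0_a, eps, 0.0)
zT_b = forward_noise(x0_b, eps, 0.0)

print("terminal SNR, non-zero schedule:", snr(0.005))
print("terminal SNR, zero schedule:", snr(0.0))
print("non-zero schedule keeps latents distinct:", xT_a != xT_b)
print("zero schedule collapses latents:", zT_a == zT_b)
```

With abar_T = 0 the terminal latent carries no trace of the input, so two different inputs map to the same terminal noise under this relation. This is the kind of information loss that a modified noise initialization has to work around.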