Dynamic View Synthesis as an Inverse Problem
June 9, 2025
Authors: Hidir Yesiltepe, Pinar Yanardag
cs.AI
Abstract
In this work, we address dynamic view synthesis from monocular videos as an
inverse problem in a training-free setting. By redesigning the noise
initialization phase of a pre-trained video diffusion model, we enable
high-fidelity dynamic view synthesis without any weight updates or auxiliary
modules. We begin by identifying a fundamental obstacle to deterministic
inversion arising from zero-terminal signal-to-noise ratio (SNR) schedules and
resolve it by introducing a novel noise representation, termed K-order
Recursive Noise Representation. We derive a closed-form expression for this
representation, enabling precise and efficient alignment between the
VAE-encoded and the DDIM-inverted latents. To synthesize newly visible regions
resulting from camera motion, we introduce Stochastic Latent Modulation, which
performs visibility-aware sampling over the latent space to complete occluded
regions. Comprehensive experiments demonstrate that dynamic view synthesis can
be effectively performed through structured latent manipulation in the noise
initialization phase.
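
The obstacle that zero-terminal SNR schedules pose for deterministic inversion can be illustrated with a small numerical sketch. It is not the authors' method: it does not implement DDIM inversion, the K-order Recursive Noise Representation, or Stochastic Latent Modulation. It only assumes a standard linear beta schedule and a commonly used rescaling that forces the terminal cumulative alpha to zero. The helper names (`linear_alpha_bars`, `rescale_to_zero_terminal_snr`, `add_noise`) and the toy 4-element "latents" are hypothetical. Under these assumptions, the forward process at the last timestep returns pure noise regardless of the input, so the terminal latent by itself carries no information about the source video.

```python
import math
import random


def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def rescale_to_zero_terminal_snr(alpha_bars):
    """Shift and scale sqrt(alpha_bar) so the final value is exactly 0.

    The first value is preserved up to floating-point rounding.
    """
    sqrt_ab = [math.sqrt(a) for a in alpha_bars]
    first, last = sqrt_ab[0], sqrt_ab[-1]
    scaled = [(s - last) * first / (first - last) for s in sqrt_ab]
    return [s * s for s in scaled]


def add_noise(x0, eps, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]

# Two different toy "latents" that share the same noise sample.
x0_a = [1.0, -2.0, 0.5, 3.0]
x0_b = [-4.0, 0.0, 2.5, 1.0]

plain = linear_alpha_bars()
zero_snr = rescale_to_zero_terminal_snr(plain)

# Plain schedule: the terminal latent still depends (weakly) on x0.
plain_a = add_noise(x0_a, eps, plain[-1])
plain_b = add_noise(x0_b, eps, plain[-1])

# Zero-terminal-SNR schedule: alpha_bar_T == 0, so x_T == eps for any x0.
zero_a = add_noise(x0_a, eps, zero_snr[-1])
zero_b = add_noise(x0_b, eps, zero_snr[-1])

print("terminal alpha_bar (plain):   ", plain[-1])
print("terminal alpha_bar (zero SNR):", zero_snr[-1])
print("plain schedule, x_T differs:  ", plain_a != plain_b)
print("zero SNR, x_T identical:      ", zero_a == zero_b == eps)
```

With a zero terminal SNR, both inputs collapse to the same terminal latent, which is one way to see why aligning a VAE-encoded latent with an inverted latent needs a different noise representation at initialization.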