Dynamic View Synthesis as an Inverse Problem

Abstract

In this work, we address dynamic view synthesis from monocular videos as an inverse problem in a training-free setting. By redesigning the noise initialization phase of a pre-trained video diffusion model, we enable high-fidelity dynamic view synthesis without any weight updates or auxiliary modules. We begin by identifying a fundamental obstacle to deterministic inversion arising from zero-terminal signal-to-noise ratio (SNR) schedules and resolve it by introducing a novel noise representation, termed K-order Recursive Noise Representation. We derive a closed-form expression for this representation, enabling precise and efficient alignment between the VAE-encoded and the DDIM-inverted latents. To synthesize newly visible regions resulting from camera motion, we introduce Stochastic Latent Modulation, which performs visibility-aware sampling over the latent space to complete occluded regions. Comprehensive experiments demonstrate that dynamic view synthesis can be effectively performed through structured latent manipulation in the noise initialization phase.
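
To make the zero-terminal-SNR obstacle concrete, the following is a minimal sketch and not the method proposed here. It assumes a plain linear beta schedule whose cumulative signal coefficients sqrt(alpha_bar_t) are shifted and scaled so that the last one is exactly zero. At that final step the forward marginal x_T = sqrt(alpha_bar_T) x_0 + sqrt(1 - alpha_bar_T) eps no longer depends on x_0, so two different clean latents map to the same terminal latent. All function names, schedule parameters, and toy latents are illustrative assumptions. The sketch does not implement K-order Recursive Noise Representation or Stochastic Latent Modulation.

```python
import math
import random


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Plain linear beta schedule (an illustrative choice, assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def sqrt_alpha_bars(betas):
    """Cumulative signal coefficients sqrt(alpha_bar_t) for each timestep."""
    coeffs, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        coeffs.append(math.sqrt(prod))
    return coeffs


def rescale_zero_terminal_snr(sqrt_ab):
    """Shift and scale so the last coefficient is exactly 0
    while the first coefficient keeps (approximately) its value."""
    first, last = sqrt_ab[0], sqrt_ab[-1]
    scale = first / (first - last)
    return [(s - last) * scale for s in sqrt_ab]


def forward_noise(x0, eps, sqrt_ab_t):
    """Forward marginal: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    sigma = math.sqrt(1.0 - sqrt_ab_t ** 2)
    return [sqrt_ab_t * x + sigma * e for x, e in zip(x0, eps)]


rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]

# Two different toy "clean latents" (hypothetical values).
latent_a = [1.0, -2.0, 0.5, 3.0]
latent_b = [-4.0, 0.0, 2.5, 1.0]

sqrt_ab = rescale_zero_terminal_snr(sqrt_alpha_bars(linear_betas()))

# At the terminal step the signal coefficient is zero, so both latents
# collapse onto the same pure-noise sample.
terminal_a = forward_noise(latent_a, eps, sqrt_ab[-1])
terminal_b = forward_noise(latent_b, eps, sqrt_ab[-1])
print(sqrt_ab[-1], terminal_a == terminal_b)
```

Because the terminal latent carries no trace of the input under such a schedule, an initialization that is meant to stay aligned with the VAE-encoded latent has to inject that information by other means, which is the role the noise representation above is designed to play.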