WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments

Abstract

We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, leading to ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this with an analysis-by-synthesis test: a static renderer that models only camera motion explains the rigid structure, and its residuals reveal the transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients, so that supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a companion benchmark of paired transient and clean views for sparse-view, transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality, using a single feed-forward pass.
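The residual-to-mask idea can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the grayscale list-of-lists image format, and the fixed threshold of 0.25 are all assumptions made for illustration, and the actual framework distills a learned motion estimator rather than thresholding directly. The sketch only shows how pixels that a static render fails to explain can be flagged and then excluded from a photometric loss.

```python
from typing import List

Image = List[List[float]]  # hypothetical format: grayscale rows with values in [0, 1]


def residual_map(rendered: Image, target: Image) -> Image:
    """Per-pixel absolute error between the static render and the observed view."""
    return [[abs(r - t) for r, t in zip(r_row, t_row)]
            for r_row, t_row in zip(rendered, target)]


def pseudo_motion_mask(residual: Image, threshold: float = 0.25) -> List[List[int]]:
    """Flag pixels the static render cannot explain (1 = likely transient).

    The fixed threshold is an assumption of this sketch, not a value from the paper.
    """
    return [[1 if v > threshold else 0 for v in row] for row in residual]


def masked_l1_loss(rendered: Image, target: Image, mask: List[List[int]]) -> float:
    """Mean L1 error over unmasked (background) pixels only."""
    total, count = 0.0, 0
    for r_row, t_row, m_row in zip(rendered, target, mask):
        for r, t, m in zip(r_row, t_row, m_row):
            if m == 0:  # flagged pixels contribute nothing to the loss
                total += abs(r - t)
                count += 1
    return total / count if count else 0.0


# Toy 3x3 example: the centre pixel holds a moving object the static render misses.
rendered = [[0.2, 0.2, 0.2],
            [0.2, 0.2, 0.2],
            [0.2, 0.2, 0.2]]
target = [[0.2, 0.25, 0.2],
          [0.2, 0.9, 0.2],
          [0.2, 0.2, 0.2]]

mask = pseudo_motion_mask(residual_map(rendered, target))
loss = masked_l1_loss(rendered, target, mask)
```

In this toy case only the centre pixel is flagged, so the loss is averaged over the eight remaining background pixels and the large error caused by the moving object no longer drives the supervision.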
