RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Videos

Abstract

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder