RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

May 29, 2026
Authors: Ulrich Prestel, Stefan Andreas Baumann, Nick Stracke, Björn Ommer
cs.AI

Abstract

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder
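
For readers unfamiliar with the term, "power-law scaling" means that a quality metric follows roughly y = a * x^(-b) in data or compute, which appears as a straight line on log-log axes. The sketch below shows one common way such an exponent can be estimated, using ordinary least squares in log space. It is an illustration only: the function name `fit_power_law`, the functional form, and the data points are assumptions made here, the numbers are synthetic, and nothing in it reproduces the RayDer fitting procedure or results.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares on (log x, log y).

    Hypothetical helper for illustration; not taken from the paper.
    Returns the pair (a, b).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope and intercept of the best-fit line in log-log space.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log a - b * log x, so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope


# Synthetic, made-up points (NOT results from the paper),
# generated exactly from loss = 5 * compute**(-0.3).
compute = [1e3, 1e4, 1e5, 1e6]
loss = [5.0 * c ** -0.3 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"a={a:.3f}, b={b:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating constants up to floating-point error. Real measurements would scatter around the line, and the quality of the fit would need to be checked separately.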