RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Videos

Abstract

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder