RayDer: 実世界の動画からのスケーラブルな自己教師あり新規視点合成

要旨

自己教師あり新規視点合成（NVS）は、動画データの豊富さにもかかわらず、現実の動画に基づく学習の脆弱性や、複数ネットワークシステム設計におけるスケーリング挙動の予測困難さにより、スケールアップが依然として困難である。本稿では、カメラ推定、シーン再構成、レンダリングを単一のバックボーンに統合した統一型フィードフォワードトランスフォーマーであるRayDerを提案する。これにより、自己教師ありNVSは適切に設定された単一モデルのスケーリング問題へと転換される。最小限の動的状態を外乱因子として扱うことで、時間変動するコンテンツを吸収し、制約のない実世界動画での安定した学習を可能とする。重要な点として、RayDerは静的シーンNVSを目標タスクとして維持する：動的コンテンツは、動的シーン（4D）NVSのように再構成されるのではなく、スケーラブルな教師信号としてのみ活用される。複数のモデルサイズとデータの桁違いの規模において、RayDerはデータおよび計算量に対して明確な冪乗則スケーリングを示し、静的シーンデータ混合を凌駕する。多数のベンチマークにおいて、RayDerは最先端の教師あり手法と競合する強力なゼロショット・オープンセット性能を達成する。プロジェクトページ: https://compvis.github.io/rayder

English

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder