RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

Abstract

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder