4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time

Abstract

Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at a few timestamps and render it from any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary novel view-time combinations. Unlike prior 4D approaches (optimization-based, geometry-based, or generative), which struggle with efficiency, generalization, or faithfulness, 4D-LRM learns a unified space-time representation and directly predicts per-pixel 4D Gaussian primitives from posed image tokens across time, enabling fast, high-quality rendering at, in principle, infinite frame rate. Our results demonstrate that scaling spatiotemporal pretraining enables accurate and efficient 4D reconstruction. We show that 4D-LRM generalizes to novel objects, interpolates across time, and handles diverse camera setups. It reconstructs 24-frame sequences in a single forward pass in under 1.5 seconds on a single A100 GPU.
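The abstract does not spell out how the 4D Gaussian primitives are parameterized or rendered. As a minimal sketch, assuming each primitive is a Gaussian over (x, y, z, t) with a full 4x4 covariance, the snippet below slices one primitive at a query time using the standard conditional-Gaussian formulas. The function name `condition_on_time` and the example numbers are hypothetical and only illustrate why such a representation can be queried at any continuous timestamp, which is what "in principle, infinite frame rate" suggests.

```python
import math

def condition_on_time(mean, cov, t):
    """Slice a 4D Gaussian over (x, y, z, t) at time t (illustrative assumption).

    mean: list of 4 floats (x, y, z, t).
    cov:  4x4 symmetric positive-definite covariance as nested lists.
    Returns the conditional 3D mean, the conditional 3D covariance, and an
    unnormalized temporal weight that decays away from the primitive's time.
    """
    mu_t = mean[3]
    var_t = cov[3][3]                       # temporal variance
    cross = [cov[i][3] for i in range(3)]   # covariance between xyz and t
    dt = t - mu_t

    # Conditional mean: the spatial center shifts linearly with time.
    mean_3d = [mean[i] + cross[i] * dt / var_t for i in range(3)]
    # Conditional covariance: Schur complement of the temporal block.
    cov_3d = [[cov[i][j] - cross[i] * cross[j] / var_t for j in range(3)]
              for i in range(3)]
    # Marginal temporal falloff (unnormalized), usable as an opacity factor.
    weight = math.exp(-0.5 * dt * dt / var_t)
    return mean_3d, cov_3d, weight

# Hypothetical primitive centered at t = 0.5 that drifts along x over time.
mean = [0.0, 0.0, 0.0, 0.5]
cov = [
    [1.0, 0.0, 0.0, 0.2],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.2, 0.0, 0.0, 0.1],
]

# Any real-valued timestamp can be queried, not only the input frame times.
for t in (0.5, 0.625, 0.75):
    center, _, w = condition_on_time(mean, cov, t)
    print(t, center, w)
```

Because time is a scalar here, the conditioning needs only a division by the temporal variance rather than a matrix inverse. Under this assumption, each time slice yields ordinary 3D Gaussians that a standard splatting renderer could draw.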
