4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time
June 23, 2025
作者: Ziqiao Ma, Xuweiyi Chen, Shoubin Yu, Sai Bi, Kai Zhang, Chen Ziwen, Sihan Xu, Jianing Yang, Zexiang Xu, Kalyan Sunkavalli, Mohit Bansal, Joyce Chai, Hao Tan
cs.AI
Abstract
Can we scale 4D pretraining to learn general space-time representations that
reconstruct an object from a few views at some times to any view at any time?
We provide an affirmative answer with 4D-LRM, the first large-scale 4D
reconstruction model that takes input from unconstrained views and timestamps
and renders arbitrary novel view-time combinations. Unlike prior 4D approaches
(optimization-based, geometry-based, or generative) that struggle with
efficiency, generalization, or faithfulness, 4D-LRM learns a unified space-time
representation and directly predicts per-pixel 4D Gaussian primitives from
posed image tokens across time, enabling fast, high-quality rendering at, in
principle, infinite frame rate. Our results demonstrate that scaling
spatiotemporal pretraining enables accurate and efficient 4D reconstruction. We
show that 4D-LRM generalizes to novel objects, interpolates across time, and
handles diverse camera setups. It reconstructs 24-frame sequences in one
forward pass in under 1.5 seconds on a single A100 GPU.
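
As a rough, hypothetical sketch of the mechanism described above (not the authors' implementation, whose details are not given in this abstract), the snippet below maps posed image-patch tokens across time to per-pixel 4D Gaussian parameters with a plain transformer encoder. The token layout (RGB patch plus Plücker ray embedding plus timestamp), the 20-dimensional Gaussian parameterization, and all module names and sizes are assumptions made for illustration.

```python
# Hypothetical sketch only: a plain transformer that turns posed image-patch
# tokens (RGB + Plücker ray embedding + timestamp) into per-pixel 4D Gaussian
# parameters. Names, sizes, and the parameterization are assumptions.
import torch
import torch.nn as nn

class SpaceTimeReconstructor(nn.Module):
    def __init__(self, patch=8, dim=768, depth=12, heads=12, gauss_dim=20):
        super().__init__()
        # Each token packs one p x p patch: 3 RGB + 6 ray + 1 time channels per pixel.
        self.tokenize = nn.Linear((3 + 6 + 1) * patch * patch, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        # One 4D Gaussian per pixel, e.g. a 4D (xyzt) mean, 4D scale, rotation as
        # two quaternions, opacity, and RGB color: 4 + 4 + 8 + 1 + 3 = 20 numbers.
        self.head = nn.Linear(dim, gauss_dim * patch * patch)

    def forward(self, patch_tokens):            # (B, N_tokens, (3+6+1)*p*p)
        feats = self.backbone(self.tokenize(patch_tokens))
        return self.head(feats)                 # (B, N_tokens, gauss_dim*p*p)

# Example: 24 posed 128x128 frames -> 24 x 256 patch tokens in one forward pass.
model = SpaceTimeReconstructor(depth=2)         # shallow, just to run quickly
x = torch.randn(1, 24 * 256, (3 + 6 + 1) * 8 * 8)
gaussians = model(x)                            # (1, 6144, 20 * 64)
```

Because each primitive carries its own temporal support, rendering a query time only requires weighting and splatting the predicted Gaussians, which is how a representation like this can, in principle, be rendered at any frame rate.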