4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time

Abstract

Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at a few timestamps and render it from any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary novel view-time combinations. Unlike prior 4D approaches (optimization-based, geometry-based, or generative), which struggle with efficiency, generalization, or faithfulness, 4D-LRM learns a unified space-time representation and directly predicts per-pixel 4D Gaussian primitives from posed image tokens across time, enabling fast, high-quality rendering at, in principle, infinite frame rate. Our results demonstrate that scaling spatiotemporal pretraining enables accurate and efficient 4D reconstruction. We show that 4D-LRM generalizes to novel objects, interpolates across time, and handles diverse camera setups. It reconstructs 24-frame sequences in a single forward pass in under 1.5 seconds on a single A100 GPU.
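
The abstract does not say how a 4D Gaussian primitive is evaluated at an arbitrary timestamp. As a minimal sketch, assuming the standard conditional-Gaussian formulation used in 4D Gaussian splatting (an assumption, not something this abstract confirms for 4D-LRM), a primitive over (x, y, z, t) can be sliced at a query time t into a 3D Gaussian plus a temporal weight. Because t is continuous, this slicing is one way to read the "in principle, infinite frame rate" claim. The function name `slice_4d_gaussian` and its argument layout are hypothetical.

```python
import math


def slice_4d_gaussian(mean, cov, t):
    """Condition a 4D Gaussian over (x, y, z, t) on a query time t.

    Hypothetical helper, not the paper's API.
    mean: four floats (x, y, z, t).
    cov:  4x4 symmetric positive-definite matrix as nested lists.
    Returns (mean_3d, cov_3d, temporal_weight).
    """
    mu_xyz, mu_t = mean[:3], mean[3]
    var_t = cov[3][3]                      # temporal variance
    cross = [cov[i][3] for i in range(3)]  # space-time covariance column
    dt = t - mu_t

    # Conditional mean: the centre moves linearly with the time offset.
    mean_3d = [mu_xyz[i] + cross[i] / var_t * dt for i in range(3)]

    # Conditional covariance (Schur complement); it does not depend on t.
    cov_3d = [
        [cov[i][j] - cross[i] * cross[j] / var_t for j in range(3)]
        for i in range(3)
    ]

    # Unnormalised marginal density in time; it equals 1 at t == mu_t and
    # would scale the primitive's opacity at other times.
    weight = math.exp(-0.5 * dt * dt / var_t)
    return mean_3d, cov_3d, weight
```

Under this assumption, any real-valued t yields a valid 3D Gaussian that a standard splatting rasterizer could draw, so intermediate frames need no extra network inference.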
