L4GM: Large 4D Gaussian Reconstruction Model

Abstract

We present L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input in a single feed-forward pass that takes only a second. Key to our success is a novel dataset of multiview videos containing curated, rendered animated objects from Objaverse. This dataset depicts 44K diverse objects with 110K animations rendered from 48 viewpoints, resulting in 12M videos with a total of 300M frames. We keep L4GM simple for scalability and build it directly on top of LGM, a pretrained 3D Large Reconstruction Model that outputs 3D Gaussian ellipsoids from multiview image input. L4GM outputs a per-frame 3D Gaussian Splatting representation from video frames sampled at a low frame rate and then upsamples the representation to a higher frame rate to achieve temporal smoothness. We add temporal self-attention layers to the base LGM to help it learn consistency across time, and use a per-timestep multiview rendering loss to train the model. The upsampling is performed by an interpolation model trained to produce intermediate 3D Gaussian representations. We show that L4GM, although trained only on synthetic data, generalizes extremely well to in-the-wild videos, producing high-quality animated 3D assets.
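
As a rough illustration of the temporal self-attention idea mentioned in the abstract, the sketch below lets each feature token attend only to the same token at other timesteps. It is a minimal sketch under stated assumptions, not the L4GM implementation: the function name temporal_self_attention, the nested-list layout tokens[t][n], and the use of identity query/key/value projections (no learned weights, single head) are all hypothetical simplifications chosen for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(tokens):
    """Attend across time, independently for each token index.

    tokens[t][n] is the feature vector (list of floats) of token n at
    timestep t. Assumption: identity Q/K/V projections, one head.
    Returns a nested list with the same shape as the input.
    """
    T = len(tokens)
    N = len(tokens[0])
    C = len(tokens[0][0])
    scale = 1.0 / math.sqrt(C)

    out = [[None] * N for _ in range(T)]
    for n in range(N):
        # Sequence of this token's features over all timesteps.
        seq = [tokens[t][n] for t in range(T)]
        for t in range(T):
            # Scaled dot-product scores of timestep t against every timestep.
            scores = [
                scale * sum(q * k for q, k in zip(seq[t], seq[s]))
                for s in range(T)
            ]
            weights = softmax(scores)
            # Weighted average of the per-timestep features.
            out[t][n] = [
                sum(w * seq[s][c] for s, w in enumerate(weights))
                for c in range(C)
            ]
    return out
```

Because attention here mixes features only along the time axis, spatial and cross-view mixing would be left to the layers of the base model. A real layer would add learned projections, multiple heads, and a residual connection.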
