
L4GM: Large 4D Gaussian Reconstruction Model

June 14, 2024
Authors: Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xiaohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, Huan Ling
cs.AI

Abstract

We present L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input in a single feed-forward pass that takes only a second. Key to our success is a novel dataset of multiview videos containing curated, rendered animated objects from Objaverse. The dataset depicts 44K diverse objects with 110K animations rendered from 48 viewpoints, resulting in 12M videos with a total of 300M frames. We keep L4GM simple for scalability and build directly on top of LGM, a pretrained 3D Large Reconstruction Model that outputs 3D Gaussian ellipsoids from multiview image input. L4GM outputs a per-frame 3D Gaussian Splatting representation from video frames sampled at a low fps, then upsamples the representation to a higher fps to achieve temporal smoothness. We add temporal self-attention layers to the base LGM to help it learn consistency across time, and train the model with a per-timestep multiview rendering loss. The upsampling to a higher framerate is performed by an interpolation model trained to produce intermediate 3D Gaussian representations. We showcase that L4GM, despite being trained only on synthetic data, generalizes extremely well to in-the-wild videos, producing high-quality animated 3D assets.
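To make the two training-time mechanisms in the abstract concrete, here is a minimal PyTorch sketch, not the authors' released code: a temporal self-attention layer that attends across frames on top of a per-frame backbone, and a per-timestep multiview rendering loss. All names below (TemporalSelfAttention, multiview_rendering_loss, render_fn) are hypothetical, and the differentiable Gaussian splatting renderer is assumed to be supplied externally.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class TemporalSelfAttention(nn.Module):
        """Attend across the time axis for each spatial token independently."""

        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, T, N, C) -- batch, frames, spatial tokens, channels.
            b, t, n, c = x.shape
            # Fold spatial tokens into the batch so attention runs over T only.
            h = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
            h = self.norm(h)
            h, _ = self.attn(h, h, h)
            h = h.reshape(b, n, t, c).permute(0, 2, 1, 3)
            # Residual connection: near-zero attention output preserves the
            # pretrained per-frame behavior of the base model.
            return x + h


    def multiview_rendering_loss(pred_gaussians, cameras, target_videos, render_fn):
        """Per-timestep multiview rendering loss (hypothetical helper).

        pred_gaussians: list of T per-frame Gaussian parameter tensors.
        cameras: V camera poses used for supervision.
        target_videos: (V, T, 3, H, W) ground-truth multiview renders.
        render_fn: differentiable Gaussian splatting renderer (assumed given).
        """
        loss = 0.0
        for t, gaussians_t in enumerate(pred_gaussians):
            for v, cam in enumerate(cameras):
                rendered = render_fn(gaussians_t, cam)  # (3, H, W)
                loss = loss + F.mse_loss(rendered, target_videos[v, t])
        return loss / (len(pred_gaussians) * len(cameras))

In a setup like the one the abstract describes, such temporal layers would be interleaved with the pretrained LGM blocks so the per-frame reconstruction weights are reused, and the same per-timestep loss could supervise the interpolation model that produces intermediate 3D Gaussian representations between low-fps frames.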
