L4GM: Large 4D Gaussian Reconstruction Model
June 14, 2024
Authors: Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xiaohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, Huan Ling
cs.AI
Abstract
We present L4GM, the first 4D Large Reconstruction Model that produces
animated objects from a single-view video input -- in a single feed-forward
pass that takes only a second. Key to our success is a novel dataset of
multiview videos containing curated, rendered animated objects from Objaverse.
This dataset depicts 44K diverse objects with 110K animations rendered in 48
viewpoints, resulting in 12M videos with a total of 300M frames. We keep our
L4GM simple for scalability and build directly on top of LGM, a pretrained 3D
Large Reconstruction Model that outputs 3D Gaussian ellipsoids from multiview
image input. L4GM outputs a per-frame 3D Gaussian Splatting representation from
video frames sampled at a low fps and then upsamples the representation to a
higher fps to achieve temporal smoothness. We add temporal self-attention
layers to the base LGM to help it learn consistency across time, and utilize a
per-timestep multiview rendering loss to train the model. The representation is
upsampled to a higher framerate by training an interpolation model which
produces intermediate 3D Gaussian representations. We showcase that L4GM,
trained only on synthetic data, generalizes extremely well to in-the-wild
videos, producing high-quality animated 3D assets.
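
The abstract's core architectural change, temporal self-attention inserted into the pretrained per-frame backbone, can be illustrated with a minimal PyTorch sketch. This is not the authors' code: the tensor layout (batch, time, tokens, channels), the module name, and the zero-initialized output projection (so the pretrained LGM behavior is unchanged at the start of fine-tuning) are all assumptions.

```python
# Minimal sketch (assumed, not the authors' implementation) of a temporal
# self-attention block that could be inserted into a pretrained per-frame
# backbone such as LGM.
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Attends across the time axis independently at each spatial token."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-init the output projection so the block acts as an identity at
        # initialization, preserving the pretrained LGM (an assumed but common
        # trick when adding new layers to a pretrained model).
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, tokens, dim) per-frame features; layout is assumed.
        b, t, n, d = x.shape
        # Fold spatial tokens into the batch so attention sees only the time axis.
        h = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        out, _ = self.attn(self.norm(h), self.norm(h), self.norm(h))
        h = h + out  # residual connection around the added layer
        return h.reshape(b, n, t, d).permute(0, 2, 1, 3)
```

Because the attention runs over time only, each spatial token exchanges information with its counterparts in other frames, which is one plausible way to realize the cross-time consistency the abstract describes.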
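The per-timestep multiview rendering loss can be sketched similarly. The renderer interface, the plain L2 image term, and the tensor shapes are assumptions (models in this family often combine an L2 term with a perceptual term); `render_fn` stands in for a differentiable Gaussian splatting renderer.

```python
# Hypothetical sketch of a per-timestep multiview rendering loss: the
# predicted Gaussians for every frame are splatted into every supervision
# view and compared against the ground-truth renders.
import torch
import torch.nn.functional as F

def multiview_rendering_loss(render_fn, gaussians_per_frame, gt_videos, cameras):
    """
    gaussians_per_frame: length-T list of per-frame Gaussian parameter tensors
    gt_videos: (V, T, 3, H, W) ground-truth frames from V supervision viewpoints
    cameras: length-V list of camera poses
    """
    total = 0.0
    for t, gaussians in enumerate(gaussians_per_frame):
        for v, camera in enumerate(cameras):
            pred = render_fn(gaussians, camera)  # (3, H, W) rendered image
            total = total + F.mse_loss(pred, gt_videos[v, t])
    return total / (len(gaussians_per_frame) * len(cameras))
```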
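Finally, the fps-upsampling step could look roughly like the following: a learned interpolation network predicts intermediate per-frame Gaussians between consecutive low-fps keyframes. `InterpNet`, the pointwise MLP, and the 14-channel Gaussian parameterization (position, opacity, scale, rotation, color, as in LGM) are illustrative assumptions, not the paper's actual interpolation model.

```python
# Illustrative sketch (assumptions throughout) of upsampling the low-fps
# reconstruction: a small network predicts one intermediate set of Gaussian
# parameters between each pair of consecutive keyframes.
import torch
import torch.nn as nn

class InterpNet(nn.Module):
    """Predicts midpoint Gaussian parameters from two keyframe Gaussians."""

    def __init__(self, param_dim: int = 14):
        # 14 channels per Gaussian (xyz, opacity, scale, rotation, rgb), as in LGM.
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * param_dim, 128),
            nn.SiLU(),
            nn.Linear(128, param_dim),
        )

    def forward(self, g0: torch.Tensor, g1: torch.Tensor) -> torch.Tensor:
        # g0, g1: (N, param_dim) Gaussians at two consecutive low-fps frames.
        return self.mlp(torch.cat([g0, g1], dim=-1))

def upsample_2x(keyframes, net):
    """Doubles the framerate by inserting one predicted frame per keyframe pair."""
    frames = [keyframes[0]]
    for g0, g1 in zip(keyframes[:-1], keyframes[1:]):
        frames.append(net(g0, g1))  # predicted intermediate 3D Gaussians
        frames.append(g1)
    return frames
```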