GenXD: Generating Any 3D and 4D Scenes
November 4, 2024
Authors: Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, Jianfeng Wang, Gim Hee Lee, Lijuan Wang
cs.AI
Abstract
Recent developments in 2D visual generation have been remarkably successful.
However, 3D and 4D generation remain challenging in real-world applications due
to the lack of large-scale 4D data and effective model design. In this paper,
we propose to jointly investigate general 3D and 4D generation by leveraging
camera and object movements commonly observed in daily life. Due to the lack of
real-world 4D data in the community, we first propose a data curation pipeline
to obtain camera poses and object motion strength from videos. Based on this
pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K.
By leveraging all the 3D and 4D data, we develop our framework, GenXD, which
allows us to produce any 3D or 4D scene. We propose multiview-temporal modules,
which disentangle camera and object movements, to seamlessly learn from both 3D
and 4D data. Additionally, GenXD employs masked latent conditions to support a
variety of conditioning views. GenXD can generate videos that follow the camera
trajectory as well as consistent 3D views that can be lifted into 3D
representations. We perform extensive evaluations across various real-world and
synthetic datasets, demonstrating GenXD's effectiveness and versatility
compared to previous methods in 3D and 4D generation.
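The masked latent conditioning described above can be illustrated with a minimal sketch. This is not the authors' implementation; it only assumes the common formulation in which each view's encoded latent is paired with a binary mask channel marking whether that view is given as a condition, and non-condition latents are zeroed so the model must generate them. The function name and array layout are illustrative assumptions.

```python
import numpy as np

def masked_latent_condition(latents, cond_indices):
    """Build a masked latent condition for a sequence of view latents.

    latents      -- array of shape (num_views, C, H, W), the encoded latents
                    of all views in the sequence (hypothetical layout).
    cond_indices -- indices of the views available as conditioning views.

    Returns an array of shape (num_views, C + 1, H, W): condition views keep
    their latents, all other views are zeroed, and a binary mask channel is
    appended so the model can tell which views are given.
    """
    num_views, c, h, w = latents.shape
    # Binary mask: 1 for conditioning views, 0 for views to be generated.
    mask = np.zeros((num_views, 1, h, w), dtype=latents.dtype)
    mask[cond_indices] = 1.0
    # Zero out latents of views the model must generate.
    cond = latents * mask
    # Concatenate the mask as an extra channel.
    return np.concatenate([cond, mask], axis=1)
```

Because the number and positions of conditioning views only change `cond_indices`, the same interface covers single-image, few-view, and video-frame conditioning, which is how a masked design can support "a variety of conditioning views" within one model.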