

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

November 7, 2024
Authors: Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, Yikai Wang
cs.AI

Abstract

In this paper, we introduce DimensionX, a framework designed to generate photorealistic 3D and 4D scenes from just a single image with video diffusion. Our approach begins with the insight that both the spatial structure of a 3D scene and the temporal evolution of a 4D scene can be effectively represented through sequences of video frames. While recent video diffusion models have shown remarkable success in producing vivid visuals, they face limitations in directly recovering 3D/4D scenes due to limited spatial and temporal controllability during generation. To overcome this, we propose ST-Director, which decouples spatial and temporal factors in video diffusion by learning dimension-aware LoRAs from dimension-variant data. This controllable video diffusion approach enables precise manipulation of spatial structure and temporal dynamics, allowing us to reconstruct both 3D and 4D representations from sequential frames with the combination of spatial and temporal dimensions. Additionally, to bridge the gap between generated videos and real-world scenes, we introduce a trajectory-aware mechanism for 3D generation and an identity-preserving denoising strategy for 4D generation. Extensive experiments on various real-world and synthetic datasets demonstrate that DimensionX achieves superior results in controllable video generation, as well as in 3D and 4D scene generation, compared with previous methods.
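The core idea behind ST-Director — separate low-rank adapters for the spatial and temporal dimensions, selected at inference to control what varies in the generated frames — can be sketched in miniature. The following is an illustrative toy, not the authors' implementation: `LoRALinear`, the adapter names, shapes, and merge rule are all assumptions made for demonstration, applied to a single linear layer rather than a full video diffusion backbone.

```python
import numpy as np

# Toy sketch (NOT the paper's code): a frozen base weight with two
# dimension-aware LoRA adapters. The "spatial"/"temporal" names mirror the
# ST-Director idea; dimensions and scaling are purely illustrative.

class LoRALinear:
    def __init__(self, d_in, d_out, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        # frozen base weight, as in standard LoRA fine-tuning
        self.W = rng.standard_normal((d_out, d_in)) * 0.02
        # one low-rank (B, A) pair per controlled dimension
        self.adapters = {
            name: (rng.standard_normal((d_out, rank)) * 0.02,  # B: d_out x r
                   rng.standard_normal((rank, d_in)) * 0.02)   # A: r x d_in
            for name in ("spatial", "temporal")
        }

    def forward(self, x, director=None, scale=1.0):
        """Apply the base weight, plus one dimension-aware LoRA if selected."""
        W = self.W
        if director is not None:
            B, A = self.adapters[director]
            W = W + scale * (B @ A)  # low-rank update: W' = W + s * B A
        return W @ x

layer = LoRALinear(d_in=8, d_out=8)
x = np.ones(8)
y_base = layer.forward(x)                          # uncontrolled generation
y_spatial = layer.forward(x, director="spatial")   # vary camera / space only
y_temporal = layer.forward(x, director="temporal") # vary motion / time only
```

Because each adapter is trained (in the paper, on dimension-variant data) to move the output along only one axis of variation, swapping which adapter is active at inference gives the controllability needed to produce the camera-orbit or fixed-viewpoint frame sequences from which 3D and 4D representations are then reconstructed.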