DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

November 7, 2024
Authors: Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, Yikai Wang
cs.AI

Abstract

In this paper, we introduce DimensionX, a framework designed to generate photorealistic 3D and 4D scenes from just a single image using video diffusion. Our approach begins with the insight that both the spatial structure of a 3D scene and the temporal evolution of a 4D scene can be effectively represented through sequences of video frames. While recent video diffusion models have shown remarkable success in producing vivid visuals, they struggle to directly recover 3D/4D scenes because of limited spatial and temporal controllability during generation. To overcome this, we propose ST-Director, which decouples spatial and temporal factors in video diffusion by learning dimension-aware LoRAs from dimension-variant data. This controllable video diffusion approach enables precise manipulation of spatial structure and temporal dynamics, allowing us to reconstruct both 3D and 4D representations from sequential frames by combining the spatial and temporal dimensions. Additionally, to bridge the gap between generated videos and real-world scenes, we introduce a trajectory-aware mechanism for 3D generation and an identity-preserving denoising strategy for 4D generation. Extensive experiments on various real-world and synthetic datasets demonstrate that DimensionX achieves superior results in controllable video generation, as well as in 3D and 4D scene generation, compared with previous methods.
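
The central mechanism the abstract describes, ST-Director, amounts to a pair of dimension-aware LoRA adapters trained on dimension-variant data: one on spatial-variant clips (camera moves, scene frozen) and one on temporal-variant clips (camera fixed, scene moves). The sketch below is a minimal, hypothetical PyTorch illustration of how such switchable LoRA "directors" could be attached to a frozen projection layer in a video diffusion backbone. All class names, ranks, and the mode-switching interface are assumptions for illustration; this is not DimensionX's actual code.

```python
import torch
import torch.nn as nn

class DimensionLoRA(nn.Module):
    """Standard low-rank adapter: delta(x) = (alpha / rank) * up(down(x))."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.down = nn.Linear(in_dim, rank, bias=False)   # A: in_dim -> rank
        self.up = nn.Linear(rank, out_dim, bias=False)    # B: rank -> out_dim
        nn.init.zeros_(self.up.weight)                    # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x)) * self.scale

class DirectedProjection(nn.Module):
    """A frozen base projection with two switchable dimension-aware LoRAs.

    Mode "s" stands in for an S-Director trained on camera-motion-only
    (spatial-variant) clips; mode "t" for a T-Director trained on
    fixed-camera, dynamics-only (temporal-variant) clips.
    """
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base.requires_grad_(False)  # base diffusion weights stay frozen
        dims = (base.in_features, base.out_features)
        self.directors = nn.ModuleDict({
            "s": DimensionLoRA(*dims),  # spatial director
            "t": DimensionLoRA(*dims),  # temporal director
        })
        self.mode = "s"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The active LoRA runs in parallel with the frozen projection.
        return self.base(x) + self.directors[self.mode](x)

# Hypothetical usage: wrap an attention projection inside the denoiser,
# then flip `mode` per sampling pass to steer the generated frames
# toward pure spatial (camera) or pure temporal (motion) variation.
proj = DirectedProjection(nn.Linear(320, 320))
latents = torch.randn(2, 77, 320)   # (batch, tokens, channels)
proj.mode = "t"                     # temporal control
out = proj(latents)
print(out.shape)                    # torch.Size([2, 77, 320])
```

Zero-initializing the up-projection keeps the pretrained model's behavior intact at the start of fine-tuning, which matches common LoRA practice; how DimensionX blends or schedules the two directors during denoising to combine spatial and temporal control is beyond this sketch.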