

HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation

April 30, 2025
Authors: Haiyang Zhou, Wangbo Yu, Jiawen Guan, Xinhua Cheng, Yonghong Tian, Li Yuan
cs.AI

Abstract

The rapid advancement of diffusion models holds the promise of revolutionizing the application of VR and AR technologies, which typically require scene-level 4D assets for user experience. Nonetheless, existing diffusion models predominantly concentrate on modeling static 3D scenes or object-level dynamics, constraining their capacity to provide truly immersive experiences. To address this issue, we propose HoloTime, a framework that integrates video diffusion models to generate panoramic videos from a single prompt or reference image, along with a 360-degree 4D scene reconstruction method that seamlessly transforms the generated panoramic video into 4D assets, enabling a fully immersive 4D experience for users. Specifically, to tame video diffusion models for generating high-fidelity panoramic videos, we introduce the 360World dataset, the first comprehensive collection of panoramic videos suitable for downstream 4D scene reconstruction tasks. With this curated dataset, we propose Panoramic Animator, a two-stage image-to-video diffusion model that can convert panoramic images into high-quality panoramic videos. Following this, we present Panoramic Space-Time Reconstruction, which leverages a space-time depth estimation method to transform the generated panoramic videos into 4D point clouds, enabling the optimization of a holistic 4D Gaussian Splatting representation to reconstruct spatially and temporally consistent 4D scenes. To validate the efficacy of our method, we conducted a comparative analysis with existing approaches, revealing its superiority in both panoramic video generation and 4D scene reconstruction. This demonstrates our method's capability to create more engaging and realistic immersive environments, thereby enhancing user experiences in VR and AR applications.
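The Panoramic Space-Time Reconstruction stage lifts the generated panoramic video into a 4D point cloud by combining per-frame depth with the known equirectangular camera geometry. The sketch below shows only that standard back-projection step, assuming metric depth per pixel and a scalar timestamp per frame; the paper's own space-time depth estimator and the subsequent 4D Gaussian Splatting optimization are not reproduced here.

```python
import numpy as np

def equirect_backproject(depth, t):
    """Lift one equirectangular depth frame into 4D (x, y, z, t) points.

    depth: (H, W) per-pixel depth for a panoramic frame (assumed metric).
    t: timestamp of the frame.
    Returns an (H*W, 4) array of space-time points.

    Illustrative sketch only: HoloTime's actual reconstruction uses its
    own space-time depth estimation and optimizes a holistic 4D Gaussian
    Splatting representation on top of such point clouds.
    """
    H, W = depth.shape
    # Pixel centers -> spherical angles (longitude, latitude).
    u = (np.arange(W) + 0.5) / W            # horizontal coord in [0, 1)
    v = (np.arange(H) + 0.5) / H            # vertical coord in (0, 1)
    lon = (u * 2.0 - 1.0) * np.pi           # longitude in [-pi, pi)
    lat = (0.5 - v) * np.pi                 # latitude in (-pi/2, pi/2)
    lon, lat = np.meshgrid(lon, lat)        # each (H, W)
    # Unit ray directions on the sphere for every pixel.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    dirs = np.stack([x, y, z], axis=-1)     # (H, W, 3), unit norm
    # Scale each ray by its depth, then attach the frame timestamp.
    pts = (dirs * depth[..., None]).reshape(-1, 3)
    time = np.full((pts.shape[0], 1), t)
    return np.concatenate([pts, time], axis=1)
```

Running this over every frame of the panoramic video (with per-frame depth maps) yields the space-time point cloud that initializes the 4D scene representation.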


May 7, 2025