

HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation

April 30, 2025
作者: Haiyang Zhou, Wangbo Yu, Jiawen Guan, Xinhua Cheng, Yonghong Tian, Li Yuan
cs.AI

Abstract

The rapid advancement of diffusion models holds the promise of revolutionizing the application of VR and AR technologies, which typically require scene-level 4D assets to enhance the user experience. Nonetheless, existing diffusion models predominantly concentrate on modeling static 3D scenes or object-level dynamics, constraining their capacity to provide truly immersive experiences. To address this issue, we propose HoloTime, a framework that integrates video diffusion models to generate panoramic videos from a single prompt or reference image, along with a 360-degree 4D scene reconstruction method that seamlessly transforms the generated panoramic video into 4D assets, enabling a fully immersive 4D experience for users. Specifically, to tame video diffusion models for generating high-fidelity panoramic videos, we introduce the 360World dataset, the first comprehensive collection of panoramic videos suitable for downstream 4D scene reconstruction tasks. With this curated dataset, we propose Panoramic Animator, a two-stage image-to-video diffusion model that can convert panoramic images into high-quality panoramic videos. Following this, we present Panoramic Space-Time Reconstruction, which leverages a space-time depth estimation method to transform the generated panoramic videos into 4D point clouds, enabling the optimization of a holistic 4D Gaussian Splatting representation to reconstruct spatially and temporally consistent 4D scenes. To validate the efficacy of our method, we conducted a comparative analysis with existing approaches, revealing its superiority in both panoramic video generation and 4D scene reconstruction. This demonstrates our method's capability to create more engaging and realistic immersive environments, thereby enhancing user experiences in VR and AR applications.
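The pipeline described above can be sketched as three stages: panoramic video generation, space-time depth estimation, and back-projection of each frame into a point cloud that would then seed a 4D Gaussian Splatting optimization. The sketch below is a minimal illustration of that dataflow only; every function name is a hypothetical stand-in (the diffusion model and depth estimator are stubbed), not the authors' actual API.

```python
import numpy as np

# Hypothetical sketch of the HoloTime dataflow. The diffusion model and
# depth estimator are stubbed; only the back-projection geometry is real.

def panoramic_animator(pano_image, num_frames=8):
    """Stage 1: image-to-video generation (stubbed as frame replication)."""
    return np.stack([pano_image] * num_frames)  # (T, H, W, 3)

def space_time_depth(frames):
    """Stage 2: per-frame depth estimation (stubbed as unit depth)."""
    t, h, w, _ = frames.shape
    return np.ones((t, h, w))  # (T, H, W)

def frames_to_point_clouds(frames, depths):
    """Stage 3: back-project equirectangular pixels into per-frame points.
    Each pixel column/row maps to an azimuth/elevation on the sphere."""
    t, h, w, _ = frames.shape
    lon = (np.arange(w) / w) * 2 * np.pi - np.pi      # azimuth in [-pi, pi)
    lat = (np.arange(h) / h) * np.pi - np.pi / 2      # elevation in [-pi/2, pi/2)
    lon, lat = np.meshgrid(lon, lat)
    dirs = np.stack([np.cos(lat) * np.sin(lon),        # unit ray directions
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)  # (H, W, 3)
    # Scale each unit ray by its estimated depth, one cloud per timestep.
    return [dirs * depths[i][..., None] for i in range(t)]

pano = np.zeros((4, 8, 3))              # tiny dummy equirectangular image
frames = panoramic_animator(pano)       # (8, 4, 8, 3)
depths = space_time_depth(frames)       # (8, 4, 8)
clouds = frames_to_point_clouds(frames, depths)
print(len(clouds), clouds[0].shape)     # 8 (4, 8, 3)
```

In the actual method these per-frame point clouds would serve as the initialization for a holistic 4D Gaussian Splatting representation; the optimization step itself is beyond the scope of this sketch.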

