

VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling

March 20, 2025
Authors: Hyojun Go, Byeongjun Park, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, Changick Kim
cs.AI

Abstract

We propose VideoRFSplat, a direct text-to-3D model that leverages a video generation model to produce realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. To generate diverse camera poses and the unbounded spatial extent of real-world scenes while ensuring generalization to arbitrary text prompts, previous methods fine-tune 2D generative models to jointly model camera poses and multi-view images. However, these methods suffer from instability when extending 2D generative models to joint modeling, owing to the modality gap, and therefore require additional models to stabilize training and inference. In this work, we propose an architecture and a sampling strategy for jointly modeling multi-view images and camera poses when fine-tuning a video generation model. Our core idea is a dual-stream architecture that attaches a dedicated pose generation model alongside a pre-trained video generation model via communication blocks, generating multi-view images and camera poses through separate streams. This design reduces interference between the pose and image modalities. Additionally, we propose an asynchronous sampling strategy that denoises camera poses faster than multi-view images, allowing the rapidly denoised poses to condition multi-view generation, which reduces mutual ambiguity and enhances cross-modal consistency. Trained on multiple large-scale real-world datasets (RealEstate10K, MVImgNet, DL3DV-10K, ACID), VideoRFSplat outperforms existing direct text-to-3D generation methods that depend heavily on post-hoc refinement via score distillation sampling, achieving superior results without such refinement.
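The dual-stream idea can be pictured concretely. Below is a minimal PyTorch sketch, not the paper's released code: the class names (`CommunicationBlock`, `DualStreamDenoiser`), the use of plain transformer encoder layers as stand-ins for the video backbone, and the 9-dimensional per-view pose parameterization are all illustrative assumptions. The point it demonstrates is structural: each modality keeps its own transformer stream, and the two streams interact only through dedicated cross-attention communication blocks, which is what limits interference between pose and image features.

```python
import torch
import torch.nn as nn


class CommunicationBlock(nn.Module):
    """Bidirectional cross-attention through which the two streams
    exchange information; in this sketch it is their only interaction."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.pose_reads_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_reads_pose = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_pose = nn.LayerNorm(dim)

    def forward(self, img_tokens, pose_tokens):
        # Each stream attends to the other's tokens and adds the result
        # back residually, so backbone features stay largely intact.
        pose_upd, _ = self.pose_reads_img(self.norm_pose(pose_tokens),
                                          img_tokens, img_tokens)
        img_upd, _ = self.img_reads_pose(self.norm_img(img_tokens),
                                         pose_tokens, pose_tokens)
        return img_tokens + img_upd, pose_tokens + pose_upd


class DualStreamDenoiser(nn.Module):
    """Hypothetical dual-stream denoiser: a pose stream attached alongside
    a video backbone. The TransformerEncoderLayer stacks stand in for the
    pre-trained video model's blocks and the dedicated pose model's blocks."""

    def __init__(self, dim: int = 512, num_layers: int = 4, pose_dim: int = 9):
        super().__init__()
        self.video_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, 8, batch_first=True)
            for _ in range(num_layers))
        self.pose_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, 8, batch_first=True)
            for _ in range(num_layers))
        self.comm_blocks = nn.ModuleList(
            CommunicationBlock(dim) for _ in range(num_layers))
        self.pose_in = nn.Linear(pose_dim, dim)   # noisy per-view pose -> tokens
        self.pose_out = nn.Linear(dim, pose_dim)  # tokens -> pose prediction

    def forward(self, img_tokens, noisy_poses):
        # img_tokens: (B, num_img_tokens, dim); noisy_poses: (B, num_views, pose_dim)
        pose_tokens = self.pose_in(noisy_poses)
        for vid_blk, pose_blk, comm in zip(self.video_blocks,
                                           self.pose_blocks, self.comm_blocks):
            img_tokens = vid_blk(img_tokens)      # image stream
            pose_tokens = pose_blk(pose_tokens)   # separate pose stream
            img_tokens, pose_tokens = comm(img_tokens, pose_tokens)
        return img_tokens, self.pose_out(pose_tokens)
```

A design note on this layout: because the pose stream is a parallel add-on rather than extra channels fed into the video backbone, the pre-trained video weights can in principle be fine-tuned gently (or partially frozen) while the pose stream trains from scratch.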
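The asynchronous sampling strategy can likewise be sketched as a short loop. This is a simplified illustration under stated assumptions, not the paper's exact sampler: the `model` interface, the linear flow-style Euler update, and the `pose_speedup` factor are hypothetical. The essential behavior it reproduces is that the pose noise level decreases faster than the image noise level, so every image update is conditioned on progressively cleaner poses.

```python
import torch


@torch.no_grad()
def asynchronous_sampling(model, img_latents, poses,
                          num_steps: int = 50, pose_speedup: float = 1.5):
    """Minimal sketch of asynchronous denoising: camera poses follow a
    faster schedule than the multi-view images (pose_speedup > 1), so the
    dual-stream model sees increasingly clean poses at each image step."""
    dt_img = 1.0 / num_steps
    dt_pose = pose_speedup / num_steps
    for i in range(num_steps):
        t_img = 1.0 - i * dt_img              # image noise level in [0, 1]
        t_pose = max(0.0, 1.0 - i * dt_pose)  # pose noise level drops faster
        # Joint prediction; both modalities see both noise levels.
        img_vel, pose_vel = model(img_latents, poses, t_img=t_img, t_pose=t_pose)
        img_latents = img_latents - dt_img * img_vel          # Euler step
        if t_pose > 0.0:                                       # poses finish early,
            poses = poses - min(dt_pose, t_pose) * pose_vel    # then only condition
    return img_latents, poses
```

Once `t_pose` reaches zero, the poses are fully denoised and act purely as conditioning for the remaining image steps, which is what reduces the mutual ambiguity the abstract describes.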

