

VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling

March 20, 2025
Authors: Hyojun Go, Byeongjun Park, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, Changick Kim
cs.AI

Abstract

We propose VideoRFSplat, a direct text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. To generate diverse camera poses and unbounded spatial extent of real-world scenes, while ensuring generalization to arbitrary text prompts, previous methods fine-tune 2D generative models to jointly model camera poses and multi-view images. However, these methods suffer from instability when extending 2D generative models to joint modeling due to the modality gap, which necessitates additional models to stabilize training and inference. In this work, we propose an architecture and a sampling strategy to jointly model multi-view images and camera poses when fine-tuning a video generation model. Our core idea is a dual-stream architecture that attaches a dedicated pose generation model alongside a pre-trained video generation model via communication blocks, generating multi-view images and camera poses through separate streams. This design reduces interference between the pose and image modalities. Additionally, we propose an asynchronous sampling strategy that denoises camera poses faster than multi-view images, allowing rapidly denoised poses to condition multi-view generation, reducing mutual ambiguity and enhancing cross-modal consistency. Trained on multiple large-scale real-world datasets (RealEstate10K, MVImgNet, DL3DV-10K, ACID), VideoRFSplat outperforms existing text-to-3D direct generation methods that heavily depend on post-hoc refinement via score distillation sampling, achieving superior results without such refinement.
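The dual-stream design can be made concrete as two parallel transformer stacks joined by lightweight bridges. Below is a minimal PyTorch sketch under stated assumptions: the module names (`CommunicationBlock`, `DualStreamDenoiser`), the choice of bidirectional cross-attention as the communication mechanism, and all dimensions are illustrative, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class CommunicationBlock(nn.Module):
    """Bidirectional cross-attention bridge between the image and pose
    streams (an assumed realization of the paper's communication block)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.img_to_pose = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pose_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_tokens, pose_tokens):
        # Each stream queries the other; residual connections keep the
        # pre-trained video stream close to its original behavior.
        pose_upd, _ = self.img_to_pose(pose_tokens, img_tokens, img_tokens)
        img_upd, _ = self.pose_to_img(img_tokens, pose_tokens, pose_tokens)
        return img_tokens + img_upd, pose_tokens + pose_upd

class DualStreamDenoiser(nn.Module):
    """Pre-trained video stream plus a dedicated pose stream; the two
    modalities exchange information only through communication blocks,
    which limits interference between them."""
    def __init__(self, dim: int = 512, depth: int = 6):
        super().__init__()
        self.video_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(depth))
        self.pose_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(depth))
        self.bridges = nn.ModuleList(CommunicationBlock(dim) for _ in range(depth))

    def forward(self, img_tokens, pose_tokens):
        for video_layer, pose_layer, bridge in zip(
                self.video_layers, self.pose_layers, self.bridges):
            img_tokens = video_layer(img_tokens)
            pose_tokens = pose_layer(pose_tokens)
            img_tokens, pose_tokens = bridge(img_tokens, pose_tokens)
        return img_tokens, pose_tokens

# Illustrative usage with made-up token counts.
model = DualStreamDenoiser()
imgs = torch.randn(2, 256, 512)   # (batch, image tokens, dim)
poses = torch.randn(2, 8, 512)    # (batch, one token per camera pose, dim)
img_out, pose_out = model(imgs, poses)
```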
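The asynchronous sampling strategy can likewise be sketched as two denoising schedules advancing at different rates. In the sketch below, `denoiser` is a hypothetical callable returning velocity predictions for both modalities, and the plain Euler update under a rectified-flow-style schedule is an assumption; the paper's exact sampler and parameterization may differ. The key point is that the pose schedule reaches t = 0 early, after which the nearly clean poses act as a fixed condition for the remaining image denoising steps.

```python
import torch

@torch.no_grad()
def asynchronous_sampling(denoiser, img_latents, pose_latents,
                          num_steps: int = 50, pose_speedup: float = 2.0):
    """Denoise camera poses on a faster schedule than multi-view images.

    `denoiser(img, pose, t_img, t_pose)` is a hypothetical interface that
    returns (img_velocity, pose_velocity) predictions for both streams.
    """
    # Shared grid of noise levels from t=1 (pure noise) down to t=0 (clean).
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    t_pose = ts[0]
    for i in range(num_steps):
        t_img, t_img_next = ts[i], ts[i + 1]
        # The pose schedule advances `pose_speedup`x faster, clamped at t=0.
        j = min(int((i + 1) * pose_speedup), num_steps)
        t_pose_next = ts[j]
        img_vel, pose_vel = denoiser(img_latents, pose_latents, t_img, t_pose)
        # Euler updates; once t_pose reaches 0 its step size is 0, so the
        # clean poses simply condition the remaining image steps.
        img_latents = img_latents - (t_img - t_img_next) * img_vel
        pose_latents = pose_latents - (t_pose - t_pose_next) * pose_vel
        t_pose = t_pose_next
    return img_latents, pose_latents
```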

