
VideoFrom3D: 3D Scene Video Generation via Complementary Image and Video Diffusion Models

September 22, 2025
Authors: Geonung Kim, Janghyeok Han, Sunghyun Cho
cs.AI

Abstract

In this paper, we propose VideoFrom3D, a novel framework for synthesizing high-quality 3D scene videos from coarse geometry, a camera trajectory, and a reference image. Our approach streamlines the 3D graphic design workflow, enabling flexible design exploration and rapid production of deliverables. A straightforward approach to synthesizing a video from coarse geometry might condition a video diffusion model on geometric structure. However, existing video diffusion models struggle to generate high-fidelity results for complex scenes due to the difficulty of jointly modeling visual quality, motion, and temporal consistency. To address this, we propose a generative framework that leverages the complementary strengths of image and video diffusion models. Specifically, our framework consists of a Sparse Anchor-view Generation (SAG) module and a Geometry-guided Generative Inbetweening (GGI) module. The SAG module generates high-quality, cross-view consistent anchor views using an image diffusion model, aided by Sparse Appearance-guided Sampling. Building on these anchor views, the GGI module faithfully interpolates intermediate frames using a video diffusion model, enhanced by flow-based camera control and structural guidance. Notably, both modules operate without any paired dataset of 3D scene models and natural images, which is extremely difficult to obtain. Comprehensive experiments show that our method produces high-quality, style-consistent scene videos under diverse and challenging scenarios, outperforming simple and extended baselines.
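To make the two-stage structure concrete, below is a minimal, heavily stubbed Python sketch of the pipeline as the abstract describes it. Every name here (videofrom3d, image_diffusion_sample, video_diffusion_inbetween), the anchor stride, and the exact conditioning signals are illustrative assumptions, not the authors' actual implementation or API; the diffusion calls are replaced with stubs.

```python
# Hypothetical sketch of VideoFrom3D's two-stage pipeline (SAG then GGI),
# based only on the abstract. All names and signatures are placeholders.

from typing import Any, List


def image_diffusion_sample(structure: Any, reference: Any, anchors: List[Any]) -> Any:
    """Stub for the SAG stage: an image diffusion model would generate one
    anchor view conditioned on structure rendered from the coarse geometry,
    guided by the reference image's appearance and by previously generated
    anchors (cross-view consistency via Sparse Appearance-guided Sampling)."""
    return {"structure": structure, "style": reference, "context": len(anchors)}


def video_diffusion_inbetween(first: Any, last: Any, cameras: List[Any]) -> List[Any]:
    """Stub for the GGI stage: a video diffusion model would interpolate the
    frames between two anchor views, steered by flow-based camera control and
    structural guidance rendered along the camera path."""
    return [first] + [{"between": (first, last), "cam": c} for c in cameras] + [last]


def videofrom3d(geometry: Any, trajectory: List[Any], reference: Any,
                stride: int = 8) -> List[Any]:
    # Stage 1 (SAG): generate sparse, mutually consistent anchor views at
    # every `stride`-th camera along the trajectory.
    anchor_cams = trajectory[::stride]
    anchors: List[Any] = []
    for cam in anchor_cams:
        structure = (geometry, cam)  # placeholder for a rendered depth/edge buffer
        anchors.append(image_diffusion_sample(structure, reference, anchors))

    # Stage 2 (GGI): interpolate the in-between frames for each anchor pair,
    # dropping each segment's last frame to avoid duplicating shared anchors.
    frames: List[Any] = []
    for i, (a0, a1) in enumerate(zip(anchors, anchors[1:])):
        inner_cams = trajectory[i * stride + 1 : (i + 1) * stride]
        frames += video_diffusion_inbetween(a0, a1, inner_cams)[:-1]
    frames.append(anchors[-1])
    return frames
```

The division of labor the abstract emphasizes is visible in this structure: the image model carries visual quality and cross-view consistency at a few sparse anchors, so the video model only has to solve the better-constrained inbetweening problem rather than jointly modeling quality, motion, and temporal consistency over the whole sequence.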