

Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

July 17, 2025
Authors: Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yifan Yang, Yujun Shen, Hujun Bao, Xiaowei Zhou
cs.AI

Abstract
This paper addresses the challenge of high-fidelity human view synthesis from sparse-view videos. Previous methods tackle the problem of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the videos generated by these models often lack spatio-temporal consistency, which degrades view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp; we then alternately denoise the latent grid along the spatial and temporal dimensions with a sliding window, and finally decode the videos at the target viewpoints from the corresponding denoised latents. Through this iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhancing the 4D consistency of the output while keeping GPU memory consumption affordable. Experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method synthesizes high-quality, consistent novel-view videos and significantly outperforms existing approaches. See our project page for interactive demos and video results: https://diffuman4d.github.io/ .
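The alternating sliding-window scheme described above can be illustrated with a minimal toy sketch. This is not the authors' implementation: the function names, window/stride values, and the stand-in `toy_denoiser` (which merely shrinks latents toward zero) are all illustrative assumptions; a real 4D diffusion model would apply a learned partial denoising step conditioned on camera and human poses.

```python
import numpy as np

def sliding_iterative_denoise(latents, denoise_fn, window=4, stride=2, n_iters=3):
    """Alternately denoise a (views, frames, dim) latent grid along the
    spatial (view) and temporal (frame) axes with a sliding window.

    denoise_fn(block) stands in for one partial denoising step of a
    spatio-temporal diffusion model. Overlapping windows (stride < window)
    let information propagate across the whole grid over iterations,
    while each call only touches a small block, bounding memory use.
    """
    for it in range(n_iters):
        axis = it % 2                       # alternate: 0 = spatial, 1 = temporal
        size = latents.shape[axis]
        for start in range(0, max(size - window, 0) + 1, stride):
            sl = [slice(None), slice(None)]
            sl[axis] = slice(start, start + window)
            latents[tuple(sl)] = denoise_fn(latents[tuple(sl)])
    return latents

# Toy partial denoiser: each call removes half of the remaining "noise".
toy_denoiser = lambda x: 0.5 * x

grid = np.random.randn(8, 16, 4)            # 8 viewpoints x 16 frames x 4-dim latents
out = sliding_iterative_denoise(grid.copy(), toy_denoiser)
print(out.shape)                            # (8, 16, 4)
```

Because consecutive windows overlap, every latent is denoised at least once per pass, so repeated passes along alternating axes couple all viewpoints and timestamps without ever loading the full grid into one model call.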