ChatPaper.ai


Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

July 17, 2025
作者: Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yifan Yang, Yujun Shen, Hujun Bao, Xiaowei Zhou
cs.AI

Abstract

This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the videos generated by these models often lack spatio-temporal consistency, which degrades view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoise the latent grid along the spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through the iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while keeping GPU memory consumption affordable. Experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method synthesizes high-quality and consistent novel-view videos and significantly outperforms existing approaches. See our project page for interactive demos and video results: https://diffuman4d.github.io/ .
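The alternating sliding-window scheme described in the abstract can be sketched as follows. This is a minimal illustrative skeleton under assumed details, not the authors' implementation: `denoise_window` is a hypothetical stand-in for one step of the real spatio-temporal diffusion model (which would condition on camera and human poses), and the grid shape, window size, and stride are arbitrary choices for illustration.

```python
import numpy as np

def denoise_window(latents):
    # Placeholder for one diffusion denoising step on a window of latents.
    # The real model would predict and remove noise conditioned on camera
    # and human poses; here we simply pull latents toward the window mean
    # so the sliding pattern is easy to follow.
    return 0.5 * latents + 0.5 * latents.mean(axis=(0, 1), keepdims=True)

def sliding_iterative_denoise(grid, window=4, stride=2, iters=4):
    """Alternately denoise a (views, frames, dim) latent grid along the
    spatial (view) and temporal (frame) axes with a sliding window.

    Each iteration sweeps one axis; alternating axes lets information
    propagate across the whole grid while only a window's worth of
    latents is processed at a time (bounding memory use).
    """
    V, T, _ = grid.shape
    for it in range(iters):
        spatial = (it % 2 == 0)          # even sweeps: view axis; odd: frame axis
        axis_len = V if spatial else T
        for start in range(0, max(axis_len - window, 0) + 1, stride):
            sl = slice(start, start + window)
            if spatial:
                grid[sl, :, :] = denoise_window(grid[sl, :, :])
            else:
                grid[:, sl, :] = denoise_window(grid[:, sl, :])
    return grid
```

With the smoothing placeholder, each sweep contracts the latents within overlapping windows, so repeated alternating sweeps spread information across all viewpoints and timestamps; in the actual method each window update would instead be a denoising step of the trained 4D diffusion model.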