Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

Abstract

This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods address the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the videos generated by these models often lack spatio-temporal consistency, which degrades view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, alternately denoise the latent grid along the spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through the iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while keeping GPU memory consumption affordable. Experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method synthesizes high-quality, consistent novel-view videos and significantly outperforms existing approaches. See our project page for interactive demos and video results: https://diffuman4d.github.io/
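
The alternating sliding-window schedule described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: `toy_denoise` is a hypothetical stand-in for one joint denoising step of a diffusion model (it simply pulls the latents in a window toward their mean), and the wrap-around windows, the per-iteration window shift, and all parameter names are assumptions made for illustration only.

```python
import numpy as np


def toy_denoise(window_latents, axis, strength=0.5):
    """Hypothetical stand-in for one joint denoising step.

    Pulls every latent in the window toward the window mean along `axis`,
    so that latents inside one window exchange information.
    """
    mean = window_latents.mean(axis=axis, keepdims=True)
    return window_latents + strength * (mean - window_latents)


def sliding_iterative_denoise(grid, num_steps, window, stride):
    """Alternately process a (views, timestamps, dim) latent grid.

    Even steps slide a window over the view (spatial) axis, odd steps over
    the time axis. The window start shifts by `stride` between passes, so
    neighbouring windows overlap across iterations (assumed schedule).
    """
    grid = grid.copy()
    num_views, num_times, _ = grid.shape
    for step in range(num_steps):
        spatial = step % 2 == 0
        length = num_views if spatial else num_times
        assert window <= length, "window must fit inside the axis"
        offset = (step // 2) * stride
        for start in range(0, length, window):
            # Indices wrap around the axis (an assumption of this sketch).
            idx = (offset + start + np.arange(window)) % length
            if spatial:
                # Only `window` views are denoised jointly at a time.
                grid[idx] = toy_denoise(grid[idx], axis=0)
            else:
                # Only `window` timestamps are denoised jointly at a time.
                grid[:, idx] = toy_denoise(grid[:, idx], axis=1)
    return grid


if __name__ == "__main__":
    # Put a single non-zero latent into an otherwise empty 4-view, 6-frame grid
    # and report how much of the grid it has reached after a few iterations.
    impulse = np.zeros((4, 6, 1))
    impulse[0, 0, 0] = 1.0
    spread = sliding_iterative_denoise(impulse, num_steps=6, window=2, stride=1)
    print("fraction of grid cells reached:", float((spread > 0).mean()))
```

Each call to `toy_denoise` only sees `window` latents along the active axis, which mirrors the memory argument in the abstract, while the shifting windows let information propagate across the whole grid over several iterations. The sketch vectorises over the inactive axis for brevity; a real model would more plausibly treat each window as a separate batch item.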
