Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion

Abstract

Current methods for 3D scene reconstruction from sparse posed images employ intermediate 3D representations, such as neural fields, voxel grids, or 3D Gaussians, to achieve multi-view consistent scene appearance and geometry. In this paper, we introduce MVGD, a diffusion-based architecture capable of direct pixel-level generation of images and depth maps from novel viewpoints, given an arbitrary number of input views. Our method uses raymap conditioning both to augment visual features with spatial information from different viewpoints and to guide the generation of images and depth maps from novel views. A key aspect of our approach is the multi-task generation of images and depth maps, using learnable task embeddings to guide the diffusion process towards specific modalities. We train this model on a collection of more than 60 million multi-view samples from publicly available datasets, and propose techniques to enable efficient and consistent learning in such diverse conditions. We also propose a novel strategy that enables the efficient training of larger models by incrementally fine-tuning smaller ones, with promising scaling behavior. Through extensive experiments, we report state-of-the-art results in multiple novel view synthesis benchmarks, as well as in multi-view stereo and video depth estimation.
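
To make the notion of a raymap concrete, the following minimal sketch computes per-pixel ray origins and directions for a single view. It is an illustration under stated assumptions, not the MVGD implementation: it assumes a pinhole camera with known intrinsics and a camera-to-world pose, and the function name compute_raymap and the 6-channel (origin, direction) layout are hypothetical choices made for this example.

```python
import numpy as np

def compute_raymap(K, cam_to_world, height, width):
    """Return per-pixel ray origins and unit directions in world coordinates.

    Assumptions (not from the paper): K is a (3, 3) pinhole intrinsics matrix,
    cam_to_world is a (4, 4) camera-to-world pose, and the output has shape
    (height, width, 6) with the origin (3) followed by the direction (3).
    """
    # Pixel centers in homogeneous image coordinates (u, v, 1).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project through the inverse intrinsics to camera-frame directions.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate into the world frame and normalize to unit length.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Every pixel of a pinhole camera shares the camera center as its origin.
    origins = np.broadcast_to(t, dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)

# Example: a 64x48 view whose principal point sits on a pixel center.
K = np.array([[100.0, 0.0, 32.5],
              [0.0, 100.0, 24.5],
              [0.0, 0.0, 1.0]])
raymap = compute_raymap(K, np.eye(4), height=48, width=64)
```

Because the raymap depends only on camera parameters, it can be computed for a target viewpoint that has no image yet, which is what allows it to serve as a conditioning signal for novel views as well as a spatial annotation for the input views.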

Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion
