Controlling Space and Time with Diffusion Models

Abstract

We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), conditioned on one or more images of a general scene and a set of camera poses and timestamps. To overcome the limited availability of 4D training data, we advocate joint training on 3D (with camera pose), 4D (pose+time), and video (time but no pose) data, and propose a new architecture that enables it. We further advocate calibrating SfM-posed data with monocular metric depth estimators to obtain metric-scale camera control. For model evaluation, we introduce new metrics that enrich current evaluation schemes and overcome their shortcomings, demonstrating state-of-the-art results in both fidelity and pose control compared to existing diffusion models for 3D NVS, while also adding the ability to handle temporal dynamics. 4DiM is also used for improved panorama stitching, pose-conditioned video-to-video translation, and several other tasks. For an overview, see https://4d-diffusion.github.io
