Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

Abstract

We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic prior captured by such video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, depth, and ray maps. It uses a new multi-modal alignment algorithm to align and fuse these modalities, as well as multiple sliding windows, at inference time, thus obtaining robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods, including recent methods such as MonST3R, which are also designed to handle dynamic scenes.