GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

Abstract

Despite remarkable advances in video depth estimation, existing methods produce affine-invariant predictions, which inherently limits their geometric fidelity and their applicability to reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity, temporally coherent point map sequences from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach lies a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging this VAE, we train a video diffusion model to capture the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
