GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

Abstract

Despite remarkable advances in video depth estimation, existing methods are inherently limited in geometric fidelity because they produce affine-invariant predictions, which restricts their applicability to reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity, temporally coherent point map sequences from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach lies a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging this VAE, we train a video diffusion model to capture the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
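To illustrate why affine-invariant depth falls short for metrically grounded tasks, the following is a minimal sketch and not the method proposed here. It back-projects three pixels with a pinhole camera and compares the resulting 3D points under an unknown depth scale and an unknown depth shift. The intrinsics (`fx`, `fy`, `cx`, `cy`), the pixel coordinates, and the helper names `unproject` and `collinearity_error` are all assumptions chosen for illustration.

```python
def unproject(u, v, depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Back-project pixel (u, v) at the given depth into a camera-space 3D point.

    The pinhole intrinsics are hypothetical values chosen for this sketch.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def collinearity_error(p0, p1, p2):
    """Norm of (p1 - p0) x (p2 - p0); zero exactly when the points are collinear."""
    a = [p1[i] - p0[i] for i in range(3)]
    b = [p2[i] - p0[i] for i in range(3)]
    c = (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )
    return sum(t * t for t in c) ** 0.5


def points_from_depths(pixels, depths):
    """Build a tiny three-point 'point map' from pixels and per-pixel depths."""
    return [unproject(u, v, d) for (u, v), d in zip(pixels, depths)]


# Three pixels that observe a straight 3D line: (-1, 0, 2), (0, 0, 3), (1, 0, 4).
pixels = [(70.0, 240.0), (320.0, 240.0), (445.0, 240.0)]
true_depths = [2.0, 3.0, 4.0]

# Ground truth, depth with an unknown scale, and depth with an unknown shift.
true_points = points_from_depths(pixels, true_depths)
scaled_points = points_from_depths(pixels, [2.0 * d for d in true_depths])
shifted_points = points_from_depths(pixels, [d + 1.0 for d in true_depths])

err_true = collinearity_error(*true_points)
err_scaled = collinearity_error(*scaled_points)
err_shifted = collinearity_error(*shifted_points)

print("true:", err_true, "scaled:", err_scaled, "shifted:", err_shifted)
```

In this toy setup the scaled reconstruction keeps the line straight, since a global scale only resizes the scene, while the shifted reconstruction bends it. An unresolved shift therefore distorts 3D shape, which is one way to see why predicting point maps directly can be preferable for reconstruction.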

