UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation

Abstract

Recently, methods that leverage diffusion model priors to assist monocular geometry estimation (e.g., depth and surface normals) have gained significant attention due to their strong generalization ability. However, most existing works estimate geometric properties within the camera coordinate system of each individual video frame, neglecting the inherent ability of diffusion models to determine inter-frame correspondence. In this work, we demonstrate that, with appropriate design and fine-tuning, the intrinsic consistency of video generation models can be effectively harnessed for consistent geometry estimation. Specifically, we 1) select geometric attributes in the global coordinate system that share the same correspondence as the video frames as prediction targets, 2) introduce a novel and efficient conditioning method that reuses positional encodings, and 3) enhance performance through joint training on multiple geometric attributes that share the same correspondence. Our method achieves superior performance in predicting global geometric attributes in videos, and its predictions can be directly applied to reconstruction tasks. Even when trained solely on static video data, our approach shows the potential to generalize to dynamic video scenes.
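To make the distinction between camera-coordinate and global-coordinate attributes concrete, the following minimal sketch lifts a per-frame depth map into a shared global frame. It is an illustration only, not the paper's implementation: the abstract does not specify the camera model or the attribute representation, so the pinhole intrinsics `(fx, fy, cx, cy)`, the 4x4 camera-to-world pose, and all function names here are assumptions.

```python
# Illustrative sketch only: pinhole camera and known poses are assumptions,
# and every function name here is hypothetical (not from the paper).

def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with z-depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_global(point_cam, cam_to_world):
    """Apply a 4x4 camera-to-world matrix (nested lists) to a 3D point."""
    x, y, z = point_cam
    return tuple(
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in cam_to_world[:3]
    )


def global_point_map(depth_map, intrinsics, cam_to_world):
    """Turn one frame's depth map into a per-pixel map of global 3D points.

    The output keeps the pixel grid of the input frame, so each pixel of the
    video frame corresponds to exactly one global-coordinate point.
    """
    fx, fy, cx, cy = intrinsics
    return [
        [
            camera_to_global(
                unproject_pixel(u, v, d, fx, fy, cx, cy), cam_to_world
            )
            for u, d in enumerate(row)
        ]
        for v, row in enumerate(depth_map)
    ]
```

Because every frame's point map is expressed in the same global frame, a static scene point observed in two frames maps to the same 3D coordinate in both, which is the kind of cross-frame consistency that per-frame camera-coordinate depth does not provide on its own.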
