Towards Consistent Video Geometry Estimation

Abstract

This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built on a plain transformer with no task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a single unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and lets it adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
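
To make the idea of an attention pattern that spans causal and bidirectional temporal contexts concrete, the following is a minimal sketch of one way a chunked temporal mask could be built. It is an illustration under stated assumptions, not ViGeo's actual implementation: the abstract does not specify how chunks are formed, so the function name chunked_attention_mask, the chunk_size parameter, and the rule "bidirectional within a chunk, causal across chunks" are all hypothetical.

```python
from typing import List


def chunked_attention_mask(num_frames: int, chunk_size: int) -> List[List[bool]]:
    """Return mask[i][j] == True when frame i may attend to frame j.

    Assumption (not taken from the source): frames are grouped into
    consecutive chunks of `chunk_size`; attention is bidirectional inside
    a chunk and causal across chunks.
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    # Index of the chunk each frame belongs to.
    chunk_of = [i // chunk_size for i in range(num_frames)]
    # Frame i sees frame j if j lies in the same chunk or an earlier one.
    return [
        [chunk_of[j] <= chunk_of[i] for j in range(num_frames)]
        for i in range(num_frames)
    ]


# chunk_size = 1 gives a purely causal (streaming-style) mask.
causal = chunked_attention_mask(4, 1)
# chunk_size = num_frames gives a fully bidirectional (full-sequence) mask.
full = chunked_attention_mask(4, 4)
# Intermediate chunk sizes mix the two regimes.
mixed = chunked_attention_mask(4, 2)
```

Under this assumed scheme, a single parameter moves between the streaming and full-sequence regimes, which is one plausible reading of how the attention pattern could change at test time without retraining. Varying chunk_size across training batches would be one way to expose a model to both contexts; whether ViGeo does so is not stated in the abstract.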