Towards Consistent Video Geometry Estimation

Abstract

This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
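
The abstract does not spell out how dynamic chunking attention is implemented. As a minimal sketch of one plausible reading, the frame-level mask below is bidirectional inside a chunk and causal across chunks, so a chunk size of 1 gives a streaming (fully causal) pattern and a chunk size equal to the sequence length gives full-sequence (fully bidirectional) attention. The function names, the block-causal mask, and the uniform sampling of the chunk size during training are all assumptions made for illustration, not details confirmed by the source.

```python
import random


def chunked_attention_mask(num_frames, chunk_size):
    """Build a frame-level mask where mask[i][j] is True if frame i may attend to frame j.

    Assumed rule (illustrative only): frames in the same chunk see each other
    (bidirectional), and every frame also sees all frames in earlier chunks (causal).
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    # Index of the chunk each frame belongs to.
    chunk_of = [i // chunk_size for i in range(num_frames)]
    return [
        [chunk_of[j] <= chunk_of[i] for j in range(num_frames)]
        for i in range(num_frames)
    ]


def sample_chunk_size(num_frames, rng):
    """Hypothetical training-time sampler: pick a chunk size uniformly in [1, num_frames]."""
    return rng.randint(1, num_frames)


if __name__ == "__main__":
    rng = random.Random(0)
    size = sample_chunk_size(6, rng)
    for row in chunked_attention_mask(6, size):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumption, switching between streaming and full-sequence inference only changes `chunk_size` at test time, which matches the claim that no retraining is needed. In a real transformer each frame contributes many tokens, so the frame-level mask would be expanded to token level before being applied.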