Towards Consistent Video Geometry Estimation

Abstract

This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
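The abstract does not spell out how dynamic chunking attention is constructed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that frames are grouped into consecutive chunks, that frames attend bidirectionally within their own chunk, and that they attend causally to all earlier chunks. Under that assumption, a chunk size of 1 gives a purely causal (streaming) pattern and a chunk size equal to the sequence length gives full bidirectional attention. The names `chunked_attention_mask` and `sample_chunk_size` are hypothetical.

```python
import random


def chunked_attention_mask(num_frames, chunk_size):
    """Build a frame-level attention mask (assumed block-causal layout).

    mask[i][j] is True when frame i may attend to frame j. Frames see every
    frame in their own chunk (bidirectional) and in all earlier chunks (causal).
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    # Index of the chunk that each frame belongs to.
    chunk_of = [i // chunk_size for i in range(num_frames)]
    return [
        [chunk_of[j] <= chunk_of[i] for j in range(num_frames)]
        for i in range(num_frames)
    ]


def sample_chunk_size(num_frames, rng=random):
    """Hypothetical training-time sampler: pick a chunk size in [1, num_frames].

    Varying the chunk size across training steps is one way a single model
    could be exposed to both causal and bidirectional temporal contexts.
    """
    return rng.randint(1, num_frames)


if __name__ == "__main__":
    # Print the mask for 4 frames grouped into chunks of 2.
    for row in chunked_attention_mask(4, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

At test time, the same function could be called with a fixed chunk size chosen for the inference mode: 1 for streaming, the full sequence length for offline inference, and an intermediate value for long videos. This mapping is likewise an assumption drawn from the abstract's description rather than a documented interface.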