Towards Consistent Video Geometry Estimation

Abstract

This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
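
The abstract does not specify how dynamic chunking attention is implemented. As a rough illustration only, one way to obtain both causal and bidirectional temporal contexts from a single mechanism is a block-causal mask over frames: frames attend bidirectionally within their own chunk and causally to earlier chunks. The function names, the frame-level mask, and the uniform chunk-size sampling below are all assumptions for this sketch, not the authors' method.

```python
import numpy as np


def chunked_attention_mask(num_frames: int, chunk_size: int) -> np.ndarray:
    """Hypothetical block-causal mask over frames.

    Returns a boolean [num_frames, num_frames] array where entry (i, j)
    is True if frame i may attend to frame j. Frames in the same chunk
    see each other (bidirectional); earlier chunks are visible, later
    chunks are not (causal across chunks).
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    chunk_ids = np.arange(num_frames) // chunk_size
    # Frame i (row) may attend to frame j (column) if j's chunk is not later.
    return chunk_ids[None, :] <= chunk_ids[:, None]


def sample_training_mask(num_frames: int, rng: np.random.Generator) -> np.ndarray:
    """Assumed training-time behaviour: draw a random chunk size per sample
    so the model sees contexts ranging from causal to fully bidirectional."""
    chunk_size = int(rng.integers(1, num_frames + 1))  # upper bound is exclusive
    return chunked_attention_mask(num_frames, chunk_size)
```

Under these assumptions, a chunk size of 1 reduces to a standard causal mask (streaming inference), a chunk size equal to the sequence length gives full bidirectional attention (full-sequence inference), and intermediate sizes process a long video chunk by chunk. The pattern can then be chosen at test time without changing the weights.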