Toward Consistent Video Geometry Estimation

Abstract

This work presents ViGeo, a feed-forward foundation model that recovers spatially dense and temporally consistent geometry from video sequences. Built on a plain transformer with no task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a single unified model. Its key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and lets it adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework: a video depth completion teacher is trained to condition on sparse and noisy annotations and to exploit video and multi-view context, producing dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance on online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
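
To make the idea of mixing bidirectional and causal temporal context concrete, the following is a minimal sketch of one way a chunk-based attention mask could be built. It is not taken from the ViGeo implementation: the function name `chunked_attention_mask`, the `chunk_size` parameter, and the rule that frames attend bidirectionally within a chunk and causally across chunks are all assumptions made for illustration.

```python
from typing import List


def chunked_attention_mask(num_frames: int, chunk_size: int) -> List[List[bool]]:
    """Frame-level mask: mask[i][j] is True if frame i may attend to frame j.

    Assumed rule (hypothetical, not confirmed by the abstract): frames are
    grouped into consecutive chunks; a frame sees every frame in its own
    chunk (bidirectional) and every frame in earlier chunks (causal).
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    # Index of the chunk each frame belongs to.
    chunk_of = [i // chunk_size for i in range(num_frames)]
    return [
        [chunk_of[j] <= chunk_of[i] for j in range(num_frames)]
        for i in range(num_frames)
    ]


# chunk_size = 1 reduces to a purely causal (streaming-style) mask.
streaming = chunked_attention_mask(4, 1)
# chunk_size = num_frames reduces to full bidirectional attention.
full = chunked_attention_mask(4, 4)
# Intermediate sizes mix both kinds of context.
mixed = chunked_attention_mask(4, 2)
```

Under this assumption, a single parameter covers the streaming and full-sequence modes mentioned above, and varying `chunk_size` during training would be one plausible way to expose a model to both kinds of context. Whether ViGeo chunks attention exactly this way is not stated in the abstract.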