Geometric Context Transformer for Streaming 3D Reconstruction

Abstract

Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which requires geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built on a geometric context transformer (GCT) architecture. The core of LingBot-Map is its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable, efficient inference at around 20 FPS on 518 x 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach outperforms both existing streaming approaches and iterative optimization-based approaches.
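The three context components named in the abstract can be illustrated with a toy bookkeeping sketch. This is not LingBot-Map's implementation: the class `StreamingContext`, its parameters (`num_anchors`, `window_size`, `summary_len`), and the rule of summarizing an evicted frame by truncating its token list are all hypothetical choices made here only to show how a streaming state might stay compact while frames keep arriving.

```python
from collections import deque


class StreamingContext:
    """Illustrative bounded attention context; all sizes and names are assumptions."""

    def __init__(self, num_anchors=1, window_size=4, summary_len=1):
        self.num_anchors = num_anchors    # assumed: first frame(s) fix the coordinate frame
        self.window_size = window_size    # assumed size of the pose-reference window
        self.summary_len = summary_len    # assumed tokens kept per frame in the memory
        self.anchor = []                  # dense tokens of the anchor frame(s), kept forever
        self.window = deque()             # dense tokens of the most recent frames
        self.memory = []                  # compact summaries of frames that left the window

    def push(self, frame_tokens):
        """Add one frame's tokens and demote the oldest window frame if needed."""
        if len(self.anchor) < self.num_anchors:
            self.anchor.append(frame_tokens)
            return
        self.window.append(frame_tokens)
        if len(self.window) > self.window_size:
            evicted = self.window.popleft()
            # Placeholder summarization: keep only the first few tokens of the frame.
            self.memory.append(evicted[: self.summary_len])

    def context_size(self):
        """Number of tokens a newly arriving frame would attend to."""
        dense = sum(len(t) for t in self.anchor) + sum(len(t) for t in self.window)
        compact = sum(len(t) for t in self.memory)
        return dense + compact


def simulate(num_frames, tokens_per_frame=100):
    """Feed placeholder frames through the context and return the final state."""
    ctx = StreamingContext()
    for i in range(num_frames):
        ctx.push([i] * tokens_per_frame)
    return ctx
```

Under these assumed sizes, `simulate(50).context_size()` is 545 tokens (100 anchor, 400 window, 45 memory), compared with 5,000 if every frame's dense tokens were retained. The dense part of the context stays constant, and only the compact memory grows with sequence length.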
