Geometric Context Transformer for Streaming 3D Reconstruction

Abstract

Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, a task that demands geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. The defining feature of LingBot-Map is its attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable, efficient inference at around 20 FPS on 518 × 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach outperforms both existing streaming approaches and iterative optimization-based approaches.
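
To make the idea of a compact streaming state concrete, the following is a minimal sketch of how a bounded context with three parts could be maintained as frames arrive. It is not LingBot-Map's implementation: the class name `StreamingContext`, the parameters `num_anchor`, `window_size`, and `memory_size`, their default values, and the oldest-first eviction policy are all assumptions made for illustration, and plain frame indices stand in for whatever tokens or features the model actually stores.

```python
from collections import deque


class StreamingContext:
    """Hypothetical bounded streaming state (illustrative only)."""

    def __init__(self, num_anchor=1, window_size=8, memory_size=64):
        # Assumed sizes; the abstract does not specify any of them.
        self.num_anchor = num_anchor
        self.anchor = []                          # fixed frames that ground the coordinate system
        self.window = deque(maxlen=window_size)   # most recent frames, kept in full
        self.memory = deque(maxlen=memory_size)   # older frames, kept as compact summaries

    def push(self, frame_id):
        # The first frames become the permanent anchor.
        if len(self.anchor) < self.num_anchor:
            self.anchor.append(frame_id)
            return
        # When the window is full, the frame about to leave it is
        # demoted to the memory (assumed policy: oldest summaries are evicted).
        if len(self.window) == self.window.maxlen:
            self.memory.append(self.window[0])
        self.window.append(frame_id)

    def context(self):
        # Everything the current frame would be allowed to attend to.
        return self.anchor + list(self.memory) + list(self.window)


ctx = StreamingContext()
for frame_id in range(10_000):
    ctx.push(frame_id)
print(len(ctx.context()))  # 73 = 1 anchor + 64 memory + 8 window
```

Under these assumptions the context size stays constant no matter how long the sequence runs, which is the property the abstract attributes to the design: per-frame cost does not grow with sequence length.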
