Geometric Context Transformer for Streaming 3D Reconstruction
April 15, 2026
Authors: Lin-Zhuo Chen, Jian Gao, Yihang Chen, Ka Leong Cheng, Yipengjing Sun, Liangxiao Hu, Nan Xue, Xing Zhu, Yujun Shen, Yao Yao, Yinghao Xu
cs.AI
Abstract
Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal
consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation
model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully
designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and
long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable efficient inference at around
20 FPS on 518 x 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach
achieves superior performance compared to both existing streaming and iterative optimization-based approaches.
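The abstract's three-part attention context can be pictured as a compact streaming state: frozen anchor tokens from the first frame (coordinate grounding), a sliding window of recent frames (dense geometric cues), and a sparse memory of keyframes (long-range drift correction). The sketch below illustrates only this bookkeeping, not the paper's actual architecture; the class name, window size, keyframe stride, and token shapes are all illustrative assumptions.

```python
from collections import deque

import numpy as np


class StreamingGeometricContext:
    """Illustrative streaming state for a geometric-context attention scheme.

    Assumed (not from the paper):
      - anchor tokens are frozen from the first frame,
      - a sliding pose-reference window holds the last `window_size` frames,
      - a trajectory memory keeps every `keyframe_stride`-th frame's tokens.
    """

    def __init__(self, window_size: int = 4, keyframe_stride: int = 10):
        self.keyframe_stride = keyframe_stride
        self.anchor = None                       # tokens of frame 0, never evicted
        self.window = deque(maxlen=window_size)  # most recent frame tokens
        self.memory = []                         # sparse keyframe tokens
        self.t = 0                               # frame counter

    def step(self, frame_tokens: np.ndarray) -> np.ndarray:
        """Return the context tokens the new frame would attend over,
        then fold the frame into the streaming state."""
        if self.anchor is None:
            self.anchor = frame_tokens
        # Context = anchor + trajectory memory + recent window.
        ctx = np.concatenate([self.anchor, *self.memory, *self.window], axis=0)
        # Update state: keep a sparse keyframe, slide the window forward.
        if self.t > 0 and self.t % self.keyframe_stride == 0:
            self.memory.append(frame_tokens)
        self.window.append(frame_tokens)
        self.t += 1
        return ctx


if __name__ == "__main__":
    state = StreamingGeometricContext(window_size=4, keyframe_stride=10)
    for t in range(25):
        tokens = np.full((16, 8), float(t))  # 16 tokens of dim 8 per frame
        ctx = state.step(tokens)
        print(t, ctx.shape[0])
```

Because the window is bounded and the memory grows only every `keyframe_stride` frames, the attention context stays nearly constant in size over thousands of frames, which is one plausible way a model could sustain roughly constant-FPS inference on very long sequences.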