

Geometric Context Transformer for Streaming 3D Reconstruction

April 15, 2026
作者: Lin-Zhuo Chen, Jian Gao, Yihang Chen, Ka Leong Cheng, Yipengjing Sun, Liangxiao Hu, Nan Xue, Xing Zhu, Yujun Shen, Yao Yao, Yinghao Xu
cs.AI

Abstract

Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable, efficient inference at around 20 FPS on 518×378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach achieves superior performance compared to both existing streaming and iterative optimization-based approaches.
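The abstract's key efficiency claim is that the streaming state stays compact: each new frame attends only to a bounded set of context tokens rather than the whole history. The sketch below illustrates that idea with three bounded buffers matching the three components named in the abstract (anchor context, pose-reference window, trajectory memory). All class names, buffer sizes, and the keyframe rule are illustrative assumptions, not details from the paper.

```python
from collections import deque

class StreamingState:
    """Hypothetical sketch of a bounded streaming state with the three
    components the abstract names. Sizes and fusion logic are assumptions."""

    def __init__(self, window_size=8, memory_size=64):
        self.anchor_context = None  # fixed tokens for coordinate grounding
        # Recent frames supplying dense geometric cues:
        self.pose_window = deque(maxlen=window_size)
        # Sparse keyframes for long-range drift correction:
        self.trajectory_memory = deque(maxlen=memory_size)

    def update(self, frame_tokens, is_keyframe=False):
        if self.anchor_context is None:
            # The first frame anchors the global coordinate system.
            self.anchor_context = frame_tokens
        self.pose_window.append(frame_tokens)
        if is_keyframe:
            self.trajectory_memory.append(frame_tokens)

    def attention_context(self):
        # Per-frame attention would run over this bounded token set, so
        # cost is constant regardless of how many frames have streamed by.
        return [self.anchor_context, *self.pose_window, *self.trajectory_memory]
```

Because both deques are capped, the attention context never exceeds `1 + window_size + memory_size` entries even over sequences of 10,000+ frames, which is the property that makes constant-rate inference plausible.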
PDF · April 17, 2026