HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

May 22, 2026
Authors: Chong Cheng, Peilin Tao, Nanjie Yao, Guanzhi Ding, Xianda Chen, Yuansen Du, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, Zhengqing Chen, Hao Wang
cs.AI

Abstract

Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer from drift, jitter, or collapse on long sequences. We trace these failures to a fundamental mismatch. Streaming geometry is inherently temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale. However, current architectures impose uniform and pathological influence patterns. For example, sliding windows enforce hard cutoffs, while ungated recurrence and causal attention cause cache saturation and spike-like attention sinks. To resolve this, we formalize geometric propagation as an evidence influence kernel and propose HorizonStream, a long-horizon Transformer that explicitly factorizes this kernel. For the long-range temporal factor, Geometric Linear Attention learns channel-wise decay rates to enable bounded, multi-timescale propagation of geometric evidence. For the short-range spatial factor, Geometric Local Attention with Spatiotemporal RoPE performs reliable 3D matching while suppressing attention sinks. Finally, Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance. Project Page: https://3dagentworld.github.io/horizonstream/
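
The abstract does not spell out the update rule behind Geometric Linear Attention, so the following is only a minimal sketch of a generic causal linear attention with a fixed per-channel decay. It is meant to illustrate why such a recurrence uses constant memory, stays bounded, and mixes several timescales. The function name `decayed_linear_attention`, the state layout, and the hand-picked decay values are assumptions made for illustration and are not taken from the paper, where the decay rates are learned.

```python
def decayed_linear_attention(queries, keys, values, decay):
    """Generic causal linear attention with a per-key-channel decay.

    Illustrative assumption, not the paper's exact formulation.
    The state S has shape (d_k, d_v) and is updated per frame as
        S_t[i][j] = decay[i] * S_{t-1}[i][j] + k_t[i] * v_t[j]
    The output for frame t is o_t[j] = sum_i q_t[i] * S_t[i][j].
    Memory is O(d_k * d_v) regardless of sequence length.
    """
    d_k, d_v = len(decay), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for i in range(d_k):
            for j in range(d_v):
                # Old evidence fades at a channel-specific rate,
                # then the new key/value evidence is added.
                state[i][j] = decay[i] * state[i][j] + k[i] * v[j]
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs, state


if __name__ == "__main__":
    # Channel 0 forgets quickly (decay 0.5); channel 1 persists (decay 0.99).
    decay = [0.5, 0.99]
    n = 10_000
    queries = [[1.0, 1.0]] * n
    keys = [[1.0, 1.0]] * n
    values = [[1.0]] * n
    _, state = decayed_linear_attention(queries, keys, values, decay)
    # With constant unit input, each channel approaches 1 / (1 - decay)
    # instead of growing without bound.
    print(state[0][0], state[1][0])
```

With a decay of `a` in a channel, evidence from `t` frames ago is weighted by `a**t`, so the effective horizon of that channel is roughly `1 / (1 - a)` frames. Setting every decay to 1 recovers the ungated recurrence, whose state grows linearly with sequence length under the same input, which is one way to picture the saturation the abstract attributes to ungated designs.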