

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

May 22, 2026
Authors: Chong Cheng, Peilin Tao, Nanjie Yao, Guanzhi Ding, Xianda Chen, Yuansen Du, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, Zhengqing Chen, Hao Wang
cs.AI

Abstract

Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer from drift, jitter, or collapse on long sequences. We trace these failures to a fundamental mismatch: streaming geometry is inherently temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale, yet current architectures impose uniform and pathological influence patterns. For example, sliding windows enforce hard cutoffs, while ungated recurrence and causal attention cause cache saturation and spike-like attention sinks. To resolve this, we formalize geometric propagation as an evidence influence kernel and propose HorizonStream, a long-horizon Transformer that explicitly factorizes this kernel. For the long-range temporal factor, Geometric Linear Attention learns channel-wise decay rates to enable bounded, multi-timescale propagation of geometric evidence. For the short-range spatial factor, Geometric Local Attention with Spatiotemporal RoPE performs reliable 3D matching while suppressing attention sinks. Finally, Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance. Project Page: https://3dagentworld.github.io/horizonstream/
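The idea of bounded, multi-timescale propagation through channel-wise decay resembles gated (decayed) linear attention. The following is a minimal sketch under our own assumptions and is not HorizonStream's implementation. It assumes the recurrence S_t = diag(decay) S_{t-1} + k_t v_t^T, read out as o_t = S_t^T q_t. It uses fixed, hand-picked decay rates instead of learned ones. It reads the "influence kernel" of channel i as decay[i] ** lag. The names `decayed_linear_attention`, `influence`, and `half_life` are hypothetical. The sketch shows why the state has constant size and how each channel's decay sets its own time horizon.

```python
"""Hypothetical sketch: causal linear attention with one decay rate per key channel.

Assumptions (not from the paper): state update S_t = diag(decay) S_{t-1} + k_t v_t^T,
readout o_t = S_t^T q_t, and fixed decay rates in (0, 1] chosen by hand.
"""
import math
from typing import List, Sequence


def decayed_linear_attention(
    queries: Sequence[Sequence[float]],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    decay: Sequence[float],
) -> List[List[float]]:
    """Process a stream step by step with a fixed-size d_k x d_v state.

    Memory does not grow with sequence length, and total work is linear in it.
    """
    if not values:
        return []
    d_k, d_v = len(decay), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs: List[List[float]] = []
    for q, k, v in zip(queries, keys, values):
        for i in range(d_k):
            for j in range(d_v):
                # Forget old evidence at channel i's own rate, then write the new outer product.
                state[i][j] = decay[i] * state[i][j] + k[i] * v[j]
        # Read out: o_t[j] = sum_i q_t[i] * S_t[i][j].
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


def influence(lag: int, decay: float) -> float:
    """Weight that evidence written `lag` steps ago still carries in one channel."""
    return decay ** lag


def half_life(decay: float) -> float:
    """Number of steps after which a channel's influence falls to one half."""
    return math.log(0.5) / math.log(decay)


if __name__ == "__main__":
    T = 10_000
    qk = [[1.0, 1.0]] * T  # unit queries/keys on two key channels
    v = [[1.0]] * T        # unit values
    gated = decayed_linear_attention(qk, qk, v, decay=[0.5, 0.999])
    ungated = decayed_linear_attention(qk, qk, v, decay=[1.0, 1.0])
    print(gated[-1][0])    # stays below 1/(1-0.5) + 1/(1-0.999) = 1002
    print(ungated[-1][0])  # grows with T: 2 * T = 20000
    print(half_life(0.5), half_life(0.999))  # 1 step vs. roughly 693 steps
```

In this toy setting, the channel with decay 0.5 behaves like a short-lived correspondence memory. The channel with decay 0.999 retains evidence for hundreds of steps. Both stay bounded. With decay 1.0 (no gate), the state grows linearly with the stream length, which is one simple way to picture the saturation the abstract attributes to ungated recurrence.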