HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

Abstract

Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer from drift, jitter, or collapse on long sequences. We trace these failures to a fundamental mismatch. Streaming geometry is inherently temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale. However, current architectures impose uniform and pathological influence patterns. For example, sliding windows enforce hard cutoffs, while ungated recurrence and causal attention cause cache saturation and spike-like attention sinks. To resolve this, we formalize geometric propagation as an evidence influence kernel and propose HorizonStream, a long-horizon Transformer that explicitly factorizes this kernel. For the long-range temporal factor, Geometric Linear Attention learns channel-wise decay rates to enable bounded, multi-timescale propagation of geometric evidence. For the short-range spatial factor, Geometric Local Attention with Spatiotemporal RoPE performs reliable 3D matching while suppressing attention sinks. Finally, Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance. Project Page: https://3dagentworld.github.io/horizonstream/
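
To make the idea of channel-wise decay concrete, the following is a minimal sketch of a generic decayed linear-attention recurrence. It is not the authors' Geometric Linear Attention: the class name `DecayedLinearAttention`, the helper `influence`, and the fixed (rather than learned) decay rates are all assumptions made for illustration. It shows only how a per-channel decay could give a fixed-size state in which some channels forget quickly and others retain evidence over long horizons.

```python
from typing import List

Vector = List[float]


class DecayedLinearAttention:
    """Hypothetical causal linear-attention state with one decay rate per key channel."""

    def __init__(self, decay: Vector, value_dim: int) -> None:
        # Assumption: decay rates are given constants; a real model would learn them.
        if not all(0.0 < d < 1.0 for d in decay):
            raise ValueError("each decay rate must lie strictly between 0 and 1")
        self.decay = list(decay)
        # Fixed-size state (len(decay) x value_dim), independent of stream length.
        self.state = [[0.0] * value_dim for _ in decay]

    def step(self, q: Vector, k: Vector, v: Vector) -> Vector:
        # Fade old evidence per channel, then add the outer product of key and value.
        for i, row in enumerate(self.state):
            for j in range(len(row)):
                row[j] = self.decay[i] * row[j] + k[i] * v[j]
        # Read out with the query: o[j] = sum_i q[i] * state[i][j].
        return [
            sum(q[i] * self.state[i][j] for i in range(len(q)))
            for j in range(len(v))
        ]


def influence(decay_rate: float, lag: int) -> float:
    """Weight that evidence written `lag` steps ago still carries in one channel."""
    return decay_rate ** lag
```

In this sketch, a channel with decay rate 0.5 has effectively forgotten a frame after a few dozen steps, while a channel with decay rate 0.9999 still weights it by roughly 0.37 after 10,000 steps. Because each rate is strictly below 1, every state entry is bounded by a geometric series: if each product `k[i] * v[j]` has magnitude at most B, the entry never exceeds B / (1 - decay[i]), and memory use does not grow with sequence length.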