STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer
August 14, 2025
Authors: Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, Xingang Pan
cs.AI
Abstract
We present STream3R, a novel approach to 3D reconstruction that reformulates
pointmap prediction as a decoder-only Transformer problem. Existing
state-of-the-art methods for multi-view reconstruction either depend on
expensive global optimization or rely on simplistic memory mechanisms that
scale poorly with sequence length. In contrast, STream3R introduces a
streaming framework that processes image sequences efficiently using causal
attention, inspired by advances in modern language modeling. By learning
geometric priors from large-scale 3D datasets, STream3R generalizes well to
diverse and challenging scenarios, including dynamic scenes where traditional
methods often fail. Extensive experiments show that our method consistently
outperforms prior work across both static and dynamic scene benchmarks.
Moreover, STream3R is inherently compatible with LLM-style training
infrastructure, enabling efficient large-scale pretraining and fine-tuning for
various downstream 3D tasks. Our results underscore the potential of causal
Transformer models for online 3D perception, paving the way for real-time 3D
understanding in streaming environments. More details can be found on our
project page: https://nirvanalan.github.io/projects/stream3r.
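The frame-level causal attention that underlies this streaming setup can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's architecture: the token dimensions and the `causal_attention` helper are assumptions made here for clarity. The key property it demonstrates is causality at the frame level: tokens of each incoming frame attend only to tokens of the current and earlier frames, so earlier frames' outputs never change when new frames arrive, which is what makes online, cache-friendly processing possible.

```python
import numpy as np

def causal_attention(q, k, v, frame_ids):
    """Frame-level causal attention (illustrative sketch).

    q, k, v     : (T, d) arrays of token features.
    frame_ids   : (T,) integer frame index of each token.
    A token may attend to tokens of its own frame or any earlier frame,
    but never to tokens of a later frame.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (T, T) similarity
    future = frame_ids[None, :] > frame_ids[:, None]  # True = later frame
    scores[future] = -np.inf                          # mask out the future
    # numerically stable softmax over the allowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because masked positions contribute zero weight, the output for frame 0's tokens is identical whether or not later frames are present in the sequence; in a real decoder-only model this is what lets keys and values of past frames be cached and reused as the stream grows.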