

STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

August 14, 2025
作者: Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, Xingang Pan
cs.AI

Abstract

We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. More details can be found on our project page: https://nirvanalan.github.io/projects/stream3r.
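To make the core idea concrete, below is a minimal, hypothetical PyTorch sketch of frame-level causal attention over per-frame image tokens, with a linear head regressing a pointmap per token. All names, dimensions, and the simple point head are illustrative assumptions for this sketch; they are not the paper's actual architecture or code.

```python
# Illustrative sketch only: a decoder-only style block with a block-causal mask,
# so tokens of frame t attend only to tokens of frames <= t (streaming order).
import torch
import torch.nn as nn


class CausalFrameBlock(nn.Module):
    """One Transformer block whose attention is restricted by a causal frame mask."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))


def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Block-causal mask: True entries (future frames) are masked out."""
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_ids.unsqueeze(0) > frame_ids.unsqueeze(1)


# Toy usage: 4 frames, 16 tokens per frame, 256-dim features (hypothetical sizes).
tokens = torch.randn(1, 4 * 16, 256)       # (batch, frames * tokens_per_frame, dim)
mask = frame_causal_mask(num_frames=4, tokens_per_frame=16)
features = CausalFrameBlock()(tokens, mask)
point_head = nn.Linear(256, 3)              # per-token 3D point regression (pointmap)
pointmap = point_head(features)             # shape (1, 64, 3)
```

Because later frames never influence earlier ones under this mask, such a model can, in principle, process frames as they arrive and reuse cached states, which is the streaming property the abstract highlights.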