STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

Abstract

We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work on both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. More details can be found on our project page: https://nirvanalan.github.io/projects/stream3r.
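To make the idea of streaming with causal attention concrete, here is a minimal sketch rather than the authors' implementation. It assumes frame-level causality, meaning that tokens of frame t attend to all tokens of frames 0 through t; the abstract does not specify the exact attention layout. All function names (`frame_causal_mask`, `attend`, `masked_attention`, `streaming_attention`) are hypothetical and chosen for illustration only. The sketch shows that processing frames one at a time with a cache of past keys and values yields the same output as masked attention over the full sequence.

```python
import math

def frame_causal_mask(num_frames, tokens_per_frame):
    """mask[i][j] is True when query token i may attend to key token j.

    Assumption: causality is applied per frame, so a token sees every
    token of its own frame and of all earlier frames.
    """
    frame_id = [f for f in range(num_frames) for _ in range(tokens_per_frame)]
    return [[fi >= fj for fj in frame_id] for fi in frame_id]

def attend(query, keys, values):
    """Scaled dot-product softmax attention of one query over keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * val[d] for w, val in zip(weights, values)) / total
            for d in range(dim)]

def masked_attention(q, k, v, mask):
    """Offline reference: the whole sequence is available, filtered by mask."""
    out = []
    for i, query in enumerate(q):
        visible = [j for j in range(len(k)) if mask[i][j]]
        out.append(attend(query, [k[j] for j in visible],
                          [v[j] for j in visible]))
    return out

def streaming_attention(q, k, v, tokens_per_frame):
    """Online variant: frames arrive one by one; past keys/values are cached."""
    k_cache, v_cache, out = [], [], []
    for start in range(0, len(q), tokens_per_frame):
        stop = start + tokens_per_frame
        k_cache.extend(k[start:stop])  # append the new frame to the cache
        v_cache.extend(v[start:stop])
        out.extend([attend(query, k_cache, v_cache) for query in q[start:stop]])
    return out
```

In this toy setting each new frame only computes attention for its own tokens against the cache, so earlier outputs are never recomputed, which is the property that lets a causal model run online.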
