Geometric Context Transformer for Streaming 3D Reconstruction

Abstract

Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, a task that demands geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map is its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable, efficient inference at around 20 FPS on 518 × 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach outperforms both existing streaming approaches and iterative optimization-based approaches.
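To make the idea of a compact streaming state concrete, the following is a minimal bookkeeping sketch, not the LingBot-Map implementation. The abstract names three context components but gives no sizes or update rules, so everything here is an assumption made for illustration: the class name `StreamingContext`, the parameters `num_anchor` and `window_size`, the choice to keep dense tokens only for the anchor and the recent window, and the choice to keep one compact summary per older frame as the trajectory memory.

```python
from collections import deque


class StreamingContext:
    """Hypothetical bounded context for a streaming model (illustration only)."""

    def __init__(self, num_anchor=1, window_size=8):
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        self.num_anchor = num_anchor
        self.anchor = []                          # dense tokens of the first frame(s)
        self.window = deque(maxlen=window_size)   # dense tokens of recent frames
        self.trajectory = []                      # one compact summary per older frame

    def push(self, frame_id, dense_tokens, summary):
        # Assumption: the first frames define the coordinate frame and are kept forever.
        if len(self.anchor) < self.num_anchor:
            self.anchor.append((frame_id, dense_tokens))
            return
        # Before the oldest window frame is dropped, keep only its compact summary.
        if len(self.window) == self.window.maxlen:
            old_id, _, old_summary = self.window[0]
            self.trajectory.append((old_id, old_summary))
        self.window.append((frame_id, dense_tokens, summary))

    def num_context_tokens(self):
        # Dense tokens from anchor and window, plus one token per trajectory entry.
        dense = sum(len(tokens) for _, tokens in self.anchor)
        dense += sum(len(tokens) for _, tokens, _ in self.window)
        return dense + len(self.trajectory)


tokens_per_frame = 100  # hypothetical value, not taken from the paper
ctx = StreamingContext(num_anchor=1, window_size=8)
for i in range(10_000):
    ctx.push(i, [0.0] * tokens_per_frame, summary=0.0)

print(ctx.num_context_tokens())   # context size under this sketch
print(10_000 * tokens_per_frame)  # size if every frame kept its dense tokens
```

Under these assumed numbers, the context after 10,000 frames holds 10,891 tokens rather than the 1,000,000 that would result from keeping dense tokens for every frame. The real model may bound or compress its trajectory memory differently.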
