EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding

Abstract

Understanding a 3D scene as it is being explored is essential for embodied tasks, where an agent must construct and comprehend the 3D scene online and in nearly real time. In this study, we propose EmbodiedSplat, an online feed-forward 3DGS for open-vocabulary scene understanding that enables simultaneous online 3D reconstruction and 3D semantic understanding from streaming images. Unlike existing open-vocabulary 3DGS methods, which are typically restricted to either offline or per-scene optimization settings, our objectives are two-fold: 1) to reconstruct the semantic-embedded 3DGS of an entire scene from over 300 streaming images in an online manner; and 2) to generalize well to novel scenes through a feed-forward design and to support nearly real-time 3D semantic reconstruction when combined with real-time 2D models. To achieve these objectives, we propose an Online Sparse Coefficients Field with a CLIP Global Codebook, which binds the 2D CLIP embeddings to each 3D Gaussian while minimizing memory consumption and preserving the full semantic generalizability of CLIP. Furthermore, we generate 3D geometry-aware CLIP features by aggregating the partial point cloud of the 3DGS through a 3D U-Net, compensating for the lack of a 3D geometric prior in 2D-oriented language embeddings. Extensive experiments on diverse indoor datasets, including ScanNet, ScanNet++, and Replica, demonstrate both the effectiveness and efficiency of our method. The project page is available at https://0nandon.github.io/EmbodiedSplat/.
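
The abstract does not spell out how the Online Sparse Coefficients Field is built, so the following is only a minimal sketch of the general idea behind a sparse-coefficient codebook, not the authors' implementation. It assumes each Gaussian stores a few indices and weights into a shared codebook instead of a dense embedding. The sizes, the top-k value, the random vectors standing in for CLIP embeddings, and the function names `decode` and `query` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
NUM_CODES = 64        # entries in the shared codebook
DIM = 512             # embedding width
NUM_GAUSSIANS = 1000  # Gaussians in the scene
TOP_K = 4             # non-zero coefficients kept per Gaussian


def normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Random unit vectors stand in for real CLIP embeddings (assumption).
codebook = normalize(rng.standard_normal((NUM_CODES, DIM)))

# Sparse field: each Gaussian keeps TOP_K codebook indices and weights.
indices = np.stack([
    rng.choice(NUM_CODES, size=TOP_K, replace=False)
    for _ in range(NUM_GAUSSIANS)
])
weights = rng.random((NUM_GAUSSIANS, TOP_K))
weights /= weights.sum(axis=1, keepdims=True)


def decode(indices, weights, codebook):
    """Rebuild dense per-Gaussian features as weighted sums of codebook rows."""
    gathered = codebook[indices]  # shape (N, K, DIM)
    return normalize(np.einsum("nk,nkd->nd", weights, gathered))


def query(text_embedding, indices, weights, codebook):
    """Cosine similarity between one text embedding and every Gaussian."""
    return decode(indices, weights, codebook) @ normalize(text_embedding)


# Stand-in "text" query: here simply one codebook entry.
scores = query(codebook[3], indices, weights, codebook)

# Rough float32/int32 storage comparison: dense features vs. sparse field.
dense_bytes = NUM_GAUSSIANS * DIM * 4
sparse_bytes = NUM_GAUSSIANS * TOP_K * (4 + 4) + NUM_CODES * DIM * 4
print(scores.shape, dense_bytes, sparse_bytes)
```

Under these assumptions the per-Gaussian cost depends on the number of kept coefficients rather than on the embedding width, which is one way a codebook can reduce memory while still letting any text embedding be compared against every Gaussian.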
