RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

Abstract

We present RADIO-ViPE (Reduce All Domains Into One -- Video Pose Engine), an online semantic SLAM system that enables geometry-aware open-vocabulary grounding, associating arbitrary natural language queries with localized 3D regions and objects in dynamic environments. Unlike existing approaches that require calibrated, posed RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams and requires no prior camera intrinsics, depth sensors, or pose initialization. The system tightly couples multi-modal embeddings, spanning vision and language and derived from agglomerative foundation models (e.g., RADIO), with geometric scene information. This coupling takes place in initialization, optimization, and factor graph connections, improving the consistency of the map across modalities. The optimization is wrapped in adaptive robust kernels designed to handle both actively moving objects and agent-displaced scene elements (e.g., furniture rearranged during an egocentric session). Experiments demonstrate that RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark while remaining competitive with offline open-vocabulary methods that rely on calibrated data and static-scene assumptions. RADIO-ViPE bridges a critical gap in real-world deployment, enabling robust open-vocabulary semantic grounding for autonomous robotics and unconstrained in-the-wild video streams. Project page: https://be2rlab.github.io/radio_vipe
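
The abstract does not specify the form of the adaptive robust kernels, so the following is only a generic illustration of the idea, not the RADIO-ViPE formulation. It assumes a Cauchy kernel whose scale is re-estimated at every iteration from the median absolute deviation (MAD) of the residuals, and applies it to a toy scalar estimation problem in which a few measurements come from a moving object. All function names and data values are hypothetical.

```python
import statistics


def cauchy_weight(residual, scale):
    """Reweighting factor of the Cauchy kernel: 1 / (1 + (r / c)^2)."""
    return 1.0 / (1.0 + (residual / scale) ** 2)


def mad_scale(residuals, floor=1e-6):
    """Kernel scale from the median absolute deviation of the residuals.

    The factor 1.4826 makes the MAD comparable to a Gaussian standard
    deviation; the floor avoids a zero scale on noise-free data.
    """
    med = statistics.median(residuals)
    mad = statistics.median([abs(r - med) for r in residuals])
    return max(1.4826 * mad, floor)


def robust_mean(values, iterations=10):
    """Iteratively reweighted least squares for a single scalar.

    The scale is recomputed each iteration, which is the (assumed)
    sense in which this toy kernel is "adaptive".
    """
    estimate = statistics.median(values)
    for _ in range(iterations):
        residuals = [v - estimate for v in values]
        scale = mad_scale(residuals)
        weights = [cauchy_weight(r, scale) for r in residuals]
        estimate = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return estimate


# Hypothetical measurements: five from static structure, two from a moving object.
static = [1.00, 1.02, 0.98, 1.01, 0.99]
moving = [3.5, 4.0]
measurements = static + moving

plain = sum(measurements) / len(measurements)  # pulled toward the moving object
robust = robust_mean(measurements)             # stays close to the static value
```

In a SLAM back end the same weights would multiply the residuals of the factor graph rather than scalar measurements, so that observations of moving or displaced objects contribute little to the pose and map estimate.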
