

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

April 28, 2026
作者: Zaid Nasser, Mikhail Iumanov, Tianhao Li, Maxim Popov, Jaafar Mahmoud, Sergey Kolyubin
cs.AI

Abstract

We present RADIO-ViPE (Reduce All Domains Into One -- Video Pose Engine), an online semantic SLAM system that enables geometry-aware open-vocabulary grounding, associating arbitrary natural language queries with localized 3D regions and objects in dynamic environments. Unlike existing approaches that require calibrated, posed RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams, requiring no prior camera intrinsics, depth sensors, or pose initialization. The system tightly couples multi-modal embeddings -- spanning vision and language -- derived from agglomerative foundation models (e.g., RADIO) with geometric scene information. This coupling takes place during initialization, optimization, and factor-graph construction to improve the cross-modal consistency of the map. The optimization is wrapped within adaptive robust kernels, designed to handle both actively moving objects and agent-displaced scene elements (e.g., furniture rearranged during an ego-centric session). Experiments demonstrate that RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark while maintaining competitive performance against offline open-vocabulary methods that rely on calibrated data and static scene assumptions. RADIO-ViPE bridges a critical gap in real-world deployment, enabling robust open-vocabulary semantic grounding for autonomous robotics and unconstrained in-the-wild video streams. Project page: https://be2rlab.github.io/radio_vipe
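Two ingredients named in the abstract — grounding a language query against per-object embeddings, and down-weighting residuals from moving objects with a robust kernel — can be illustrated with a minimal sketch. The function names, the cosine-similarity scoring, and the choice of the Huber kernel are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def ground_query(query_emb, object_embs):
    """Rank mapped 3D objects by cosine similarity to a text-query embedding.

    query_emb:   (d,) embedding of the natural-language query.
    object_embs: (n, d) embeddings attached to mapped objects/regions.
    Returns the index of the best-matching object and all similarity scores.
    """
    q = query_emb / np.linalg.norm(query_emb)
    objs = object_embs / np.linalg.norm(object_embs, axis=1, keepdims=True)
    sims = objs @ q  # cosine similarities, shape (n,)
    return int(np.argmax(sims)), sims

def huber_weight(residual, delta=1.345):
    """IRLS weight of the Huber kernel (one possible robust kernel).

    Residuals below `delta` keep full weight; larger ones (e.g., observations
    on a moving object) are down-weighted proportionally to 1/|r|.
    """
    a = abs(residual)
    return 1.0 if a <= delta else delta / a
```

In an adaptive scheme such as the one the abstract describes, `delta` (or the kernel itself) would be tuned per factor rather than fixed, so that agent-displaced objects are suppressed during pose optimization without being discarded from the map.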
May 1, 2026