RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
April 28, 2026
Authors: Zaid Nasser, Mikhail Iumanov, Tianhao Li, Maxim Popov, Jaafar Mahmoud, Sergey Kolyubin
cs.AI
Abstract
We present RADIO-ViPE (Reduce All Domains Into One -- Video Pose Engine), an online semantic SLAM system that enables geometry-aware open-vocabulary grounding, associating arbitrary natural-language queries with localized 3D regions and objects in dynamic environments. Unlike existing approaches that require calibrated, posed RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams, requiring no prior camera intrinsics, depth sensors, or pose initialization. The system tightly couples multi-modal embeddings -- spanning vision and language -- derived from agglomerative foundation models (e.g., RADIO) with geometric scene information. This coupling is enforced during initialization, optimization, and factor-graph construction, using multi-modal consistency to improve map quality. The optimization is wrapped in adaptive robust kernels designed to handle both actively moving objects and agent-displaced scene elements (e.g., furniture rearranged during an ego-centric session). Experiments demonstrate that RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark while remaining competitive with offline open-vocabulary methods that rely on calibrated data and static-scene assumptions. RADIO-ViPE bridges a critical gap in real-world deployment, enabling robust open-vocabulary semantic grounding for autonomous robots and unconstrained in-the-wild video streams. Project page: https://be2rlab.github.io/radio_vipe
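The abstract does not detail the adaptive robust kernels, so the following is only a minimal illustrative sketch of the general idea, not the authors' implementation: a standard Huber kernel in iteratively-reweighted-least-squares (IRLS) form down-weights large residuals (such as those produced by moving or displaced objects), and an "adaptive" threshold can be set from the residual distribution itself. All function names and constants here are hypothetical.

```python
import numpy as np

def huber_weight(r, delta=1.0):
    """IRLS weight for the Huber kernel: w(r) = 1 for |r| <= delta,
    delta/|r| otherwise, so outlier residuals (e.g., from dynamic
    objects) are suppressed rather than dominating the fit."""
    a = np.abs(r)
    return np.where(a <= delta, 1.0, delta / np.maximum(a, 1e-12))

def adaptive_scale(r):
    """One common way to make the kernel 'adaptive': derive the
    threshold from a robust estimate (MAD) of the residual spread."""
    return 1.4826 * np.median(np.abs(r - np.median(r)))

# Toy example: three residuals from static scene points plus one large
# outlier residual from a moving object.
residuals = np.array([0.1, -0.3, 0.2, 5.0])
weights = huber_weight(residuals, delta=1.0)  # outlier gets weight 0.2
```

In a factor-graph optimizer these weights would rescale each factor's contribution at every iteration, which is the usual mechanism by which robust kernels tolerate dynamic-scene residuals.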